Chapter 10: Text Processing

One tenet of the Unix philosophy reads: "Write programs to handle text streams, because that is a universal interface." In this chapter, we master the tools that turn the shell into an industrial-strength data processing engine.

I. Regular Expressions (Regex)

A regular expression (regex) is a compact notation for describing patterns in text. Most text tools (grep, sed, awk) accept regular expressions.

Pattern	Meaning	Example
`^`	Start of a line	`^Error` (Starts with Error)
`$`	End of a line	`done$` (Ends with done)
`.`	Any single character	`b.t` (bit, bat, bet)
`[a-z]`	Any character in range	`[0-9]` (Any digit)
`*`	Zero or more of the previous item	`a*` (empty string, a, aa, aaa)
`+`	One or more of the previous item (extended regex, e.g. `grep -E`)	`a+` (a, aa, aaa)
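
The patterns in the table can be tried directly with grep. The following is a minimal sketch; the sample lines are made up for this demonstration. Note that `+` belongs to the extended syntax, so it is used here with `grep -E`.

```shell
# Sample input (invented for this demonstration)
sample=$(printf '%s\n' 'Error: disk full' 'bat' 'bt' 'baat' 'job done' 'id 42')

# ^ anchors the match to the start of the line
printf '%s\n' "$sample" | grep '^Error'     # prints: Error: disk full

# $ anchors the match to the end of the line
printf '%s\n' "$sample" | grep 'done$'      # prints: job done

# . must match exactly one character, so "bt" and "baat" are skipped
printf '%s\n' "$sample" | grep 'b.t'        # prints: bat

# * allows zero or more, so "bat", "bt" and "baat" all match
printf '%s\n' "$sample" | grep 'ba*t'

# + requires at least one "a", so "bt" is excluded (needs -E)
printf '%s\n' "$sample" | grep -E 'ba+t'

# [0-9] matches any single digit
printf '%s\n' "$sample" | grep '[0-9]'      # prints: id 42
```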

II. `grep`: The Pattern Finder

grep filters its input, printing only the lines that match a pattern.

# Search for "failed" in all logs, case-insensitive, with line numbers
grep -in "failed" /var/log/*.log

# Search for lines that DO NOT contain "success" (Invert search)
grep -v "success" results.txt

# Extended Regex (-E): Search for multiple patterns
grep -E "Error|Critical|Fatal" system.log

III. `sed`: The Stream Editor

sed is primarily used for search-and-replace, but it is a complete non-interactive editor.

1. Basic Substitution

# Replace 'old' with 'new' (first occurrence only)
sed 's/old/new/' file.txt

# Replace 'old' with 'new' globally (all occurrences)
sed 's/old/new/g' file.txt

2. Using Different Delimiters

If your text contains slashes (as URLs do), you can use almost any other single character as the delimiter (anything except a backslash or a newline).

# Standard (hard to read):
sed 's/http:\/\/localhost/https:\/\/api.production.com/g'

# Using '#' as delimiter (much cleaner):
sed 's#http://localhost#https://api.production.com#g'
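
Substitution is only one of sed's commands. As a rough sketch of the "complete editor" claim made at the start of this section, the example below uses the `d` (delete) and `p` (print) commands on a small, made-up configuration snippet.

```shell
# Hypothetical sample config, used only for this illustration
config=$(printf '%s\n' '# comment' 'host=localhost' '' 'port=8080')

# d deletes matching lines: drop comment lines and blank lines
printf '%s\n' "$config" | sed -e '/^#/d' -e '/^$/d'

# -n turns off automatic printing; p then prints only lines 2 through 4
printf '%s\n' "$config" | sed -n '2,4p'
```

An address such as `/^#/` or `2,4` selects which lines a command applies to; with no address, the command runs on every line.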

IV. `awk`: The Data Processor

awk splits each input line (a record) into fields, so a text file can be processed like a table of rows and columns.

1. Anatomy of an Awk Script

An awk script can have up to three parts, each of which is optional:

BEGIN: Runs once before any lines are read.
The Body: Runs for every line.
END: Runs once after all lines are processed.
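
The three parts can be seen together in a small sketch. The `name:score` records and the function name `summarize` are invented for this illustration.

```shell
# summarize is a hypothetical helper: it prints a header, each record, then an average
summarize() {
  awk '
    BEGIN { FS = ":"; print "Name Score" }   # runs once, before any input is read
          { total += $2; print $1, $2 }      # runs for every input line
    END   { print "Average:", total / NR }   # runs once, after the last line
  '
}

# Made-up colon-separated records: name:score
printf '%s\n' 'alice:90' 'bob:70' 'carol:80' | summarize
```

Here `FS` is set in BEGIN so that it takes effect before the first line is split, and `NR` in END holds the total number of lines read.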

2. Powerful Awk One-liners

# Print users with UID (column 3) greater than 1000 from /etc/passwd
awk -F: '$3 > 1000 {print $1, $3}' /etc/passwd

# Sum the sizes (column 5 of ls -l) of the entries in the current directory
ls -l | awk '{sum += $5} END {print "Total size:", sum}'

V. Sorting and Uniqueness

sort: Sorts lines alphabetically or numerically (-n).
uniq: Removes duplicate lines. Crucial: uniq only collapses duplicates that are adjacent, so sort the data first.
cut: Extracts specific characters or fields.
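
A short sketch of how these three tools combine, using made-up `user,shell` records. It also shows what happens when uniq is given unsorted input.

```shell
# Made-up comma-separated data: user,shell
data=$(printf '%s\n' 'ann,bash' 'bob,zsh' 'cat,bash')

# cut: -d sets the delimiter, -f picks the field (here the shell column)
printf '%s\n' "$data" | cut -d, -f2

# Unsorted: the two "bash" lines are not adjacent, so uniq keeps both
printf '%s\n' "$data" | cut -d, -f2 | uniq

# Sorted first: duplicates become adjacent and uniq collapses them
printf '%s\n' "$data" | cut -d, -f2 | sort | uniq
```

The last pipeline prints `bash` and `zsh` once each, while the middle one still prints `bash` twice.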

# Find the top 5 most frequent IP addresses in an access log
awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -n 5

In the next chapter, we'll learn how to monitor and control system resources using Process Management.
