Chapter 10: Text Processing
The Unix philosophy states: "Write programs that handle text streams, because that is a universal interface." In this chapter, we master the tools that make the shell an industrial-strength data processing engine.
I. Regular Expressions (Regex)
Regex is a compact language for describing patterns in text. Most text-processing tools (grep, sed, awk) understand it.
| Pattern | Meaning | Example |
|---|---|---|
| ^ | Start of a line | ^Error (starts with "Error") |
| $ | End of a line | done$ (ends with "done") |
| . | Any single character | b.t (bit, bat, bet) |
| [a-z] | Any character in range | [0-9] (any digit) |
| * | Zero or more of the previous | a* (empty string, a, aa, aaa) |
| + | One or more of the previous (extended regex, grep -E) | a+ (a, aa, aaa) |
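The patterns in the table can be tried out directly with grep. A minimal sketch, using a throwaway sample file created for illustration:

```shell
# Build a small sample file (contents are illustrative)
printf 'Error: disk full\nbackup done\nbit bat bet\n' > sample.txt

# Lines starting with "Error"
grep '^Error' sample.txt     # matches "Error: disk full"

# Lines ending with "done"
grep 'done$' sample.txt      # matches "backup done"

# "b", any single character, then "t"
grep 'b.t' sample.txt        # matches "bit bat bet"
```

Note that + only works this way under extended regex (grep -E); basic grep treats a bare + as a literal character.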
II. grep: The Pattern Finder
grep filters text: it reads input and prints only the lines that match a pattern.
# Search for "failed" in all logs, case-insensitive, with line numbers
grep -in "failed" /var/log/*.log
# Search for lines that DO NOT contain "success" (Invert search)
grep -v "success" results.txt
# Extended Regex (-E): Search for multiple patterns
grep -E "Error|Critical|Fatal" system.log
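A few more grep flags are worth knowing; the file and directory names here are hypothetical:

```shell
# Count matching lines instead of printing them
grep -c "failed" auth.log

# Show 2 lines of context after each match
grep -A 2 "panic" kernel.log

# Search recursively through a directory tree
grep -r "TODO" src/
```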
III. sed: The Stream Editor
sed is primarily used for search-and-replace, but it is a complete non-interactive editor.
1. Basic Substitution
# Replace 'old' with 'new' (first occurrence only)
sed 's/old/new/' file.txt
# Replace 'old' with 'new' globally (all occurrences)
sed 's/old/new/g' file.txt
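The substitute command takes a few common variations beyond the basic and global forms; file.txt is a placeholder name:

```shell
# Replace only the 2nd occurrence on each line
sed 's/old/new/2' file.txt

# Edit the file in place (GNU sed; BSD/macOS sed needs: sed -i '' ...)
sed -i 's/old/new/g' file.txt

# Restrict the substitution to lines 5 through 10
sed '5,10s/old/new/g' file.txt
```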
2. Using Different Delimiters
If your text contains slashes (like URLs), you can use any character as a delimiter.
# Standard (hard to read):
sed 's/http:\/\/localhost/https:\/\/api.production.com/g'
# Using '#' as delimiter (much cleaner):
sed 's#http://localhost#https://api.production.com#g'
IV. awk: The Data Processor
awk treats a file as a structured database of rows and columns.
1. Anatomy of an Awk Script
An awk script consists of three parts:
- BEGIN: Runs once, before any lines are read.
- The body: Runs for every input line.
- END: Runs once, after all lines are processed.
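All three parts can be seen working together in one command. A small sketch that averages a column of numbers (the input values are illustrative):

```shell
printf '10\n20\n30\n' | awk '
  BEGIN { print "Computing average..." }    # runs once, before any input
        { sum += $1; count++ }              # runs for every line
  END   { print "Average:", sum / count }   # runs once, after all input
'
# Prints:
# Computing average...
# Average: 20
```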
2. Powerful Awk One-liners
# Print users with UID (column 3) greater than 1000 from /etc/passwd
awk -F: '$3 > 1000 {print $1, $3}' /etc/passwd
# Sum the sizes of all files in a directory
ls -l | awk '{sum += $5} END {print "Total size:", sum}'
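The ls one-liner above is fine for quick interactive checks, but parsing ls output is fragile (it can break on unusual file names). A more robust sketch, assuming GNU find, which provides the -printf action:

```shell
# Sum the sizes of regular files in the current directory (GNU find)
find . -maxdepth 1 -type f -printf '%s\n' | awk '{sum += $1} END {print "Total size:", sum}'
```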
V. Sorting and Uniqueness
- sort: Sorts lines alphabetically or numerically (-n).
- uniq: Removes duplicate lines. Crucial: uniq only works on sorted data.
- cut: Extracts specific characters or fields.
# Find the top 5 most frequent IP addresses in an access log
awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -n 5
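cut pairs naturally with these pipelines. /etc/passwd, with its colon-separated fields, makes a convenient demonstration:

```shell
# Username (field 1) and login shell (field 7) from /etc/passwd
cut -d: -f1,7 /etc/passwd

# First ten characters of every line
cut -c1-10 /etc/passwd
```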
In the next chapter, we'll learn how to monitor and control system resources using Process Management.