
Chapter 10: Text Processing

The Unix philosophy states: "Write programs that handle text streams, because that is a universal interface." In this chapter, we master the tools that make the shell an industrial-strength data processing engine.

I. Regular Expressions (Regex)

Regex is a language for describing patterns in text. Most text tools (grep, sed, awk) use them.

Pattern   Meaning                     Example
-------   -------------------------   ---------------------------
^         Start of a line             ^Error (starts with Error)
$         End of a line               done$ (ends with done)
.         Any single character        b.t (bit, bat, bet)
[a-z]     Any character in a range    [0-9] (any digit)
*         Zero or more of previous    a* (empty, a, aa, aaa)
+         One or more of previous     a+ (a, aa, aaa)
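A quick way to check these patterns is to pipe sample text through grep (the input lines below are made up for illustration). Note that + belongs to the extended regex syntax, so it needs grep -E; the other patterns work with basic regex as well:

```shell
# ^Error matches only lines that begin with "Error"
printf 'Error: disk full\ntask done\n' | grep -E '^Error'

# b.t matches any character in the middle position: bit and bat, but not bug
printf 'bit\nbat\nbug\n' | grep 'b.t'
```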

II. grep: The Pattern Finder

grep filters input, printing only the lines that match a pattern.

# Search for "failed" in all logs, case-insensitive, with line numbers
grep -in "failed" /var/log/*.log

# Search for lines that DO NOT contain "success" (Invert search)
grep -v "success" results.txt

# Extended Regex (-E): Search for multiple patterns
grep -E "Error|Critical|Fatal" system.log

III. sed: The Stream Editor

sed is primarily used for search-and-replace, but it is a complete non-interactive editor.

1. Basic Substitution

# Replace 'old' with 'new' (first occurrence only)
sed 's/old/new/' file.txt

# Replace 'old' with 'new' globally (all occurrences)
sed 's/old/new/g' file.txt
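The difference between the two forms is easy to see on a line with several matches; without the g flag, sed stops after the first match on each line:

```shell
# Without g: only the first 'old' on the line is replaced
echo "old old old" | sed 's/old/new/'     # new old old

# With g: every 'old' on the line is replaced
echo "old old old" | sed 's/old/new/g'    # new new new
```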

2. Using Different Delimiters

If your text contains slashes (like URLs), you can use any character as a delimiter.

# Standard (hard to read):
sed 's/http:\/\/localhost/https:\/\/api.production.com/g'

# Using '#' as delimiter (much cleaner):
sed 's#http://localhost#https://api.production.com#g'

IV. awk: The Data Processor

awk treats a file as a structured database of rows and columns.

1. Anatomy of an Awk Script

An awk script consists of three parts:

  1. BEGIN: Runs once before any lines are read.
  2. The Body: Runs for every line.
  3. END: Runs once after all lines are processed.

  BEGIN { ... }           # before row 1
  /pattern/ { action }    # for rows 1..n: item1 item2 ...
  END { ... }             # after row n
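All three parts can appear in a single command. In this sketch (the numeric rows are made-up sample input), BEGIN prints once, the body accumulates a running sum per row, and END reports the result:

```shell
printf '1 10\n2 20\n3 30\n' | awk '
    BEGIN { print "header" }     # runs once, before any row is read
    { sum += $2 }                # body: runs for every row
    END   { print "sum:", sum }  # runs once, after the last row
'
```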

2. Powerful Awk One-liners

# Print users with UID (column 3) greater than 1000 from /etc/passwd
awk -F: '$3 > 1000 {print $1, $3}' /etc/passwd

# Sum the sizes of all files in a directory
ls -l | awk '{sum += $5} END {print "Total size:", sum}'
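In the same style, the body can collect several running values for END to combine. This sketch (the names and scores are invented sample data) computes an average:

```shell
# Sum column 2 across all rows, count the rows, then divide in END
printf 'alice 85\nbob 90\ncarol 95\n' \
  | awk '{ total += $2; n++ } END { print "avg:", total / n }'
```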

V. Sorting and Uniqueness

  • sort: Sorts lines alphabetically or numerically (-n).
  • uniq: Removes duplicate lines. Crucial: uniq only removes adjacent duplicates, so the input must be sorted first.
  • cut: Extracts specific characters or fields.

# Find the top 5 most frequent IP addresses in an access log
awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -n 5
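cut deserves its own small example. Here -d sets the field delimiter and -f selects fields by number (the /etc/passwd-style line below is illustrative):

```shell
# Extract the username (field 1) and login shell (field 7)
echo 'root:x:0:0:root:/root:/bin/bash' | cut -d: -f1,7
# root:/bin/bash
```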

In the next chapter, we'll learn how to monitor and control system resources using Process Management.