NANDHOO.

Chapter 10: Text Processing


The Unix philosophy states: "Write programs to handle text streams, because that is a universal interface." In this chapter, we master the tools that make the shell an industrial-strength data processing engine.


I. Regular Expressions (Regex)


Regex is a language for describing patterns in text. Most text tools (grep, sed, awk) use it.


Pattern   Meaning                    Example
^         Start of a line            ^Error (Starts with Error)
$         End of a line              done$ (Ends with done)
.         Any single character       b.t (bit, bat, bet)
[a-z]     Any character in range     [0-9] (Any digit)
*         Zero or more of previous   a* (empty string, a, aa, aaa)
+         One or more of previous    a+ (a, aa, aaa)

Note: + is part of the extended regex syntax, so grep needs the -E flag to treat it as a repetition operator.
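
To make the table concrete, here is a small sketch that runs each pattern through grep. The word list is made up for illustration, and the variable names are arbitrary.

```shell
# Sample input (made up for illustration), one entry per line
words=$(printf '%s\n' 'Error: disk full' 'job done' 'bit' 'bat' 'boot' 'b2t' 'No Error here')

# ^ anchors the match to the start of the line
starts=$(printf '%s\n' "$words" | grep '^Error')

# $ anchors the match to the end of the line
ends=$(printf '%s\n' "$words" | grep 'done$')

# . matches exactly one character, so "boot" does not match ^b.t$
dot=$(printf '%s\n' "$words" | grep '^b.t$')

# [0-9] matches any single digit
digit=$(printf '%s\n' "$words" | grep '[0-9]')

# + needs extended syntax (-E): one or more "o" between b and t
plus=$(printf '%s\n' "$words" | grep -E '^bo+t$')

printf '%s\n' "$starts" "$ends" "$dot" "$digit" "$plus"
```

Note that "No Error here" is not selected by ^Error, because the anchor ties the match to the beginning of the line.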

II. grep: The Pattern Finder


grep filters lines: it prints only the lines of its input that match a pattern.


# Search for "failed" in all logs, case-insensitive, with line numbers
grep -in "failed" /var/log/*.log

# Search for lines that do NOT contain "success" (inverted search)
grep -v "success" results.txt


# Extended regex (-E): search for multiple patterns
grep -E "Error|Critical|Fatal" system.log
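
The commands above assume log files that may not exist on your machine. As a self-contained sketch, the following writes a throwaway log (contents invented for illustration) and applies the same options to it. The -c flag, which counts matching lines instead of printing them, is an addition not used above.

```shell
# Throwaway log with made-up contents
log=$(mktemp)
printf '%s\n' 'INFO start' 'Error: disk full' 'login FAILED for bob' \
    'Fatal: out of memory' 'INFO success' > "$log"

# -i ignores case, -n prefixes each match with its line number
failed=$(grep -in 'failed' "$log")

# -v keeps only the lines that do not match
not_success=$(grep -v 'success' "$log")

# -E enables alternation with |, -c prints the number of matching lines
severe_count=$(grep -Ec 'Error|Critical|Fatal' "$log")

echo "$failed"
echo "severe lines: $severe_count"
rm -f "$log"
```

Because only one file is searched, grep does not prefix matches with a file name. With a glob such as /var/log/*.log that matches several files, each match is prefixed with the name of the file it came from.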


III. sed: The Stream Editor


sed is primarily used for search-and-replace, but it is a complete non-interactive editor.


1. Basic Substitution

# Replace 'old' with 'new' (first occurrence on each line only)
sed 's/old/new/' file.txt

# Replace 'old' with 'new' globally (all occurrences on each line)
sed 's/old/new/g' file.txt
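
The difference between the two forms is easiest to see on a line with several matches. This is a minimal sketch using an invented sample line instead of file.txt.

```shell
# One made-up line containing three occurrences of "old"
line='old pipes, old valves, old gauges'

# Without g, only the first match on the line is replaced
first=$(printf '%s\n' "$line" | sed 's/old/new/')

# With g, every match on the line is replaced
all=$(printf '%s\n' "$line" | sed 's/old/new/g')

printf '%s\n' "$first" "$all"
```

In both cases sed writes the result to standard output and leaves its input untouched.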


2. Using Different Delimiters

If your text contains slashes (like URLs), you can use almost any other character as the delimiter; sed treats whatever character follows the s as the separator.

# Standard (hard to read):
sed 's/http:\/\/localhost/https:\/\/api.production.com/g'

# Using '#' as delimiter (much cleaner):
sed 's#http://localhost#https://api.production.com#g'
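
As a quick check that the delimiter is purely cosmetic, this sketch feeds one invented configuration line through three equivalent commands. The | variant is another commonly used choice, shown here as an extra illustration.

```shell
# Made-up configuration line containing a URL
config='endpoint=http://localhost/v1/users'

# Same substitution written with three different delimiters
slashes=$(printf '%s\n' "$config" | sed 's/http:\/\/localhost/https:\/\/api.production.com/g')
hashes=$(printf '%s\n' "$config" | sed 's#http://localhost#https://api.production.com#g')
pipes=$(printf '%s\n' "$config" | sed 's|http://localhost|https://api.production.com|g')

printf '%s\n' "$slashes" "$hashes" "$pipes"
```

All three commands produce the same line. Pick a delimiter that does not appear in either the pattern or the replacement, otherwise it has to be escaped just like the slash.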


IV. awk: The Data Processor


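awk treats its input as a table: each line is a row (record) that is split into columns (fields). Fields are addressed as $1, $2, and so on, $0 is the whole line, and -F sets the field separator (whitespace by default).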


1. Anatomy of an Awk Script

An awk script can have up to three parts:

  1. BEGIN: Runs once before any lines are read.
  2. The Body: Runs for every line.
  3. END: Runs once after all lines are processed.

BEGIN { }
BODY  { /pattern/ action }
        Row 1: item1 item2 ...
        Row 2: item1 item2 ...
        Row n: item1 item2 ...
END   { }
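
The three parts can be sketched in one short script. The inventory data and the count variable are invented for illustration; NR is awk's built-in count of records read so far.

```shell
# Made-up inventory: item name, then quantity
inventory=$(printf '%s\n' 'apples 3' 'pears 0' 'plums 7')

report=$(printf '%s\n' "$inventory" | awk '
    BEGIN  { print "In stock:" }                # runs once, before any input
    $2 > 0 { print "  " $1; count++ }           # runs for each line with quantity > 0
    END    { print count " of " NR " items" }   # runs once, after the last line
')

printf '%s\n' "$report"
```

The body rule here uses a comparison ($2 > 0) as its pattern instead of a /regex/; awk accepts either form in front of an action.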


2. Powerful Awk One-liners

# Print users with UID (column 3) greater than 1000 from /etc/passwd
awk -F: '$3 > 1000 {print $1, $3}' /etc/passwd

# Sum the sizes of all files in a directory
ls -l | awk '{sum += $5} END {print "Total size:", sum}'
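
Since /etc/passwd and the output of ls -l differ from machine to machine, here is a sketch that runs the same two one-liners on fixed sample data. Every record below is invented; only the column layout mirrors the real formats.

```shell
# Made-up records in /etc/passwd layout (name:password:UID:GID:comment:home:shell)
passwd_sample=$(printf '%s\n' \
    'root:x:0:0:root:/root:/bin/sh' \
    'alice:x:1001:1001:Alice:/home/alice:/bin/bash' \
    'bob:x:1002:1002:Bob:/home/bob:/bin/sh')

# -F: splits on colons; the pattern $3 > 1000 selects which rows are printed
regular_users=$(printf '%s\n' "$passwd_sample" | awk -F: '$3 > 1000 {print $1, $3}')

# Made-up listing in "ls -l" layout; the size is the fifth column
listing=$(printf '%s\n' \
    'total 12' \
    '-rw-r--r-- 1 alice staff 1200 Jan  1 10:00 a.txt' \
    '-rw-r--r-- 1 alice staff  800 Jan  1 10:05 b.txt')

# The "total" line has no fifth field, so it adds nothing to the sum
total=$(printf '%s\n' "$listing" | awk '{sum += $5} END {print "Total size:", sum}')

printf '%s\n' "$regular_users" "$total"
```

The comma in print $1, $3 inserts awk's output field separator, a single space by default, between the two values.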


V. Sorting and Uniqueness


  • sort: Sorts lines alphabetically or numerically (-n).
  • uniq: Removes duplicate lines. Crucial: uniq only compares adjacent lines, so sort the data first.
  • cut: Extracts specific characters or fields.

# Find the top 5 most frequent IP addresses in an access log
awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -n 5
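
To see why sorting matters before uniq, the following sketch runs the pipeline on a small invented access log, using cut in place of awk to extract the first field.

```shell
# Made-up access log: client IP, then request path
access=$(printf '%s\n' \
    '10.0.0.1 /index.html' \
    '10.0.0.2 /about.html' \
    '10.0.0.1 /contact.html' \
    '10.0.0.3 /index.html' \
    '10.0.0.1 /index.html' \
    '10.0.0.2 /index.html')

# cut -d' ' -f1 extracts the first space-separated field (the IP)
ips=$(printf '%s\n' "$access" | cut -d' ' -f1)

# Unsorted: no two equal IPs are adjacent, so uniq removes nothing
unsorted=$(printf '%s\n' "$ips" | uniq | wc -l | tr -d ' ')

# Sorted first: duplicates become adjacent and each IP is kept once
distinct=$(printf '%s\n' "$ips" | sort | uniq | wc -l | tr -d ' ')

# Most frequent IP: count, sort by count descending, keep the top line
top=$(printf '%s\n' "$ips" | sort | uniq -c | sort -rn | head -n 1 | awk '{print $2, $1}')

echo "lines after uniq without sort: $unsorted"
echo "distinct IPs: $distinct"
echo "most frequent: $top"
```

The trailing awk only swaps the two columns of the uniq -c output so that the IP comes first; it is a presentation choice, not part of the counting.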

In the next chapter, we'll learn how to monitor and control system resources using Process Management.