Lab 19: Text Processing — grep, awk, sed

Time: 30 minutes | Level: Practitioner | Docker: docker run -it --rm ubuntu:22.04 bash


Overview

The Unix text processing triad — grep, awk, and sed — is your Swiss Army knife for log analysis, data transformation, and automation. In this lab you will master extended regular expressions, field-based processing, and stream editing, then build real-world log analysis pipelines.

Prerequisites: Docker installed, Labs 01–15 completed.


Step 1: Set Up Sample Log Data

All exercises use a realistic web access log. Let's create it first.

docker run -it --rm ubuntu:22.04 bash

cat > /tmp/access.log << 'EOF'
192.168.1.10 - alice [15/Jan/2024:08:00:01 +0000] "GET /api/users HTTP/1.1" 200 1234
192.168.1.20 - bob [15/Jan/2024:08:00:02 +0000] "POST /api/login HTTP/1.1" 401 89
10.0.0.5 - - [15/Jan/2024:08:00:03 +0000] "GET /api/reports HTTP/1.1" 500 456
192.168.1.10 - alice [15/Jan/2024:08:00:04 +0000] "DELETE /api/users/42 HTTP/1.1" 403 78
10.0.0.6 - charlie [15/Jan/2024:08:00:05 +0000] "GET /health HTTP/1.1" 200 12
192.168.1.30 - dave [15/Jan/2024:08:00:06 +0000] "PUT /api/config HTTP/1.1" 200 567
192.168.1.20 - bob [15/Jan/2024:08:00:07 +0000] "GET /api/users HTTP/1.1" 200 1190
10.0.0.7 - eve [15/Jan/2024:08:00:08 +0000] "POST /api/upload HTTP/1.1" 413 234
192.168.1.10 - alice [15/Jan/2024:08:00:09 +0000] "GET /api/export HTTP/1.1" 200 98765
10.0.0.5 - - [15/Jan/2024:08:00:10 +0000] "GET /api/reports HTTP/1.1" 500 456
EOF

wc -l /tmp/access.log
echo "Log created successfully"

📸 Verified Output:

10 /tmp/access.log
Log created successfully

💡 Keep a test dataset for practicing text processing. Real log files can be huge. Before running sed -i (in-place edit) on a production log, always test on a copy. Use cp /var/log/nginx/access.log /tmp/test.log to make a safe copy.


Step 2: grep — Pattern Matching Mastery

grep filters lines matching a pattern. -E enables Extended Regular Expressions (ERE); -P enables Perl-compatible regex (PCRE).
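
For example, against the Step 1 log (a three-line subset is recreated here only if the log is missing, so the snippet also runs standalone):

```shell
# Recreate a small sample only if the Step 1 log is missing
[ -f /tmp/access.log ] || cat > /tmp/access.log << 'EOF'
192.168.1.10 - alice [15/Jan/2024:08:00:01 +0000] "GET /api/users HTTP/1.1" 200 1234
192.168.1.20 - bob [15/Jan/2024:08:00:02 +0000] "POST /api/login HTTP/1.1" 401 89
10.0.0.5 - - [15/Jan/2024:08:00:03 +0000] "GET /api/reports HTTP/1.1" 500 456
EOF

# ERE alternation: requests using POST or DELETE
grep -E '"(POST|DELETE) ' /tmp/access.log

# Count error responses (status 4xx/5xx after the closing quote)
grep -cE '" [45][0-9]{2} ' /tmp/access.log

# -o prints only the match: extract just the HTTP methods
grep -oE '\b(GET|POST|PUT|DELETE)\b' /tmp/access.log
```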

📸 Verified Output:

grep flags reference:

Flag    Meaning
-E      Extended regex (alternation |, +, ?, {n})
-P      Perl regex (lookaheads, \d, \s, \b)
-i      Case-insensitive
-v      Invert match
-c      Count matching lines
-n      Show line numbers
-l      List files with matches
-r      Recursive directory search
-o      Print only the matched part
-A 3    Show 3 lines after match
-B 2    Show 2 lines before match
-C 2    Show 2 lines of context (before + after)

💡 grep -o extracts just the matching part. Combined with sort and uniq, it's powerful: grep -oE '\b[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\b' access.log | sort | uniq -c | sort -rn extracts and counts all IP addresses.


Step 3: awk — Field-Based Processing

awk processes text field by field. It's a complete programming language built for columnar data.
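
To illustrate against the Step 1 log (a small subset is recreated only if the log is missing): fields split on whitespace, so for this log $1 is the IP, $7 the path, $9 the status, and $10 the byte count.

```shell
# Recreate a small sample only if the Step 1 log is missing
[ -f /tmp/access.log ] || cat > /tmp/access.log << 'EOF'
192.168.1.10 - alice [15/Jan/2024:08:00:01 +0000] "GET /api/users HTTP/1.1" 200 1234
192.168.1.20 - bob [15/Jan/2024:08:00:02 +0000] "POST /api/login HTTP/1.1" 401 89
10.0.0.5 - - [15/Jan/2024:08:00:03 +0000] "GET /api/reports HTTP/1.1" 500 456
EOF

# Print selected fields: IP and status code
awk '{print $1, $9}' /tmp/access.log

# Pattern + action: only error responses (status >= 400)
awk '$9 >= 400 {print $1, $7, $9}' /tmp/access.log

# Built-in variables: NR = record number, NF = field count
awk '{print NR": "NF" fields"}' /tmp/access.log
```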

📸 Verified Output:

💡 awk uses $0 for the entire line, $1 through $NF for fields, NR for line number, NF for field count. Change the field separator with -F ':' for colon-delimited files (like /etc/passwd): awk -F: '{print $1, $3}' /etc/passwd prints usernames and UIDs.


Step 4: awk Advanced — Aggregation & Reporting
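
One way to sketch the aggregation pattern (again assuming the Step 1 log; a subset is recreated only if it is missing):

```shell
# Recreate a small sample only if the Step 1 log is missing
[ -f /tmp/access.log ] || cat > /tmp/access.log << 'EOF'
192.168.1.10 - alice [15/Jan/2024:08:00:01 +0000] "GET /api/users HTTP/1.1" 200 1234
192.168.1.20 - bob [15/Jan/2024:08:00:02 +0000] "POST /api/login HTTP/1.1" 401 89
10.0.0.5 - - [15/Jan/2024:08:00:03 +0000] "GET /api/reports HTTP/1.1" 500 456
EOF

# Requests per status code, accumulated in an associative array keyed on $9
awk '{count[$9]++} END {for (s in count) print s, count[s]}' /tmp/access.log | sort

# Total bytes per IP ($10), largest first
awk '{bytes[$1] += $10} END {for (ip in bytes) print bytes[ip], ip}' /tmp/access.log | sort -rn
```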

📸 Verified Output:

💡 awk arrays are associative (hash maps). You can accumulate any key-value data: awk '{sum[$1]+=$10} END{for(ip in sum) print ip, sum[ip]}' access.log gives total bytes per IP. Arrays are automatically created when first referenced — no declaration needed.


Step 5: sed — Stream Editor for Transformations

sed edits text streams line by line using commands.
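
A few small, self-contained examples of the core commands (substitute, print, delete):

```shell
# s/// replaces the first occurrence per line; the g flag replaces all
echo "foo bar foo" | sed 's/foo/baz/'     # baz bar foo
echo "foo bar foo" | sed 's/foo/baz/g'    # baz bar baz

# -n suppresses default output; p prints matching lines (grep-like)
printf 'alpha\nbeta\ngamma\n' | sed -n '/beta/p'

# d deletes matching lines
printf 'alpha\nbeta\ngamma\n' | sed '/beta/d'
```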

📸 Verified Output:

💡 sed -i edits files in-place — always test first without -i. On macOS, sed -i requires an extension argument: sed -i '' 's/old/new/' file. On Linux, sed -i 's/old/new/' file works directly. Use sed -i.bak to create a backup before editing.


Step 6: sed Advanced — In-place Editing & Config Management
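
A sketch of safe in-place editing against a hypothetical config file (the file name and keys are made up for illustration; GNU sed syntax, as in the Ubuntu container):

```shell
# Hypothetical config file to practice on
cat > /tmp/app.conf << 'EOF'
port=8080
debug=true
log_path=/var/log/app.log
EOF

# In-place edit with a .bak backup: turn debug off
sed -i.bak 's/^debug=true/debug=false/' /tmp/app.conf

# Alternate delimiter | avoids escaping the slashes in paths
sed -i 's|/var/log/app.log|/tmp/app.log|' /tmp/app.conf

grep -E '^(debug|log_path)=' /tmp/app.conf
cat /tmp/app.conf.bak   # the original content survives in the backup
```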

📸 Verified Output:

💡 Use | as a delimiter in sed when the pattern contains /. sed 's|/old/path|/new/path|g' avoids escaping slashes. You can use any character: sed 's#old#new#g' works too. This is essential when editing file paths or URLs.


Step 7: Combining grep + awk + sed in Pipelines
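
For instance, one incremental pipeline that answers "which clients are generating errors?" (the Step 1 sample is recreated only if missing):

```shell
# Recreate a small sample only if the Step 1 log is missing
[ -f /tmp/access.log ] || cat > /tmp/access.log << 'EOF'
192.168.1.10 - alice [15/Jan/2024:08:00:01 +0000] "GET /api/users HTTP/1.1" 200 1234
192.168.1.20 - bob [15/Jan/2024:08:00:02 +0000] "POST /api/login HTTP/1.1" 401 89
10.0.0.5 - - [15/Jan/2024:08:00:03 +0000] "GET /api/reports HTTP/1.1" 500 456
EOF

# Stage 1: grep keeps error lines; Stage 2: awk extracts the IP;
# Stage 3: sort + uniq -c counts per IP; Stage 4: sed trims leading spaces
grep -E '" [45][0-9]{2} ' /tmp/access.log \
  | awk '{print $1}' \
  | sort | uniq -c | sort -rn \
  | sed 's/^ *//'
```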

📸 Verified Output:

💡 Build pipelines incrementally. Start with cat file, add | grep pattern, check output, add | awk ..., check again. Never write a 5-stage pipeline from scratch — build and verify each stage. Use | head -5 to preview without processing everything.


Step 8: Capstone — Complete Log Analysis Script

Scenario: Build a production-ready log analyzer that generates an HTML-friendly report.
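
A minimal sketch of such an analyzer, assuming the Step 1 log format (the script path and report wording here are illustrative, not prescribed):

```shell
# Recreate a small sample only if the Step 1 log is missing
[ -f /tmp/access.log ] || cat > /tmp/access.log << 'EOF'
192.168.1.10 - alice [15/Jan/2024:08:00:01 +0000] "GET /api/users HTTP/1.1" 200 1234
192.168.1.20 - bob [15/Jan/2024:08:00:02 +0000] "POST /api/login HTTP/1.1" 401 89
10.0.0.5 - - [15/Jan/2024:08:00:03 +0000] "GET /api/reports HTTP/1.1" 500 456
EOF

cat > /tmp/report.sh << 'SCRIPT'
#!/bin/sh
# Minimal log analyzer: summarizes the access log given as $1
LOG="$1"
echo "== Log Report: $LOG =="
echo "Total requests: $(wc -l < "$LOG")"
echo "-- Requests per status code --"
awk '{count[$9]++} END {for (s in count) print "  " s ": " count[s]}' "$LOG" | sort
echo "-- Error details (IP, path, status) --"
grep -E '" [45][0-9]{2} ' "$LOG" | awk '{print "  " $1, $7, $9}'
SCRIPT
chmod +x /tmp/report.sh

/tmp/report.sh /tmp/access.log
```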

📸 Verified Output:

💡 This script is a foundation for a real log monitoring tool. Add --since filtering with awk '$4 > "[15/Jan/2024:08:00:05"', email output with | mail -s "Log Report" [email protected], or schedule with cron. The grep+awk+sed combination handles any structured text file — Apache logs, nginx logs, custom app logs.


Summary

Tool                 Best For                                    Key Flags / Example
grep                 Finding lines matching a pattern            -E (ERE), -P (PCRE), -v (invert), -c (count), -n (line nums), -o (match only)
grep -E              Extended regex: +, ?, |, {n}                'pat1|pat2'
grep -P              Perl regex: \d, \s, \b, lookaheads          '\b[45]\d{2}\b'
awk                  Field processing, aggregation, reporting    -F (delimiter), BEGIN/END, arrays
awk patterns         Conditional processing                      $9 >= 400 {print}
sed 's/a/b/g'        Global substitution                         g = all occurrences
sed -n '/pat/p'      Print only matching lines                   -n suppresses default output
sed '/pat/d'         Delete matching lines
sed -i               In-place file edit                          Use a .bak extension for safety
| (pipe)             Chaining tools together                     Build incrementally
