Chapter 16: Compression Algorithms (RLE, Huffman Coding)

Introduction

Data compression reduces file sizes by identifying and eliminating redundancy. This chapter explores fundamental compression algorithms including Run-Length Encoding (RLE), Huffman Coding, and dictionary-based methods. Understanding compression is essential for efficient storage, faster transmission, and optimal resource utilization.

Why This Matters

Compression powers everyday technology:

File formats: ZIP, PNG, JPEG, MP3, MP4
Web performance: GZIP compression for HTTP
Video streaming: Netflix, YouTube compression
Databases: Compressed storage saves space
Network protocols: Reduced bandwidth usage
Backups: Efficient storage of archival data
Text processing: Log file compression

How to Study This Chapter

Understand lossless vs lossy - When data must be preserved exactly
Trace compression by hand - Work through encoding/decoding examples
Implement algorithms - RLE, Huffman, LZW
Measure compression ratios - Compare algorithm effectiveness
Analyze trade-offs - Compression ratio vs speed vs memory

Lossless vs Lossy Compression

Lossless Compression

Lossless: Original data can be perfectly reconstructed.

Examples: ZIP, PNG, FLAC, Huffman, RLE

Use cases: Text files, executables, medical images

Lossy Compression

Lossy: Some information is permanently lost.

Examples: JPEG, MP3, MP4

Use cases: Photos, music, video (where perfect accuracy isn't critical)

Run-Length Encoding (RLE)

RLE compresses sequences of repeated values.

Algorithm

Encode: Replace runs of identical values with (count, value)

"AAAAAABBBCCCCC" → "6A3B5C"

When RLE Works Well

Images with large uniform areas
Simple graphics
Fax transmissions
Screen recordings with static regions

C Implementation

#include <stdio.h>
#include <string.h>

void rleEncode(const char* input, char* output) {
    int len = strlen(input);
    int j = 0;

    for (int i = 0; i < len; i++) {
        int count = 1;

        // Count consecutive identical characters
        while (i + 1 < len && input[i] == input[i + 1]) {
            count++;
            i++;
        }

        // Write count and character
        j += sprintf(output + j, "%d%c", count, input[i]);
    }

    output[j] = '\0';
}

void rleDecode(const char* input, char* output) {
    int len = strlen(input);
    int j = 0;

    for (int i = 0; i < len; ) {
        int count = 0;

        // Read count
        while (i < len && input[i] >= '0' && input[i] <= '9') {
            count = count * 10 + (input[i] - '0');
            i++;
        }

        // Read character and repeat
        if (i < len) {
            char ch = input[i];
            for (int k = 0; k < count; k++) {
                output[j++] = ch;
            }
            i++;
        }
    }

    output[j] = '\0';
}

int main() {
    char input[] = "AAAAAABBBCCCCC";
    char encoded[100];
    char decoded[100];

    printf("Original: %s\n", input);

    rleEncode(input, encoded);
    printf("Encoded: %s\n", encoded);
    printf("Compression ratio: %.2f%%\n",
           100.0 * strlen(encoded) / strlen(input));

    rleDecode(encoded, decoded);
    printf("Decoded: %s\n", decoded);

    return 0;
}

C++ Implementation

#include <iostream>
#include <string>
#include <sstream>
using namespace std;

string rleEncode(const string& input) {
    if (input.empty()) return "";

    stringstream encoded;
    int count = 1;

    for (size_t i = 1; i <= input.length(); i++) {
        if (i < input.length() && input[i] == input[i - 1]) {
            count++;
        }
        else {
            encoded << count << input[i - 1];
            count = 1;
        }
    }

    return encoded.str();
}

string rleDecode(const string& input) {
    stringstream decoded;
    int count = 0;

    for (char ch : input) {
        if (isdigit(ch)) {
            count = count * 10 + (ch - '0');
        }
        else {
            for (int i = 0; i < count; i++) {
                decoded << ch;
            }
            count = 0;
        }
    }

    return decoded.str();
}

int main() {
    string input = "AAAAAABBBCCCCC";

    cout << "Original: " << input << endl;

    string encoded = rleEncode(input);
    cout << "Encoded: " << encoded << endl;
    cout << "Compression ratio: " << (100.0 * encoded.length() / input.length()) << "%" << endl;

    string decoded = rleDecode(encoded);
    cout << "Decoded: " << decoded << endl;

    return 0;
}

Java Implementation

public class RunLengthEncoding {
    public static String encode(String input) {
        if (input == null || input.isEmpty()) {
            return "";
        }

        StringBuilder encoded = new StringBuilder();
        int count = 1;

        for (int i = 1; i <= input.length(); i++) {
            if (i < input.length() && input.charAt(i) == input.charAt(i - 1)) {
                count++;
            } else {
                encoded.append(count).append(input.charAt(i - 1));
                count = 1;
            }
        }

        return encoded.toString();
    }

    public static String decode(String input) {
        StringBuilder decoded = new StringBuilder();
        int count = 0;

        for (char ch : input.toCharArray()) {
            if (Character.isDigit(ch)) {
                count = count * 10 + (ch - '0');
            } else {
                for (int i = 0; i < count; i++) {
                    decoded.append(ch);
                }
                count = 0;
            }
        }

        return decoded.toString();
    }

    public static void main(String[] args) {
        String input = "AAAAAABBBCCCCC";

        System.out.println("Original: " + input);

        String encoded = encode(input);
        System.out.println("Encoded: " + encoded);
        System.out.printf("Compression ratio: %.2f%%\n",
            100.0 * encoded.length() / input.length());

        String decoded = decode(encoded);
        System.out.println("Decoded: " + decoded);
    }
}

Complexity Analysis

Time Complexity: O(n)
Space Complexity: O(n)
Compression Ratio: Depends on data
- Best case: Long runs → excellent compression
- Worst case: No runs → data expands

Huffman Coding

Huffman coding assigns variable-length codes to characters based on frequency. More frequent characters get shorter codes.

Algorithm

1. Count frequency of each character
2. Build min-heap of nodes (frequency, character)
3. While heap has more than one node:
   a. Extract two nodes with minimum frequency
   b. Create new node with sum of frequencies
   c. Make extracted nodes children
   d. Insert new node into heap
4. Build code table from tree (left=0, right=1)
5. Encode text using code table

Visualization

Text: "ABRACADABRA"

Frequency:
A: 5
B: 2
R: 2
C: 1
D: 1

Huffman Tree:
         (11)
        /    \
      (5)    (6)
       A    /   \
          (3)   (3)
          / \   / \
        (1)(2)(1)(2)
         C  B  D  R

Codes:
A: 0
B: 101
R: 111
C: 100
D: 110

Encoded: 01011110010001101011110
Original: 11 bytes × 8 bits = 88 bits
Compressed: 23 bits (73.9% reduction)

C++ Implementation

#include <iostream>
#include <queue>
#include <unordered_map>
#include <string>
using namespace std;

struct Node {
    char ch;
    int freq;
    Node *left, *right;

    Node(char c, int f) : ch(c), freq(f), left(nullptr), right(nullptr) {}
};

struct Compare {
    bool operator()(Node* a, Node* b) {
        return a->freq > b->freq;
    }
};

class HuffmanCoding {
private:
    Node* root;
    unordered_map<char, string> codes;
    unordered_map<string, char> reverseCode;

    void buildCodes(Node* node, string code) {
        if (!node) return;

        if (!node->left && !node->right) {
            codes[node->ch] = code;
            reverseCode[code] = node->ch;
        }

        buildCodes(node->left, code + "0");
        buildCodes(node->right, code + "1");
    }

    void deleteTree(Node* node) {
        if (!node) return;
        deleteTree(node->left);
        deleteTree(node->right);
        delete node;
    }

public:
    HuffmanCoding() : root(nullptr) {}

    ~HuffmanCoding() {
        deleteTree(root);
    }

    void buildTree(const string& text) {
        // Count frequencies
        unordered_map<char, int> freq;
        for (char ch : text) {
            freq[ch]++;
        }

        // Build min-heap
        priority_queue<Node*, vector<Node*>, Compare> pq;

        for (auto& pair : freq) {
            pq.push(new Node(pair.first, pair.second));
        }

        // Build Huffman tree
        while (pq.size() > 1) {
            Node* left = pq.top(); pq.pop();
            Node* right = pq.top(); pq.pop();

            Node* parent = new Node('\0', left->freq + right->freq);
            parent->left = left;
            parent->right = right;

            pq.push(parent);
        }

        root = pq.top();

        // Build code table
        buildCodes(root, "");
    }

    string encode(const string& text) {
        string encoded;
        for (char ch : text) {
            encoded += codes[ch];
        }
        return encoded;
    }

    string decode(const string& encoded) {
        string decoded;
        string current;

        for (char bit : encoded) {
            current += bit;
            if (reverseCode.count(current)) {
                decoded += reverseCode[current];
                current = "";
            }
        }

        return decoded;
    }

    void printCodes() {
        cout << "Huffman Codes:" << endl;
        for (auto& pair : codes) {
            cout << pair.first << ": " << pair.second << endl;
        }
    }

    double compressionRatio(const string& original, const string& compressed) {
        int originalBits = original.length() * 8;
        int compressedBits = compressed.length();
        return 100.0 * (originalBits - compressedBits) / originalBits;
    }
};

int main() {
    string text = "ABRACADABRA";

    cout << "Original text: " << text << endl;
    cout << "Original size: " << text.length() * 8 << " bits" << endl << endl;

    HuffmanCoding huffman;
    huffman.buildTree(text);

    huffman.printCodes();
    cout << endl;

    string encoded = huffman.encode(text);
    cout << "Encoded: " << encoded << endl;
    cout << "Encoded size: " << encoded.length() << " bits" << endl;
    cout << "Compression ratio: " << huffman.compressionRatio(text, encoded) << "%" << endl << endl;

    string decoded = huffman.decode(encoded);
    cout << "Decoded: " << decoded << endl;

    return 0;
}

Java Implementation

import java.util.*;

class HuffmanNode implements Comparable<HuffmanNode> {
    char ch;
    int freq;
    HuffmanNode left, right;

    HuffmanNode(char ch, int freq) {
        this.ch = ch;
        this.freq = freq;
        this.left = null;
        this.right = null;
    }

    public int compareTo(HuffmanNode other) {
        return this.freq - other.freq;
    }
}

public class HuffmanCoding {
    private HuffmanNode root;
    private Map<Character, String> codes;
    private Map<String, Character> reverseCode;

    public HuffmanCoding() {
        codes = new HashMap<>();
        reverseCode = new HashMap<>();
    }

    public void buildTree(String text) {
        // Count frequencies
        Map<Character, Integer> freq = new HashMap<>();
        for (char ch : text.toCharArray()) {
            freq.put(ch, freq.getOrDefault(ch, 0) + 1);
        }

        // Build min-heap
        PriorityQueue<HuffmanNode> pq = new PriorityQueue<>();
        for (Map.Entry<Character, Integer> entry : freq.entrySet()) {
            pq.offer(new HuffmanNode(entry.getKey(), entry.getValue()));
        }

        // Build Huffman tree
        while (pq.size() > 1) {
            HuffmanNode left = pq.poll();
            HuffmanNode right = pq.poll();

            HuffmanNode parent = new HuffmanNode('\0', left.freq + right.freq);
            parent.left = left;
            parent.right = right;

            pq.offer(parent);
        }

        root = pq.poll();

        // Build code table
        buildCodes(root, "");
    }

    private void buildCodes(HuffmanNode node, String code) {
        if (node == null) return;

        if (node.left == null && node.right == null) {
            codes.put(node.ch, code);
            reverseCode.put(code, node.ch);
        }

        buildCodes(node.left, code + "0");
        buildCodes(node.right, code + "1");
    }

    public String encode(String text) {
        StringBuilder encoded = new StringBuilder();
        for (char ch : text.toCharArray()) {
            encoded.append(codes.get(ch));
        }
        return encoded.toString();
    }

    public String decode(String encoded) {
        StringBuilder decoded = new StringBuilder();
        StringBuilder current = new StringBuilder();

        for (char bit : encoded.toCharArray()) {
            current.append(bit);
            if (reverseCode.containsKey(current.toString())) {
                decoded.append(reverseCode.get(current.toString()));
                current = new StringBuilder();
            }
        }

        return decoded.toString();
    }

    public void printCodes() {
        System.out.println("Huffman Codes:");
        for (Map.Entry<Character, String> entry : codes.entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }

    public static void main(String[] args) {
        String text = "ABRACADABRA";

        System.out.println("Original text: " + text);
        System.out.println("Original size: " + (text.length() * 8) + " bits\n");

        HuffmanCoding huffman = new HuffmanCoding();
        huffman.buildTree(text);

        huffman.printCodes();
        System.out.println();

        String encoded = huffman.encode(text);
        System.out.println("Encoded: " + encoded);
        System.out.println("Encoded size: " + encoded.length() + " bits");
        System.out.printf("Compression ratio: %.2f%%\n\n",
            100.0 * (text.length() * 8 - encoded.length()) / (text.length() * 8));

        String decoded = huffman.decode(encoded);
        System.out.println("Decoded: " + decoded);
    }
}

Complexity Analysis

Time Complexity: O(n log n) for building tree
Space Complexity: O(n) for tree and codes
Optimal: Huffman produces optimal prefix-free codes

Lempel-Ziv-Welch (LZW)

LZW is dictionary-based compression used in GIF and UNIX compress.

Basic Concept

Build dictionary dynamically:
- Start with single-character entries
- Add new patterns as encountered
- Output dictionary indices

Example: "ABABABA"
Dictionary: {A:0, B:1, AB:2, BA:3, ABA:4}
Output: 0,1,2,3 (instead of 0,1,0,1,0,1,0)

Comparison of Algorithms

Algorithm	Type	Best For	Compression Ratio	Speed
RLE	Lossless	Repeated data	Poor on random data	Very fast
Huffman	Lossless	Text, varied frequency	Good	Fast
LZW	Lossless	Text with patterns	Very good	Moderate
JPEG	Lossy	Photos	Excellent	Moderate
MP3	Lossy	Audio	Excellent	Moderate

Common Mistakes

RLE on random data - Can expand instead of compress
Not handling edge cases - Empty input, single character
Memory leaks - Not freeing Huffman tree
Wrong bit packing - Huffman codes must be stored as bits
Inefficient decoding - Linear search instead of tree traversal
Forgetting dictionary - Need to store/transmit for decompression
Integer overflow - In frequency counts

Debugging Tips

Test with known inputs - Verify against hand calculations
Check compression ratio - Should compress, not expand
Verify lossless - Encode then decode = original
Test edge cases - Empty, single char, all same char
Print intermediate states - Frequencies, codes, dictionary
Visualize Huffman tree - Draw tree structure
Measure performance - Time and space on real data

Mini Exercises

Implement RLE for binary data
Handle RLE worst case (alternating characters)
Build canonical Huffman codes
Implement adaptive Huffman coding
Compress image using RLE
Implement LZW compression
Compare algorithms on different data types
Build Burrows-Wheeler Transform
Implement arithmetic coding (basic)
Create hybrid compression (RLE + Huffman)

Review Questions

When does RLE compress well vs expand data?
Why is Huffman coding optimal for prefix-free codes?
What's the difference between static and adaptive compression?
How do dictionary-based algorithms work?
When would you choose lossy over lossless compression?

Reference Checklist

By the end of this chapter, you should be able to:

Implement Run-Length Encoding
Build Huffman tree from frequencies
Encode and decode with Huffman codes
Calculate compression ratios
Choose appropriate algorithm for data type
Understand lossless vs lossy trade-offs
Analyze compression complexity
Handle edge cases in compression

Next Steps

Chapter 17 explores encoding techniques—Base64, UTF-8, Hex encoding, and data representation methods used in web protocols, file formats, and data transmission.

Key Takeaway: Compression algorithms reduce data size by eliminating redundancy. RLE works for repeated data, Huffman for varied frequencies, and dictionary methods for patterns. Understanding these algorithms helps you choose the right compression for your application and optimize storage and transmission efficiency.