Compression Algorithms (RLE, Huffman Coding)

Chapter 16: Compression Algorithms (RLE, Huffman Coding)

Introduction

Data compression reduces file sizes by identifying and eliminating redundancy. This chapter explores fundamental compression algorithms including Run-Length Encoding (RLE), Huffman Coding, and dictionary-based methods. Understanding compression is essential for efficient storage, faster transmission, and optimal resource utilization.

Why This Matters

Compression powers everyday technology:

  • File formats: ZIP, PNG, JPEG, MP3, MP4
  • Web performance: GZIP compression for HTTP
  • Video streaming: Netflix, YouTube compression
  • Databases: Compressed storage saves space
  • Network protocols: Reduced bandwidth usage
  • Backups: Efficient storage of archival data
  • Text processing: Log file compression

How to Study This Chapter

  1. Understand lossless vs lossy - When data must be preserved exactly
  2. Trace compression by hand - Work through encoding/decoding examples
  3. Implement algorithms - RLE, Huffman, LZW
  4. Measure compression ratios - Compare algorithm effectiveness
  5. Analyze trade-offs - Compression ratio vs speed vs memory

Lossless vs Lossy Compression

Lossless Compression

Lossless: Original data can be perfectly reconstructed.

Examples: ZIP, PNG, FLAC, Huffman, RLE

Use cases: Text files, executables, medical images

Lossy Compression

Lossy: Some information is permanently lost.

Examples: JPEG, MP3, MP4

Use cases: Photos, music, video (where perfect accuracy isn't critical)

Run-Length Encoding (RLE)

RLE compresses sequences of repeated values.

Algorithm

Encode: Replace runs of identical values with (count, value)

"AAAAAABBBCCCCC" → "6A3B5C"

When RLE Works Well

  • Images with large uniform areas
  • Simple graphics
  • Fax transmissions
  • Screen recordings with static regions

C Implementation

#include <stdio.h>
#include <string.h>

void rleEncode(const char* input, char* output) {
    int len = strlen(input);
    int j = 0;

    for (int i = 0; i < len; i++) {
        int count = 1;

        // Count consecutive identical characters
        while (i + 1 < len && input[i] == input[i + 1]) {
            count++;
            i++;
        }

        // Write count and character
        j += sprintf(output + j, "%d%c", count, input[i]);
    }

    output[j] = '\0';
}

void rleDecode(const char* input, char* output) {
    int len = strlen(input);
    int j = 0;

    for (int i = 0; i < len; ) {
        int count = 0;

        // Read count
        while (i < len && input[i] >= '0' && input[i] <= '9') {
            count = count * 10 + (input[i] - '0');
            i++;
        }

        // Read character and repeat
        if (i < len) {
            char ch = input[i];
            for (int k = 0; k < count; k++) {
                output[j++] = ch;
            }
            i++;
        }
    }

    output[j] = '\0';
}

int main() {
    char input[] = "AAAAAABBBCCCCC";
    char encoded[100];
    char decoded[100];

    printf("Original: %s\n", input);

    rleEncode(input, encoded);
    printf("Encoded: %s\n", encoded);
    printf("Compression ratio: %.2f%%\n",
           100.0 * strlen(encoded) / strlen(input));

    rleDecode(encoded, decoded);
    printf("Decoded: %s\n", decoded);

    return 0;
}

C++ Implementation

#include <iostream>
#include <string>
#include <sstream>
using namespace std;

string rleEncode(const string& input) {
    if (input.empty()) return "";

    stringstream encoded;
    int count = 1;

    for (size_t i = 1; i <= input.length(); i++) {
        if (i < input.length() && input[i] == input[i - 1]) {
            count++;
        }
        else {
            encoded << count << input[i - 1];
            count = 1;
        }
    }

    return encoded.str();
}

string rleDecode(const string& input) {
    stringstream decoded;
    int count = 0;

    for (char ch : input) {
        if (isdigit(ch)) {
            count = count * 10 + (ch - '0');
        }
        else {
            for (int i = 0; i < count; i++) {
                decoded << ch;
            }
            count = 0;
        }
    }

    return decoded.str();
}

int main() {
    string input = "AAAAAABBBCCCCC";

    cout << "Original: " << input << endl;

    string encoded = rleEncode(input);
    cout << "Encoded: " << encoded << endl;
    cout << "Compression ratio: " << (100.0 * encoded.length() / input.length()) << "%" << endl;

    string decoded = rleDecode(encoded);
    cout << "Decoded: " << decoded << endl;

    return 0;
}

Java Implementation

public class RunLengthEncoding {
    public static String encode(String input) {
        if (input == null || input.isEmpty()) {
            return "";
        }

        StringBuilder encoded = new StringBuilder();
        int count = 1;

        for (int i = 1; i <= input.length(); i++) {
            if (i < input.length() && input.charAt(i) == input.charAt(i - 1)) {
                count++;
            } else {
                encoded.append(count).append(input.charAt(i - 1));
                count = 1;
            }
        }

        return encoded.toString();
    }

    public static String decode(String input) {
        StringBuilder decoded = new StringBuilder();
        int count = 0;

        for (char ch : input.toCharArray()) {
            if (Character.isDigit(ch)) {
                count = count * 10 + (ch - '0');
            } else {
                for (int i = 0; i < count; i++) {
                    decoded.append(ch);
                }
                count = 0;
            }
        }

        return decoded.toString();
    }

    public static void main(String[] args) {
        String input = "AAAAAABBBCCCCC";

        System.out.println("Original: " + input);

        String encoded = encode(input);
        System.out.println("Encoded: " + encoded);
        System.out.printf("Compression ratio: %.2f%%\n",
            100.0 * encoded.length() / input.length());

        String decoded = decode(encoded);
        System.out.println("Decoded: " + decoded);
    }
}

Complexity Analysis

  • Time Complexity: O(n)
  • Space Complexity: O(n)
  • Compression Ratio: Depends on data
    • Best case: Long runs → excellent compression
    • Worst case: No runs → data expands

Huffman Coding

Huffman coding assigns variable-length codes to characters based on frequency. More frequent characters get shorter codes.

Algorithm

1. Count frequency of each character
2. Build min-heap of nodes (frequency, character)
3. While heap has more than one node:
   a. Extract two nodes with minimum frequency
   b. Create new node with sum of frequencies
   c. Make extracted nodes children
   d. Insert new node into heap
4. Build code table from tree (left=0, right=1)
5. Encode text using code table

Visualization

Text: "ABRACADABRA"

Frequency:
A: 5
B: 2
R: 2
C: 1
D: 1

Huffman Tree:
         (11)
        /    \
      (5)    (6)
       A    /   \
          (3)   (3)
          / \   / \
        (1)(2)(1)(2)
         C  B  D  R

Codes:
A: 0
B: 101
R: 111
C: 100
D: 110

Encoded: 01011110010001101011110
Original: 11 bytes × 8 bits = 88 bits
Compressed: 23 bits (73.9% reduction)

C++ Implementation

#include <iostream>
#include <queue>
#include <unordered_map>
#include <string>
using namespace std;

struct Node {
    char ch;
    int freq;
    Node *left, *right;

    Node(char c, int f) : ch(c), freq(f), left(nullptr), right(nullptr) {}
};

struct Compare {
    bool operator()(Node* a, Node* b) {
        return a->freq > b->freq;
    }
};

class HuffmanCoding {
private:
    Node* root;
    unordered_map<char, string> codes;
    unordered_map<string, char> reverseCode;

    void buildCodes(Node* node, string code) {
        if (!node) return;

        if (!node->left && !node->right) {
            codes[node->ch] = code;
            reverseCode[code] = node->ch;
        }

        buildCodes(node->left, code + "0");
        buildCodes(node->right, code + "1");
    }

    void deleteTree(Node* node) {
        if (!node) return;
        deleteTree(node->left);
        deleteTree(node->right);
        delete node;
    }

public:
    HuffmanCoding() : root(nullptr) {}

    ~HuffmanCoding() {
        deleteTree(root);
    }

    void buildTree(const string& text) {
        // Count frequencies
        unordered_map<char, int> freq;
        for (char ch : text) {
            freq[ch]++;
        }

        // Build min-heap
        priority_queue<Node*, vector<Node*>, Compare> pq;

        for (auto& pair : freq) {
            pq.push(new Node(pair.first, pair.second));
        }

        // Build Huffman tree
        while (pq.size() > 1) {
            Node* left = pq.top(); pq.pop();
            Node* right = pq.top(); pq.pop();

            Node* parent = new Node('\0', left->freq + right->freq);
            parent->left = left;
            parent->right = right;

            pq.push(parent);
        }

        root = pq.top();

        // Build code table
        buildCodes(root, "");
    }

    string encode(const string& text) {
        string encoded;
        for (char ch : text) {
            encoded += codes[ch];
        }
        return encoded;
    }

    string decode(const string& encoded) {
        string decoded;
        string current;

        for (char bit : encoded) {
            current += bit;
            if (reverseCode.count(current)) {
                decoded += reverseCode[current];
                current = "";
            }
        }

        return decoded;
    }

    void printCodes() {
        cout << "Huffman Codes:" << endl;
        for (auto& pair : codes) {
            cout << pair.first << ": " << pair.second << endl;
        }
    }

    double compressionRatio(const string& original, const string& compressed) {
        int originalBits = original.length() * 8;
        int compressedBits = compressed.length();
        return 100.0 * (originalBits - compressedBits) / originalBits;
    }
};

int main() {
    string text = "ABRACADABRA";

    cout << "Original text: " << text << endl;
    cout << "Original size: " << text.length() * 8 << " bits" << endl << endl;

    HuffmanCoding huffman;
    huffman.buildTree(text);

    huffman.printCodes();
    cout << endl;

    string encoded = huffman.encode(text);
    cout << "Encoded: " << encoded << endl;
    cout << "Encoded size: " << encoded.length() << " bits" << endl;
    cout << "Compression ratio: " << huffman.compressionRatio(text, encoded) << "%" << endl << endl;

    string decoded = huffman.decode(encoded);
    cout << "Decoded: " << decoded << endl;

    return 0;
}

Java Implementation

import java.util.*;

class HuffmanNode implements Comparable<HuffmanNode> {
    char ch;
    int freq;
    HuffmanNode left, right;

    HuffmanNode(char ch, int freq) {
        this.ch = ch;
        this.freq = freq;
        this.left = null;
        this.right = null;
    }

    public int compareTo(HuffmanNode other) {
        return this.freq - other.freq;
    }
}

public class HuffmanCoding {
    private HuffmanNode root;
    private Map<Character, String> codes;
    private Map<String, Character> reverseCode;

    public HuffmanCoding() {
        codes = new HashMap<>();
        reverseCode = new HashMap<>();
    }

    public void buildTree(String text) {
        // Count frequencies
        Map<Character, Integer> freq = new HashMap<>();
        for (char ch : text.toCharArray()) {
            freq.put(ch, freq.getOrDefault(ch, 0) + 1);
        }

        // Build min-heap
        PriorityQueue<HuffmanNode> pq = new PriorityQueue<>();
        for (Map.Entry<Character, Integer> entry : freq.entrySet()) {
            pq.offer(new HuffmanNode(entry.getKey(), entry.getValue()));
        }

        // Build Huffman tree
        while (pq.size() > 1) {
            HuffmanNode left = pq.poll();
            HuffmanNode right = pq.poll();

            HuffmanNode parent = new HuffmanNode('\0', left.freq + right.freq);
            parent.left = left;
            parent.right = right;

            pq.offer(parent);
        }

        root = pq.poll();

        // Build code table
        buildCodes(root, "");
    }

    private void buildCodes(HuffmanNode node, String code) {
        if (node == null) return;

        if (node.left == null && node.right == null) {
            codes.put(node.ch, code);
            reverseCode.put(code, node.ch);
        }

        buildCodes(node.left, code + "0");
        buildCodes(node.right, code + "1");
    }

    public String encode(String text) {
        StringBuilder encoded = new StringBuilder();
        for (char ch : text.toCharArray()) {
            encoded.append(codes.get(ch));
        }
        return encoded.toString();
    }

    public String decode(String encoded) {
        StringBuilder decoded = new StringBuilder();
        StringBuilder current = new StringBuilder();

        for (char bit : encoded.toCharArray()) {
            current.append(bit);
            if (reverseCode.containsKey(current.toString())) {
                decoded.append(reverseCode.get(current.toString()));
                current = new StringBuilder();
            }
        }

        return decoded.toString();
    }

    public void printCodes() {
        System.out.println("Huffman Codes:");
        for (Map.Entry<Character, String> entry : codes.entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }

    public static void main(String[] args) {
        String text = "ABRACADABRA";

        System.out.println("Original text: " + text);
        System.out.println("Original size: " + (text.length() * 8) + " bits\n");

        HuffmanCoding huffman = new HuffmanCoding();
        huffman.buildTree(text);

        huffman.printCodes();
        System.out.println();

        String encoded = huffman.encode(text);
        System.out.println("Encoded: " + encoded);
        System.out.println("Encoded size: " + encoded.length() + " bits");
        System.out.printf("Compression ratio: %.2f%%\n\n",
            100.0 * (text.length() * 8 - encoded.length()) / (text.length() * 8));

        String decoded = huffman.decode(encoded);
        System.out.println("Decoded: " + decoded);
    }
}

Complexity Analysis

  • Time Complexity: O(n log n) for building tree
  • Space Complexity: O(n) for tree and codes
  • Optimal: Huffman produces optimal prefix-free codes

Lempel-Ziv-Welch (LZW)

LZW is dictionary-based compression used in GIF and UNIX compress.

Basic Concept

Build dictionary dynamically:
- Start with single-character entries
- Add new patterns as encountered
- Output dictionary indices

Example: "ABABABA"
Dictionary: {A:0, B:1, AB:2, BA:3, ABA:4}
Output: 0,1,2,3 (instead of 0,1,0,1,0,1,0)

Comparison of Algorithms

AlgorithmTypeBest ForCompression RatioSpeed
RLELosslessRepeated dataPoor on random dataVery fast
HuffmanLosslessText, varied frequencyGoodFast
LZWLosslessText with patternsVery goodModerate
JPEGLossyPhotosExcellentModerate
MP3LossyAudioExcellentModerate

Common Mistakes

  1. RLE on random data - Can expand instead of compress
  2. Not handling edge cases - Empty input, single character
  3. Memory leaks - Not freeing Huffman tree
  4. Wrong bit packing - Huffman codes must be stored as bits
  5. Inefficient decoding - Linear search instead of tree traversal
  6. Forgetting dictionary - Need to store/transmit for decompression
  7. Integer overflow - In frequency counts

Debugging Tips

  1. Test with known inputs - Verify against hand calculations
  2. Check compression ratio - Should compress, not expand
  3. Verify lossless - Encode then decode = original
  4. Test edge cases - Empty, single char, all same char
  5. Print intermediate states - Frequencies, codes, dictionary
  6. Visualize Huffman tree - Draw tree structure
  7. Measure performance - Time and space on real data

Mini Exercises

  1. Implement RLE for binary data
  2. Handle RLE worst case (alternating characters)
  3. Build canonical Huffman codes
  4. Implement adaptive Huffman coding
  5. Compress image using RLE
  6. Implement LZW compression
  7. Compare algorithms on different data types
  8. Build Burrows-Wheeler Transform
  9. Implement arithmetic coding (basic)
  10. Create hybrid compression (RLE + Huffman)

Review Questions

  1. When does RLE compress well vs expand data?
  2. Why is Huffman coding optimal for prefix-free codes?
  3. What's the difference between static and adaptive compression?
  4. How do dictionary-based algorithms work?
  5. When would you choose lossy over lossless compression?

Reference Checklist

By the end of this chapter, you should be able to:

  • Implement Run-Length Encoding
  • Build Huffman tree from frequencies
  • Encode and decode with Huffman codes
  • Calculate compression ratios
  • Choose appropriate algorithm for data type
  • Understand lossless vs lossy trade-offs
  • Analyze compression complexity
  • Handle edge cases in compression

Next Steps

Chapter 17 explores encoding techniques—Base64, UTF-8, Hex encoding, and data representation methods used in web protocols, file formats, and data transmission.


Key Takeaway: Compression algorithms reduce data size by eliminating redundancy. RLE works for repeated data, Huffman for varied frequencies, and dictionary methods for patterns. Understanding these algorithms helps you choose the right compression for your application and optimize storage and transmission efficiency.