Chapter 16: Compression Algorithms (RLE, Huffman Coding)
Introduction
Data compression reduces file sizes by identifying and eliminating redundancy. This chapter explores fundamental compression algorithms including Run-Length Encoding (RLE), Huffman Coding, and dictionary-based methods. Understanding compression is essential for efficient storage, faster transmission, and optimal resource utilization.
Why This Matters
Compression powers everyday technology:
- File formats: ZIP, PNG, JPEG, MP3, MP4
- Web performance: GZIP compression for HTTP
- Video streaming: Netflix, YouTube compression
- Databases: Compressed storage saves space
- Network protocols: Reduced bandwidth usage
- Backups: Efficient storage of archival data
- Text processing: Log file compression
How to Study This Chapter
- Understand lossless vs lossy - When data must be preserved exactly
- Trace compression by hand - Work through encoding/decoding examples
- Implement algorithms - RLE, Huffman, LZW
- Measure compression ratios - Compare algorithm effectiveness
- Analyze trade-offs - Compression ratio vs speed vs memory
Lossless vs Lossy Compression
Lossless Compression
Lossless: Original data can be perfectly reconstructed.
Examples: ZIP, PNG, FLAC, Huffman, RLE
Use cases: Text files, executables, medical images
Lossy Compression
Lossy: Some information is permanently lost.
Examples: JPEG, MP3, MP4
Use cases: Photos, music, video (where perfect accuracy isn't critical)
Run-Length Encoding (RLE)
RLE compresses sequences of repeated values.
Algorithm
Encode: Replace runs of identical values with (count, value)
"AAAAAABBBCCCCC" → "6A3B5C"
When RLE Works Well
- Images with large uniform areas
- Simple graphics
- Fax transmissions
- Screen recordings with static regions
C Implementation
#include <stdio.h>
#include <string.h>
void rleEncode(const char* input, char* output) {
int len = strlen(input);
int j = 0;
for (int i = 0; i < len; i++) {
int count = 1;
// Count consecutive identical characters
while (i + 1 < len && input[i] == input[i + 1]) {
count++;
i++;
}
// Write count and character
j += sprintf(output + j, "%d%c", count, input[i]);
}
output[j] = '\0';
}
void rleDecode(const char* input, char* output) {
int len = strlen(input);
int j = 0;
for (int i = 0; i < len; ) {
int count = 0;
// Read count
while (i < len && input[i] >= '0' && input[i] <= '9') {
count = count * 10 + (input[i] - '0');
i++;
}
// Read character and repeat
if (i < len) {
char ch = input[i];
for (int k = 0; k < count; k++) {
output[j++] = ch;
}
i++;
}
}
output[j] = '\0';
}
int main() {
char input[] = "AAAAAABBBCCCCC";
char encoded[100];
char decoded[100];
printf("Original: %s\n", input);
rleEncode(input, encoded);
printf("Encoded: %s\n", encoded);
printf("Compression ratio: %.2f%%\n",
100.0 * strlen(encoded) / strlen(input));
rleDecode(encoded, decoded);
printf("Decoded: %s\n", decoded);
return 0;
}
C++ Implementation
#include <iostream>
#include <string>
#include <sstream>
using namespace std;
string rleEncode(const string& input) {
if (input.empty()) return "";
stringstream encoded;
int count = 1;
for (size_t i = 1; i <= input.length(); i++) {
if (i < input.length() && input[i] == input[i - 1]) {
count++;
}
else {
encoded << count << input[i - 1];
count = 1;
}
}
return encoded.str();
}
string rleDecode(const string& input) {
stringstream decoded;
int count = 0;
for (char ch : input) {
if (isdigit(ch)) {
count = count * 10 + (ch - '0');
}
else {
for (int i = 0; i < count; i++) {
decoded << ch;
}
count = 0;
}
}
return decoded.str();
}
int main() {
string input = "AAAAAABBBCCCCC";
cout << "Original: " << input << endl;
string encoded = rleEncode(input);
cout << "Encoded: " << encoded << endl;
cout << "Compression ratio: " << (100.0 * encoded.length() / input.length()) << "%" << endl;
string decoded = rleDecode(encoded);
cout << "Decoded: " << decoded << endl;
return 0;
}
Java Implementation
public class RunLengthEncoding {
public static String encode(String input) {
if (input == null || input.isEmpty()) {
return "";
}
StringBuilder encoded = new StringBuilder();
int count = 1;
for (int i = 1; i <= input.length(); i++) {
if (i < input.length() && input.charAt(i) == input.charAt(i - 1)) {
count++;
} else {
encoded.append(count).append(input.charAt(i - 1));
count = 1;
}
}
return encoded.toString();
}
public static String decode(String input) {
StringBuilder decoded = new StringBuilder();
int count = 0;
for (char ch : input.toCharArray()) {
if (Character.isDigit(ch)) {
count = count * 10 + (ch - '0');
} else {
for (int i = 0; i < count; i++) {
decoded.append(ch);
}
count = 0;
}
}
return decoded.toString();
}
public static void main(String[] args) {
String input = "AAAAAABBBCCCCC";
System.out.println("Original: " + input);
String encoded = encode(input);
System.out.println("Encoded: " + encoded);
System.out.printf("Compression ratio: %.2f%%\n",
100.0 * encoded.length() / input.length());
String decoded = decode(encoded);
System.out.println("Decoded: " + decoded);
}
}
Complexity Analysis
- Time Complexity: O(n)
- Space Complexity: O(n)
- Compression Ratio: Depends on data
- Best case: Long runs → excellent compression
- Worst case: No runs → data expands
Huffman Coding
Huffman coding assigns variable-length codes to characters based on frequency. More frequent characters get shorter codes.
Algorithm
1. Count frequency of each character
2. Build min-heap of nodes (frequency, character)
3. While heap has more than one node:
a. Extract two nodes with minimum frequency
b. Create new node with sum of frequencies
c. Make extracted nodes children
d. Insert new node into heap
4. Build code table from tree (left=0, right=1)
5. Encode text using code table
Visualization
Text: "ABRACADABRA"
Frequency:
A: 5
B: 2
R: 2
C: 1
D: 1
Huffman Tree:
(11)
/ \
(5) (6)
A / \
(3) (3)
/ \ / \
(1)(2)(1)(2)
C B D R
Codes:
A: 0
B: 101
R: 111
C: 100
D: 110
Encoded: 01011110010001101011110
Original: 11 bytes × 8 bits = 88 bits
Compressed: 23 bits (73.9% reduction)
C++ Implementation
#include <iostream>
#include <queue>
#include <unordered_map>
#include <string>
using namespace std;
struct Node {
char ch;
int freq;
Node *left, *right;
Node(char c, int f) : ch(c), freq(f), left(nullptr), right(nullptr) {}
};
struct Compare {
bool operator()(Node* a, Node* b) {
return a->freq > b->freq;
}
};
class HuffmanCoding {
private:
Node* root;
unordered_map<char, string> codes;
unordered_map<string, char> reverseCode;
void buildCodes(Node* node, string code) {
if (!node) return;
if (!node->left && !node->right) {
codes[node->ch] = code;
reverseCode[code] = node->ch;
}
buildCodes(node->left, code + "0");
buildCodes(node->right, code + "1");
}
void deleteTree(Node* node) {
if (!node) return;
deleteTree(node->left);
deleteTree(node->right);
delete node;
}
public:
HuffmanCoding() : root(nullptr) {}
~HuffmanCoding() {
deleteTree(root);
}
void buildTree(const string& text) {
// Count frequencies
unordered_map<char, int> freq;
for (char ch : text) {
freq[ch]++;
}
// Build min-heap
priority_queue<Node*, vector<Node*>, Compare> pq;
for (auto& pair : freq) {
pq.push(new Node(pair.first, pair.second));
}
// Build Huffman tree
while (pq.size() > 1) {
Node* left = pq.top(); pq.pop();
Node* right = pq.top(); pq.pop();
Node* parent = new Node('\0', left->freq + right->freq);
parent->left = left;
parent->right = right;
pq.push(parent);
}
root = pq.top();
// Build code table
buildCodes(root, "");
}
string encode(const string& text) {
string encoded;
for (char ch : text) {
encoded += codes[ch];
}
return encoded;
}
string decode(const string& encoded) {
string decoded;
string current;
for (char bit : encoded) {
current += bit;
if (reverseCode.count(current)) {
decoded += reverseCode[current];
current = "";
}
}
return decoded;
}
void printCodes() {
cout << "Huffman Codes:" << endl;
for (auto& pair : codes) {
cout << pair.first << ": " << pair.second << endl;
}
}
double compressionRatio(const string& original, const string& compressed) {
int originalBits = original.length() * 8;
int compressedBits = compressed.length();
return 100.0 * (originalBits - compressedBits) / originalBits;
}
};
int main() {
string text = "ABRACADABRA";
cout << "Original text: " << text << endl;
cout << "Original size: " << text.length() * 8 << " bits" << endl << endl;
HuffmanCoding huffman;
huffman.buildTree(text);
huffman.printCodes();
cout << endl;
string encoded = huffman.encode(text);
cout << "Encoded: " << encoded << endl;
cout << "Encoded size: " << encoded.length() << " bits" << endl;
cout << "Compression ratio: " << huffman.compressionRatio(text, encoded) << "%" << endl << endl;
string decoded = huffman.decode(encoded);
cout << "Decoded: " << decoded << endl;
return 0;
}
Java Implementation
import java.util.*;
class HuffmanNode implements Comparable<HuffmanNode> {
char ch;
int freq;
HuffmanNode left, right;
HuffmanNode(char ch, int freq) {
this.ch = ch;
this.freq = freq;
this.left = null;
this.right = null;
}
public int compareTo(HuffmanNode other) {
return this.freq - other.freq;
}
}
public class HuffmanCoding {
private HuffmanNode root;
private Map<Character, String> codes;
private Map<String, Character> reverseCode;
public HuffmanCoding() {
codes = new HashMap<>();
reverseCode = new HashMap<>();
}
public void buildTree(String text) {
// Count frequencies
Map<Character, Integer> freq = new HashMap<>();
for (char ch : text.toCharArray()) {
freq.put(ch, freq.getOrDefault(ch, 0) + 1);
}
// Build min-heap
PriorityQueue<HuffmanNode> pq = new PriorityQueue<>();
for (Map.Entry<Character, Integer> entry : freq.entrySet()) {
pq.offer(new HuffmanNode(entry.getKey(), entry.getValue()));
}
// Build Huffman tree
while (pq.size() > 1) {
HuffmanNode left = pq.poll();
HuffmanNode right = pq.poll();
HuffmanNode parent = new HuffmanNode('\0', left.freq + right.freq);
parent.left = left;
parent.right = right;
pq.offer(parent);
}
root = pq.poll();
// Build code table
buildCodes(root, "");
}
private void buildCodes(HuffmanNode node, String code) {
if (node == null) return;
if (node.left == null && node.right == null) {
codes.put(node.ch, code);
reverseCode.put(code, node.ch);
}
buildCodes(node.left, code + "0");
buildCodes(node.right, code + "1");
}
public String encode(String text) {
StringBuilder encoded = new StringBuilder();
for (char ch : text.toCharArray()) {
encoded.append(codes.get(ch));
}
return encoded.toString();
}
public String decode(String encoded) {
StringBuilder decoded = new StringBuilder();
StringBuilder current = new StringBuilder();
for (char bit : encoded.toCharArray()) {
current.append(bit);
if (reverseCode.containsKey(current.toString())) {
decoded.append(reverseCode.get(current.toString()));
current = new StringBuilder();
}
}
return decoded.toString();
}
public void printCodes() {
System.out.println("Huffman Codes:");
for (Map.Entry<Character, String> entry : codes.entrySet()) {
System.out.println(entry.getKey() + ": " + entry.getValue());
}
}
public static void main(String[] args) {
String text = "ABRACADABRA";
System.out.println("Original text: " + text);
System.out.println("Original size: " + (text.length() * 8) + " bits\n");
HuffmanCoding huffman = new HuffmanCoding();
huffman.buildTree(text);
huffman.printCodes();
System.out.println();
String encoded = huffman.encode(text);
System.out.println("Encoded: " + encoded);
System.out.println("Encoded size: " + encoded.length() + " bits");
System.out.printf("Compression ratio: %.2f%%\n\n",
100.0 * (text.length() * 8 - encoded.length()) / (text.length() * 8));
String decoded = huffman.decode(encoded);
System.out.println("Decoded: " + decoded);
}
}
Complexity Analysis
- Time Complexity: O(n log n) for building tree
- Space Complexity: O(n) for tree and codes
- Optimal: Huffman produces optimal prefix-free codes
Lempel-Ziv-Welch (LZW)
LZW is dictionary-based compression used in GIF and UNIX compress.
Basic Concept
Build dictionary dynamically:
- Start with single-character entries
- Add new patterns as encountered
- Output dictionary indices
Example: "ABABABA"
Dictionary: {A:0, B:1, AB:2, BA:3, ABA:4}
Output: 0,1,2,3 (instead of 0,1,0,1,0,1,0)
Comparison of Algorithms
| Algorithm | Type | Best For | Compression Ratio | Speed |
|---|---|---|---|---|
| RLE | Lossless | Repeated data | Poor on random data | Very fast |
| Huffman | Lossless | Text, varied frequency | Good | Fast |
| LZW | Lossless | Text with patterns | Very good | Moderate |
| JPEG | Lossy | Photos | Excellent | Moderate |
| MP3 | Lossy | Audio | Excellent | Moderate |
Common Mistakes
- RLE on random data - Can expand instead of compress
- Not handling edge cases - Empty input, single character
- Memory leaks - Not freeing Huffman tree
- Wrong bit packing - Huffman codes must be stored as bits
- Inefficient decoding - Linear search instead of tree traversal
- Forgetting dictionary - Need to store/transmit for decompression
- Integer overflow - In frequency counts
Debugging Tips
- Test with known inputs - Verify against hand calculations
- Check compression ratio - Should compress, not expand
- Verify lossless - Encode then decode = original
- Test edge cases - Empty, single char, all same char
- Print intermediate states - Frequencies, codes, dictionary
- Visualize Huffman tree - Draw tree structure
- Measure performance - Time and space on real data
Mini Exercises
- Implement RLE for binary data
- Handle RLE worst case (alternating characters)
- Build canonical Huffman codes
- Implement adaptive Huffman coding
- Compress image using RLE
- Implement LZW compression
- Compare algorithms on different data types
- Build Burrows-Wheeler Transform
- Implement arithmetic coding (basic)
- Create hybrid compression (RLE + Huffman)
Review Questions
- When does RLE compress well vs expand data?
- Why is Huffman coding optimal for prefix-free codes?
- What's the difference between static and adaptive compression?
- How do dictionary-based algorithms work?
- When would you choose lossy over lossless compression?
Reference Checklist
By the end of this chapter, you should be able to:
- Implement Run-Length Encoding
- Build Huffman tree from frequencies
- Encode and decode with Huffman codes
- Calculate compression ratios
- Choose appropriate algorithm for data type
- Understand lossless vs lossy trade-offs
- Analyze compression complexity
- Handle edge cases in compression
Next Steps
Chapter 17 explores encoding techniques—Base64, UTF-8, Hex encoding, and data representation methods used in web protocols, file formats, and data transmission.
Key Takeaway: Compression algorithms reduce data size by eliminating redundancy. RLE works for repeated data, Huffman for varied frequencies, and dictionary methods for patterns. Understanding these algorithms helps you choose the right compression for your application and optimize storage and transmission efficiency.