Chapter 17: Encoding Techniques (Base64, UTF-8, Hex)
Introduction
Encoding transforms data from one format to another for transmission, storage, or display. Unlike compression (which reduces size) or encryption (which hides meaning), encoding simply represents data in a different format. This chapter explores Base64, hexadecimal, UTF-8, and other encoding schemes essential for web development, file formats, and data interchange.
Why This Matters
Encoding enables data interchange:
- Email attachments: Base64 for binary data in MIME
- URLs: Percent-encoding for special characters
- Web APIs: JSON with Base64-encoded binary
- Images in HTML: Data URLs with Base64
- Text files: UTF-8 for international characters
- Color codes: Hexadecimal in CSS/HTML
- Debugging: Hex dumps for binary inspection
How to Study This Chapter
- Understand encoding vs encryption - Different purposes
- Learn character encoding - ASCII, UTF-8, Unicode
- Implement Base64 - Understand bit manipulation
- Practice hex conversion - Binary to hex and back
- Study URL encoding - Web-safe characters
ASCII and Character Encoding
ASCII (American Standard Code for Information Interchange)
ASCII: 7-bit encoding for 128 characters (0-127).
Character ranges:
0-31: Control characters
32-47: Space and punctuation
48-57: Digits '0'-'9'
58-64: Punctuation
65-90: Uppercase 'A'-'Z'
91-96: Punctuation
97-122: Lowercase 'a'-'z'
123-126: Punctuation
127: DEL (control character)
Examples:
'A' = 65 = 0x41 = 01000001
'a' = 97 = 0x61 = 01100001
'0' = 48 = 0x30 = 00110000
C Implementation
#include <stdio.h>
void printAscii(char c) {
printf("'%c' = %d = 0x%02X = ", c, c, (unsigned char)c);
// Print binary
for (int i = 7; i >= 0; i--) {
printf("%d", (c >> i) & 1);
}
printf("\n");
}
int main() {
printAscii('A');
printAscii('a');
printAscii('0');
printAscii(' ');
return 0;
}
UTF-8 (Unicode Transformation Format - 8-bit)
UTF-8: Variable-length encoding for Unicode (1-4 bytes per character).
1 byte: 0xxxxxxx (U+0000-U+007F, ASCII compatible)
2 bytes: 110xxxxx 10xxxxxx (U+0080-U+07FF)
3 bytes: 1110xxxx 10xxxxxx 10xxxxxx (U+0800-U+FFFF)
4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (U+10000-U+10FFFF)
Examples:
'A' (U+0041): 01000001 (1 byte)
'é' (U+00E9): 11000011 10101001 (2 bytes)
'€' (U+20AC): 11100010 10000010 10101100 (3 bytes)
'😀' (U+1F600): 11110000 10011111 10011000 10000000 (4 bytes)
C++ UTF-8 Example
#include <iostream>
#include <string>
#include <iomanip>
using namespace std;
void printUtf8Bytes(const string& str) {
cout << "UTF-8 encoding of \"" << str << "\":" << endl;
for (unsigned char c : str) {
cout << hex << setw(2) << setfill('0') << (int)c << " ";
}
cout << dec << endl;
}
int utf8CharLength(unsigned char firstByte) {
if ((firstByte & 0x80) == 0) return 1; // 0xxxxxxx
if ((firstByte & 0xE0) == 0xC0) return 2; // 110xxxxx
if ((firstByte & 0xF0) == 0xE0) return 3; // 1110xxxx
if ((firstByte & 0xF8) == 0xF0) return 4; // 11110xxx
return 0; // Invalid
}
int main() {
string text = "Hello 世界 😀";
printUtf8Bytes(text);
cout << "\nCharacter breakdown:" << endl;
size_t i = 0;
while (i < text.length()) {
int len = utf8CharLength(text[i]);
if (len == 0) { // Not a valid lead byte: skip it so the loop always advances
cout << "Invalid byte at offset " << i << endl;
i++;
continue;
}
cout << "Character: ";
for (int j = 0; j < len && i + j < text.length(); j++) {
cout << hex << setw(2) << setfill('0')
<< (int)(unsigned char)text[i + j] << " ";
}
cout << dec << "(" << len << " bytes)" << endl;
i += len;
}
return 0;
}
Hexadecimal Encoding
Hexadecimal: Base-16 representation (0-9, A-F). Each hex digit represents 4 bits, so one byte is always written as two digits.
Uses
- Memory addresses
- Color codes (#FF5733)
- Hash values (SHA-256)
- Binary file dumps
- MAC addresses
C Implementation
#include <stdio.h>
#include <string.h>
#include <ctype.h>
void bytesToHex(const unsigned char* bytes, int len, char* hex) {
for (int i = 0; i < len; i++) {
sprintf(hex + (i * 2), "%02x", bytes[i]);
}
hex[len * 2] = '\0';
}
int hexToBytes(const char* hex, unsigned char* bytes) {
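int len = (int)strlen(hex);
if (len % 2 != 0) return -1; // Invalid hex string
for (int i = 0; i < len / 2; i++) {
// Reject non-hex characters before parsing the pair
if (!isxdigit((unsigned char)hex[i * 2]) || !isxdigit((unsigned char)hex[i * 2 + 1])) return -1;
sscanf(hex + (i * 2), "%2hhx", &bytes[i]);
}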
return len / 2;
}
int main() {
unsigned char data[] = "Hello";
char hex[100];
bytesToHex(data, 5, hex);
printf("Bytes to Hex: %s\n", hex);
unsigned char decoded[100];
int decodedLen = hexToBytes(hex, decoded);
decoded[decodedLen] = '\0';
printf("Hex to Bytes: %s\n", decoded);
return 0;
}
C++ Implementation
#include <cstdlib>
#include <iostream>
#include <string>
#include <sstream>
#include <iomanip>
using namespace std;
string bytesToHex(const unsigned char* bytes, size_t len) {
stringstream ss;
for (size_t i = 0; i < len; i++) {
ss << hex << setw(2) << setfill('0') << (int)bytes[i];
}
return ss.str();
}
string hexToBytes(const string& hex) {
string bytes;
for (size_t i = 0; i < hex.length(); i += 2) {
string byteString = hex.substr(i, 2);
unsigned char byte = (unsigned char)strtol(byteString.c_str(), nullptr, 16);
bytes += byte;
}
return bytes;
}
int main() {
string data = "Hello";
string hex = bytesToHex((unsigned char*)data.c_str(), data.length());
cout << "Bytes to Hex: " << hex << endl;
string decoded = hexToBytes(hex);
cout << "Hex to Bytes: " << decoded << endl;
return 0;
}
Java Implementation
public class HexEncoding {
private static final char[] HEX_ARRAY = "0123456789abcdef".toCharArray();
public static String bytesToHex(byte[] bytes) {
char[] hexChars = new char[bytes.length * 2];
for (int i = 0; i < bytes.length; i++) {
int v = bytes[i] & 0xFF;
hexChars[i * 2] = HEX_ARRAY[v >>> 4];
hexChars[i * 2 + 1] = HEX_ARRAY[v & 0x0F];
}
return new String(hexChars);
}
public static byte[] hexToBytes(String hex) {
int len = hex.length();
if (len % 2 != 0) {
throw new IllegalArgumentException("Hex string must have an even length");
}
byte[] bytes = new byte[len / 2];
for (int i = 0; i < len; i += 2) {
bytes[i / 2] = (byte) ((Character.digit(hex.charAt(i), 16) << 4)
+ Character.digit(hex.charAt(i + 1), 16));
}
return bytes;
}
public static void main(String[] args) {
String data = "Hello";
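// Name the charset explicitly so the result does not depend on the platform default
byte[] bytes = data.getBytes(java.nio.charset.StandardCharsets.UTF_8);
String hex = bytesToHex(bytes);
System.out.println("Bytes to Hex: " + hex);
byte[] decoded = hexToBytes(hex);
System.out.println("Hex to Bytes: " + new String(decoded, java.nio.charset.StandardCharsets.UTF_8));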
}
}
Base64 Encoding
Base64: Encodes binary data as ASCII text using 64 characters (A-Z, a-z, 0-9, +, /).
How It Works
1. Take 3 bytes (24 bits)
2. Split into four 6-bit groups
3. Map each 6-bit value to Base64 character
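4. If the input length is not a multiple of 3, fill the missing bits of the last group with zeros and pad the output with '=' so its length is a multiple of 4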
Example: "Man"
M = 77 = 01001101
a = 97 = 01100001
n = 110 = 01101110
Combined: 010011010110000101101110
Split into 6-bit groups:
010011 = 19 = 'T'
010110 = 22 = 'W'
000101 = 5 = 'F'
101110 = 46 = 'u'
Result: "TWFu"
C Implementation
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
static const char base64_chars[] =
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
char* base64Encode(const unsigned char* data, size_t input_length) {
size_t output_length = 4 * ((input_length + 2) / 3);
char* encoded = (char*)malloc(output_length + 1);
if (!encoded) return NULL;
size_t i, j;
for (i = 0, j = 0; i < input_length;) {
uint32_t octet_a = i < input_length ? data[i++] : 0;
uint32_t octet_b = i < input_length ? data[i++] : 0;
uint32_t octet_c = i < input_length ? data[i++] : 0;
uint32_t triple = (octet_a << 16) + (octet_b << 8) + octet_c;
encoded[j++] = base64_chars[(triple >> 18) & 0x3F];
encoded[j++] = base64_chars[(triple >> 12) & 0x3F];
encoded[j++] = base64_chars[(triple >> 6) & 0x3F];
encoded[j++] = base64_chars[triple & 0x3F];
}
// Padding
for (i = 0; i < (3 - input_length % 3) % 3; i++) {
encoded[output_length - 1 - i] = '=';
}
encoded[output_length] = '\0';
return encoded;
}
int main() {
const char* input = "Man";
char* encoded = base64Encode((const unsigned char*)input, strlen(input));
if (!encoded) return 1; // Allocation failed
printf("Original: %s\n", input);
printf("Base64: %s\n", encoded);
free(encoded);
return 0;
}
C++ Implementation
#include <cctype>
#include <iostream>
#include <string>
using namespace std;
const string base64_chars =
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
string base64Encode(const unsigned char* data, size_t len) {
string encoded;
int i = 0;
unsigned char char_array_3[3];
unsigned char char_array_4[4];
while (len--) {
char_array_3[i++] = *(data++);
if (i == 3) {
char_array_4[0] = (char_array_3[0] & 0xfc) >> 2;
char_array_4[1] = ((char_array_3[0] & 0x03) << 4) + ((char_array_3[1] & 0xf0) >> 4);
char_array_4[2] = ((char_array_3[1] & 0x0f) << 2) + ((char_array_3[2] & 0xc0) >> 6);
char_array_4[3] = char_array_3[2] & 0x3f;
for (i = 0; i < 4; i++) {
encoded += base64_chars[char_array_4[i]];
}
i = 0;
}
}
if (i) {
for (int j = i; j < 3; j++) {
char_array_3[j] = '\0';
}
char_array_4[0] = (char_array_3[0] & 0xfc) >> 2;
char_array_4[1] = ((char_array_3[0] & 0x03) << 4) + ((char_array_3[1] & 0xf0) >> 4);
char_array_4[2] = ((char_array_3[1] & 0x0f) << 2) + ((char_array_3[2] & 0xc0) >> 6);
for (int j = 0; j < i + 1; j++) {
encoded += base64_chars[char_array_4[j]];
}
while (i++ < 3) {
encoded += '=';
}
}
return encoded;
}
string base64Decode(const string& encoded) {
int in_len = encoded.size();
int i = 0, j = 0, in = 0;
unsigned char char_array_4[4], char_array_3[3];
string decoded;
// The alphabet check must be parenthesized: && binds tighter than ||
while (in_len-- && encoded[in] != '=' && (isalnum((unsigned char)encoded[in]) || encoded[in] == '+' || encoded[in] == '/')) {
char_array_4[i++] = encoded[in++];
if (i == 4) {
for (i = 0; i < 4; i++) {
char_array_4[i] = base64_chars.find(char_array_4[i]);
}
char_array_3[0] = (char_array_4[0] << 2) + ((char_array_4[1] & 0x30) >> 4);
char_array_3[1] = ((char_array_4[1] & 0xf) << 4) + ((char_array_4[2] & 0x3c) >> 2);
char_array_3[2] = ((char_array_4[2] & 0x3) << 6) + char_array_4[3];
for (i = 0; i < 3; i++) {
decoded += char_array_3[i];
}
i = 0;
}
}
if (i) {
for (j = 0; j < i; j++) {
char_array_4[j] = base64_chars.find(char_array_4[j]);
}
char_array_3[0] = (char_array_4[0] << 2) + ((char_array_4[1] & 0x30) >> 4);
char_array_3[1] = ((char_array_4[1] & 0xf) << 4) + ((char_array_4[2] & 0x3c) >> 2);
for (j = 0; j < i - 1; j++) {
decoded += char_array_3[j];
}
}
return decoded;
}
int main() {
string input = "Hello, World!";
cout << "Original: " << input << endl;
string encoded = base64Encode((unsigned char*)input.c_str(), input.length());
cout << "Base64: " << encoded << endl;
string decoded = base64Decode(encoded);
cout << "Decoded: " << decoded << endl;
return 0;
}
Java Implementation (using built-in)
import java.util.Base64;
import java.nio.charset.StandardCharsets;
public class Base64Encoding {
public static void main(String[] args) {
String input = "Hello, World!";
// Encode
String encoded = Base64.getEncoder().encodeToString(input.getBytes(StandardCharsets.UTF_8));
System.out.println("Original: " + input);
System.out.println("Base64: " + encoded);
// Decode
byte[] decoded = Base64.getDecoder().decode(encoded);
String decodedStr = new String(decoded, StandardCharsets.UTF_8);
System.out.println("Decoded: " + decodedStr);
}
}
URL Encoding (Percent Encoding)
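URL encoding: Encodes special characters as %XX, where XX is the byte value written as two hexadecimal digits. Non-ASCII characters are first encoded as UTF-8, and each resulting byte is percent-encoded.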
Characters Requiring Encoding
Space: %20
!: %21
#: %23
$: %24
%: %25
&: %26
': %27
(: %28
): %29
C++ Implementation
#include <cctype>
#include <iostream>
#include <string>
#include <sstream>
#include <iomanip>
using namespace std;
string urlEncode(const string& str) {
ostringstream escaped;
escaped.fill('0');
escaped << hex;
for (char c : str) {
// Cast to unsigned char: passing a negative char to isalnum is undefined behavior
if (isalnum((unsigned char)c) || c == '-' || c == '_' || c == '.' || c == '~') {
escaped << c;
}
else {
escaped << uppercase;
escaped << '%' << setw(2) << int((unsigned char)c);
escaped << nouppercase;
}
}
return escaped.str();
}
string urlDecode(const string& str) {
string decoded;
for (size_t i = 0; i < str.length(); i++) {
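// Decode only well-formed %XX sequences; a malformed '%' is kept as a literal character
if (str[i] == '%' && i + 2 < str.length()
&& isxdigit((unsigned char)str[i + 1]) && isxdigit((unsigned char)str[i + 2])) {
string hex = str.substr(i + 1, 2);
int value = stoi(hex, nullptr, 16);
decoded += static_cast<char>(value);
i += 2;
}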
else if (str[i] == '+') {
decoded += ' ';
}
else {
decoded += str[i];
}
}
return decoded;
}
int main() {
string url = "Hello World! @#$";
cout << "Original: " << url << endl;
string encoded = urlEncode(url);
cout << "URL Encoded: " << encoded << endl;
string decoded = urlDecode(encoded);
cout << "Decoded: " << decoded << endl;
return 0;
}
Comparison of Encoding Schemes
| Encoding | Purpose | Output Size | Use Case |
|---|---|---|---|
| ASCII | Text | 1 byte/char | English text |
| UTF-8 | Unicode text | 1-4 bytes/char | International text |
| Hex | Binary display | 2 chars/byte | Debugging, hashes |
| Base64 | Binary in text | ~133% of input | Email, JSON, data URLs |
| URL Encoding | Web URLs | Varies | Query strings |
Common Mistakes
- Confusing encoding with encryption - Encoding is not secure
- Not handling padding - Base64 requires proper padding
- Character set issues - Assuming ASCII when UTF-8 is needed
- Buffer overflows - Not allocating enough space for encoded data
- Double encoding - Encoding already encoded data
- Wrong URL encoding - Not encoding all special characters
- Byte order issues - Endianness matters for UTF-16 and UTF-32; UTF-8 is a byte sequence with no byte-order variants
Debugging Tips
- Use online tools - Verify encoding/decoding
- Print byte values - Inspect actual bytes
- Test with known inputs - "Man" → "TWFu" for Base64
- Check padding - Base64 padding must be correct
- Validate UTF-8 - Ensure valid byte sequences
- Compare with standards - RFC specifications
- Handle edge cases - Empty strings, special characters
Mini Exercises
- Implement Base32 encoding
- Convert between different encodings
- Validate UTF-8 sequences
- Implement ROT13 cipher
- Create ASCII art encoder
- Build HTML entity encoder
- Implement quoted-printable encoding
- Create Unicode normalizer
- Build Base85 (ASCII85) encoder
- Implement punycode for IDN
Review Questions
- What's the difference between encoding and encryption?
- Why is Base64 used for binary data in emails?
- How does UTF-8 maintain ASCII compatibility?
- When should you use URL encoding?
- What are the advantages of hexadecimal representation?
Reference Checklist
By the end of this chapter, you should be able to:
- Convert between ASCII and binary
- Understand UTF-8 encoding
- Implement hexadecimal encoding/decoding
- Implement Base64 encoding/decoding
- Use URL encoding for web applications
- Choose appropriate encoding for use case
- Handle multi-byte character encodings
- Debug encoding issues
Next Steps
Chapter 18 explores advanced algorithm paradigms: Divide & Conquer, Greedy, Dynamic Programming, and Backtracking. These fundamental techniques solve complex problems efficiently.
Key Takeaway: Encoding transforms data representation without changing meaning. Base64 enables binary data in text formats. UTF-8 supports international characters. Hexadecimal provides readable binary representation. Understanding encoding is essential for web development, data interchange, and system integration.