Chapter 10: Advanced Assembly Programming
Introduction
Moving beyond basic assembly syntax, advanced assembly programming involves implementing complex algorithms, data structures, and interfacing with high-level code. This chapter explores real-world assembly programming techniques used in bootloaders, kernels, and performance-critical system components.
Why This Matters
While most code is written in C or higher-level languages, certain tasks require assembly: bootloader initialization, context switching, interrupt handlers, CPU-specific optimizations, and inline assembly in kernels. Understanding advanced assembly equips you to read, write, and debug the lowest-level system code.
How to Study This Chapter
- Practice incrementally - Start with simple programs, add complexity gradually
- Use debuggers - GDB and LLDB are essential for understanding assembly flow
- Compare with C - Compile C to assembly to see patterns
- Test thoroughly - Assembly bugs can be subtle and dangerous
Implementing Data Structures in Assembly
Arrays
Arrays in assembly are just consecutive memory locations.
x64 Example: Sum array elements
section .data
    array   dd 10, 20, 30, 40, 50   ; 5 integers (4 bytes each)
    length  equ 5
section .bss
    sum     resd 1                  ; Reserve space for sum
section .text
global main
main:
    push rbp
    mov rbp, rsp
    xor rax, rax                    ; sum = 0
    xor rcx, rcx                    ; index = 0
.loop:
    cmp rcx, length                 ; if (index >= length)
    jge .done                       ;     exit the loop
    add eax, [array + rcx*4]        ; sum += array[index]
    inc rcx                         ; index++
    jmp .loop
.done:
    mov [sum], eax                  ; Store result
    pop rbp
    ret
ARM Example: Array access
.data
array:  .word 10, 20, 30, 40, 50
length: .word 5
.text
.global main
main:
    push {r4, lr}               @ Save r4 (callee-saved) and return address
    ldr r1, =array              @ r1 = array address
    ldr r2, =length             @ r2 = length address
    ldr r2, [r2]                @ r2 = length value
    mov r0, #0                  @ sum = 0
    mov r3, #0                  @ index = 0
loop:
    cmp r3, r2                  @ if (index >= length)
    bge done                    @     exit the loop
    ldr r4, [r1, r3, lsl #2]    @ r4 = array[index]
    add r0, r0, r4              @ sum += r4
    add r3, r3, #1              @ index++
    b loop
done:
    pop {r4, pc}                @ Restore r4 and return (sum in r0)
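For comparison, the address arithmetic that both listings perform (base + index * element size) can be sketched in C. This is an illustrative sketch rather than part of the original examples; the names element_address and sum_array_by_address are hypothetical, and it assumes 4-byte integers as in the listings.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: address of element `index` in an array of 4-byte
   integers, computed explicitly as base + index * 4, which is what the
   x64 operand [array + rcx*4] and the ARM operand [r1, r3, lsl #2] encode. */
const int32_t *element_address(const int32_t *base, size_t index) {
    const unsigned char *bytes = (const unsigned char *)base;
    return (const int32_t *)(bytes + index * sizeof(int32_t));
}

/* Sum `length` elements by walking explicit addresses instead of base[i]. */
int32_t sum_array_by_address(const int32_t *base, size_t length) {
    int32_t sum = 0;
    for (size_t index = 0; index < length; index++) {
        sum += *element_address(base, index);
    }
    return sum;
}
```

Calling sum_array_by_address on the array 10, 20, 30, 40, 50 with length 5 yields 150, the same value the assembly loops accumulate.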
Strings
Strings are null-terminated character arrays.
x64 Example: String length
section .text
global strlen
; size_t strlen(const char *str)
; Input:  rdi = string pointer
; Output: rax = length
strlen:
    xor rax, rax                ; length = 0
.loop:
    cmp byte [rdi + rax], 0     ; if (str[length] == '\0')
    je .done                    ;     exit the loop
    inc rax                     ; length++
    jmp .loop
.done:
    ret
String copy (strcpy)
global strcpy
; char* strcpy(char *dest, const char *src)
; Input:  rdi = dest, rsi = src
; Output: rax = dest
strcpy:
    push rbp
    mov rbp, rsp
    mov rax, rdi                ; Save original dest
.loop:
    mov cl, [rsi]               ; Load byte from src
    mov [rdi], cl               ; Store byte to dest
    inc rsi
    inc rdi
    test cl, cl                 ; Check if null
    jnz .loop                   ; Continue if not null (the terminator is copied too)
    pop rbp
    ret
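As a reference point for the two routines above, the same byte-by-byte loops can be written in C. This is a minimal sketch; the names my_strlen and my_strcpy are hypothetical and chosen only to avoid clashing with the C library versions.

```c
#include <stddef.h>

/* Hypothetical C counterpart of the strlen listing: count bytes up to,
   but not including, the terminating '\0'. */
size_t my_strlen(const char *str) {
    size_t length = 0;
    while (str[length] != '\0') {
        length++;
    }
    return length;
}

/* Hypothetical C counterpart of the strcpy listing: copy each byte,
   including the terminator, then return the original destination. */
char *my_strcpy(char *dest, const char *src) {
    char *original_dest = dest;
    char c;
    do {
        c = *src++;      /* load byte from src */
        *dest++ = c;     /* store byte to dest */
    } while (c != '\0'); /* stop after the terminator has been copied */
    return original_dest;
}
```

Like the assembly, neither function checks the size of the destination buffer, so the caller must guarantee that dest has room for the string and its terminator.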
Structures (Structs)
Accessing struct members uses offsets.
C Structure:
struct Point {
int x; // offset 0
int y; // offset 4
};
x64 Assembly:
section .data
point1:
dd 10 ; x = 10
dd 20 ; y = 20
section .text
global get_x
; int get_x(struct Point *p)
get_x:
    mov eax, [rdi]              ; Return p->x (offset 0)
    ret
global get_y
; int get_y(struct Point *p)
get_y:
    mov eax, [rdi + 4]          ; Return p->y (offset 4)
    ret
global set_point
; void set_point(struct Point *p, int x, int y)
; rdi = p, esi = x, edx = y
set_point:
    mov [rdi], esi              ; p->x = x
    mov [rdi + 4], edx          ; p->y = y
    ret
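Hard-coded offsets such as [rdi + 4] are only correct while they match the C compiler's layout. One way to guard against drift, sketched here under the assumption that int is 4 bytes on the target, is to assert the offsets at compile time with offsetof. The helper read_int_at is hypothetical and exists only to show an access through a raw byte offset.

```c
#include <stddef.h>
#include <string.h>

struct Point {
    int x;  /* expected at offset 0 */
    int y;  /* expected at offset 4 */
};

/* Compile-time checks: the build fails if the layout ever stops matching
   the offsets hard-coded in the assembly (assumes a 4-byte int). */
_Static_assert(offsetof(struct Point, x) == 0, "x must be at offset 0");
_Static_assert(offsetof(struct Point, y) == 4, "y must be at offset 4");
_Static_assert(sizeof(struct Point) == 8, "struct Point must be 8 bytes");

/* Hypothetical helper: read an int at a byte offset from a base pointer,
   the C analogue of mov eax, [rdi + offset]. */
int read_int_at(const void *base, size_t offset) {
    int value;
    memcpy(&value, (const unsigned char *)base + offset, sizeof value);
    return value;
}
```

With a struct Point initialized to {10, 20}, read_int_at(&p, 4) returns 20, matching what get_y loads.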
Stack Frames and Function Calls
Standard Function Prologue and Epilogue
x64:
function:
; Prologue
push rbp ; Save old frame pointer
mov rbp, rsp ; Set new frame pointer
sub rsp, 32 ; Allocate 32 bytes local space
; Function body
; [rbp - 8] = local variable 1
; [rbp - 16] = local variable 2
; Epilogue
mov rsp, rbp ; Restore stack pointer
pop rbp ; Restore frame pointer
ret
ARM:
function:
@ Prologue
push {fp, lr} @ Save frame pointer and return address
mov fp, sp @ Set frame pointer
sub sp, sp, #16 @ Allocate local space
@ Function body
@ Epilogue
mov sp, fp @ Restore stack pointer
pop {fp, pc} @ Restore and return
Calling Conventions in Practice
x64 System V (Linux/macOS):
; int add_three(int a, int b, int c)
add_three:
mov eax, edi ; a is in edi
add eax, esi ; b is in esi
add eax, edx ; c is in edx
ret
; Calling add_three(5, 10, 15)
main:
    sub rsp, 8                  ; Keep rsp 16-byte aligned at the call
    mov edi, 5                  ; First argument
    mov esi, 10                 ; Second argument
    mov edx, 15                 ; Third argument
    call add_three              ; Result in eax (30)
    add rsp, 8                  ; Undo the alignment padding
    ret
x64 Microsoft (Windows):
; int add_three(int a, int b, int c)
add_three:
mov eax, ecx ; a is in ecx
add eax, edx ; b is in edx
add eax, r8d ; c is in r8d
ret
; Calling requires shadow space
main:
    sub rsp, 40                 ; 32 bytes of shadow space + 8 to keep rsp 16-byte aligned
    mov ecx, 5
    mov edx, 10
    mov r8d, 15
    call add_three
    add rsp, 40                 ; Release shadow space and padding
    ret
Inline Assembly in C
Mixing C and assembly for critical sections.
x64 GCC/Clang Inline Assembly
Basic Syntax:
asm("assembly code"
: output operands
: input operands
: clobbered registers
);
Examples:
#include <stdint.h>  // uint32_t, uint64_t
// Read CPU timestamp counter
static inline uint64_t rdtsc(void) {
uint32_t lo, hi;
asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
return ((uint64_t)hi << 32) | lo;
}
// Atomic increment
static inline void atomic_inc(int *ptr) {
    asm volatile("lock incl %0" : "+m"(*ptr) : : "memory");
}
// Get CPU ID
static inline void cpuid(uint32_t code, uint32_t *a, uint32_t *d) {
    asm volatile("cpuid"
                 : "=a"(*a), "=d"(*d)
                 : "a"(code)
                 : "ebx", "ecx");
}
// Spinlock
static inline void spin_lock(volatile int *lock) {
    asm volatile(
        "1: mov $1, %%eax\n"
        "   xchg %%eax, %0\n"
        "   test %%eax, %%eax\n"
        "   jnz 1b\n"
        : "+m"(*lock)
        :
        : "eax", "memory"
    );
}
ARM Inline Assembly
// Get current program status register
static inline uint32_t get_cpsr(void) {
uint32_t cpsr;
asm volatile("mrs %0, cpsr" : "=r"(cpsr));
return cpsr;
}
// Memory barrier
static inline void memory_barrier(void) {
    asm volatile("dmb" ::: "memory");
}
// Enable interrupts
static inline void enable_interrupts(void) {
    asm volatile("cpsie i");
}
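Where portability matters more than the exact instruction sequence, the atomic increment and spinlock shown above can also be expressed with C11 <stdatomic.h> instead of inline assembly. This swaps in a different technique from the one the chapter demonstrates; it is a minimal sketch, and the portable_ names are hypothetical.

```c
#include <stdatomic.h>

/* Hypothetical portable counterpart of atomic_inc: an atomic read-modify-write. */
void portable_atomic_inc(atomic_int *value) {
    atomic_fetch_add_explicit(value, 1, memory_order_seq_cst);
}

/* Hypothetical try-lock: returns 1 if the lock was acquired, 0 if it was
   already held. test-and-set plays the role of xchg in the assembly. */
int portable_spin_trylock(atomic_flag *lock) {
    return !atomic_flag_test_and_set_explicit(lock, memory_order_acquire);
}

/* Spin until the flag was previously clear, i.e. until we own the lock. */
void portable_spin_lock(atomic_flag *lock) {
    while (!portable_spin_trylock(lock)) {
        /* busy-wait */
    }
}

/* Release the lock so another thread's test-and-set can succeed. */
void portable_spin_unlock(atomic_flag *lock) {
    atomic_flag_clear_explicit(lock, memory_order_release);
}
```

The compiler picks the instructions for each target (for example a locked instruction on x86 or an exclusive load/store pair on ARM), which is the main trade-off against hand-written inline assembly: less control, but one source for every architecture.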
Advanced Control Flow
Switch Statements (Jump Tables)
Efficient switch implementation using jump tables.
C Code:
int handle_command(int cmd) {
switch(cmd) {
case 0: return do_read();
case 1: return do_write();
case 2: return do_close();
default: return -1;
}
}
x64 Assembly with Jump Table:
section .text
global handle_command
extern do_read, do_write, do_close
handle_command:
    ; edi contains cmd
    cmp edi, 2                  ; Unsigned check: cmd > 2 (also catches negative values)
    ja .default                 ; If yes, default case
    mov edi, edi                ; Zero-extend cmd into rdi (upper 32 bits are undefined)
    lea rax, [rel jump_table]   ; Load jump table address
    mov rax, [rax + rdi*8]      ; Get handler address
    jmp rax                     ; Jump to handler
.case0:
    jmp do_read                 ; Tail call: do_read returns directly to our caller
.case1:
    jmp do_write
.case2:
    jmp do_close
.default:
    mov eax, -1
    ret
section .rodata
jump_table:
    dq handle_command.case0
    dq handle_command.case1
    dq handle_command.case2
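The same dispatch can be written explicitly in C as an array of function pointers, which is roughly the structure a compiler may generate for a dense switch. This is a sketch: the bodies of do_read, do_write, and do_close are hypothetical stand-ins that return distinct values so the dispatch can be observed.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the handlers named in the chapter. */
int do_read(void)  { return 100; }
int do_write(void) { return 200; }
int do_close(void) { return 300; }

typedef int (*command_handler)(void);

/* One entry per case label, indexed directly by cmd. */
static const command_handler jump_table[] = { do_read, do_write, do_close };

int handle_command(int cmd) {
    size_t table_size = sizeof jump_table / sizeof jump_table[0];

    /* Casting to unsigned makes negative values look huge, so a single
       comparison rejects both cmd < 0 and cmd > 2, like cmp/ja above. */
    if ((unsigned)cmd >= table_size) {
        return -1;                 /* default case */
    }
    return jump_table[cmd]();      /* indirect call through the table */
}
```

The range check is what keeps the table lookup safe; without it, an out-of-range cmd would load an arbitrary address and jump to it.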
Bitwise Operations and Flags
Setting, Clearing, Testing Bits
; Set bit n in register
; Input: eax = value, ecx = bit number
set_bit:
bts eax, ecx ; Bit test and set
ret
; Clear bit n
clear_bit:
    btr eax, ecx                ; Bit test and reset
    ret
; Test bit n (result in CF flag)
test_bit:
    bt eax, ecx                 ; Bit test
    setc al                     ; Set al if CF=1
    movzx eax, al
    ret
; Count leading zeros
count_leading_zeros:
    bsr ecx, eax                ; Bit scan reverse (ZF=1 if eax is zero)
    jz .zero
    xor ecx, 31                 ; Convert bit index to leading-zero count
    mov eax, ecx
    ret
.zero:
    mov eax, 32
    ret
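For checking the assembly routines against known-good results, portable C equivalents are handy. The sketch below mirrors the four routines for 32-bit values; masking the bit number with 31 is an assumption made to match how the register forms of bts, btr, and bt treat bit offsets.

```c
#include <stdint.h>

/* Set bit n (n is taken modulo 32, as the register form of bts does). */
uint32_t set_bit(uint32_t value, unsigned n) {
    return value | (UINT32_C(1) << (n & 31));
}

/* Clear bit n. */
uint32_t clear_bit(uint32_t value, unsigned n) {
    return value & ~(UINT32_C(1) << (n & 31));
}

/* Return 1 if bit n is set, 0 otherwise. */
int test_bit(uint32_t value, unsigned n) {
    return (int)((value >> (n & 31)) & 1u);
}

/* Count leading zeros by shifting left until the top bit is set.
   Returns 32 for zero, matching the .zero path in the assembly. */
int count_leading_zeros(uint32_t value) {
    int count = 0;
    if (value == 0) {
        return 32;
    }
    while ((value & UINT32_C(0x80000000)) == 0) {
        value <<= 1;
        count++;
    }
    return count;
}
```

The loop in count_leading_zeros is deliberately simple rather than fast; its purpose is to serve as a reference when testing the bsr-based version.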
Working with Flags
; Save and restore flags
save_flags:
pushf ; Push flags onto stack
pop rax ; Pop into rax
ret
restore_flags:
    push rdi                    ; rdi contains flags
    popf                        ; Pop into flags register
    ret
; Conditional moves based on flags
conditional_max:
    ; max(eax, ebx) -> eax
    cmp eax, ebx
    cmovl eax, ebx              ; if (eax < ebx) eax = ebx
    ret
Loops and Optimization
Loop Unrolling
Trade code size for speed by reducing loop overhead.
Standard Loop:
; Sum array of 1000 elements
sum_array:
xor eax, eax
xor ecx, ecx
.loop:
add eax, [rdi + rcx*4]
inc rcx
cmp rcx, 1000
jl .loop
ret
Unrolled Loop (4x):
sum_array_unrolled:
xor eax, eax
xor ecx, ecx
.loop:
add eax, [rdi + rcx*4] ; Iteration 1
add eax, [rdi + rcx*4 + 4] ; Iteration 2
add eax, [rdi + rcx*4 + 8] ; Iteration 3
add eax, [rdi + rcx*4 + 12] ; Iteration 4
add rcx, 4
cmp rcx, 1000
jl .loop
ret
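The unrolled listing works because 1000 is a multiple of 4. For an arbitrary element count, an unrolled loop also needs a cleanup loop for the leftover elements. A minimal C sketch of that pattern follows; the signature with an explicit count parameter is an assumption, not something the assembly provides.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum `count` integers, four per iteration, with a scalar tail loop for
   the 0 to 3 elements left over when count is not a multiple of 4. */
int32_t sum_array_unrolled(const int32_t *data, size_t count) {
    int32_t sum = 0;
    size_t i = 0;

    /* Main loop: runs only while at least four elements remain. */
    for (; i + 4 <= count; i += 4) {
        sum += data[i];
        sum += data[i + 1];
        sum += data[i + 2];
        sum += data[i + 3];
    }

    /* Tail loop: handles the remainder. */
    for (; i < count; i++) {
        sum += data[i];
    }
    return sum;
}
```

For the ten values 1 through 10 the main loop runs twice, the tail loop handles the last two elements, and the result is 55.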
SIMD Operations
Using vector instructions for parallel processing.
x64 SSE Example:
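; Add two arrays of floats (4 at a time)
; rdi = array1, rsi = array2, rdx = result, rcx = count
; Assumes count is a multiple of 4 and all three pointers are 16-byte aligned
; (movaps faults on unaligned addresses)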
add_arrays_sse:
xor rax, rax
.loop:
cmp rax, rcx
jge .done
movaps xmm0, [rdi + rax*4] ; Load 4 floats from array1
movaps xmm1, [rsi + rax*4] ; Load 4 floats from array2
addps xmm0, xmm1 ; Add 4 floats in parallel
movaps [rdx + rax*4], xmm0 ; Store result
add rax, 4 ; Process 4 elements per iteration
jmp .loop
.done: ret
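The same operation can be sketched in C with SSE intrinsics. Unlike the listing, this version uses unaligned loads and stores (_mm_loadu_ps, _mm_storeu_ps) and adds a scalar tail loop, so it makes no assumption about alignment or about the count being a multiple of 4. The function name add_arrays is hypothetical, and the #if guard falls back to plain scalar code on targets where the compiler does not define __SSE__.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_add_ps, ... */
#endif

/* result[i] = a[i] + b[i] for i in [0, count). */
void add_arrays(const float *a, const float *b, float *result, size_t count) {
    size_t i = 0;

#if defined(__SSE__)
    /* Vector loop: four floats per iteration, no alignment requirement. */
    for (; i + 4 <= count; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(result + i, _mm_add_ps(va, vb));
    }
#endif

    /* Scalar loop: the leftover elements, or everything without SSE. */
    for (; i < count; i++) {
        result[i] = a[i] + b[i];
    }
}
```

If the data is known to be 16-byte aligned, the aligned variants _mm_load_ps and _mm_store_ps correspond directly to movaps in the assembly.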
System Instructions
CPU Control Instructions
; Halt the CPU
halt:
hlt ; Halt until interrupt
ret
; No operation (delay/alignment)
nop_delay:
    nop
    nop
    nop
    ret
; Disable/enable interrupts (privileged: normally faults in user mode)
disable_interrupts:
    cli                         ; Clear interrupt flag
    ret
enable_interrupts:
    sti                         ; Set interrupt flag
    ret
; Read/write control registers (x64, ring 0 only)
read_cr0:
    mov rax, cr0
    ret
write_cr0:
    mov cr0, rdi
    ret
Port I/O (x86/x64)
; Read byte from I/O port
; Input: edi = port
; Output: al = value
inb:
mov edx, edi
in al, dx
ret
; Write byte to I/O port
; Input: edi = port, esi = value
outb:
    mov edx, edi
    mov eax, esi
    out dx, al
    ret
Interfacing Assembly with C
Calling C from Assembly
x64:
extern printf                   ; Declare external C function
section .data
format db "Value: %d", 10, 0
section .text
global asm_print
asm_print:
    push rbp                    ; Also realigns rsp to 16 bytes for the call
    mov rbp, rsp
    ; Call printf(format, 42)
    mov rdi, format             ; First arg: format string
    mov esi, 42                 ; Second arg: value
    xor eax, eax                ; No floating-point args
    call printf
    pop rbp
    ret
Calling Assembly from C
C code:
// Function declarations
extern int asm_add(int a, int b);
extern void asm_print_array(int *arr, int len);
int main(void) {
    int result = asm_add(5, 10);
    int numbers[] = {1, 2, 3, 4, 5};
    asm_print_array(numbers, 5);
    return 0;
}
Assembly implementation:
global asm_add
global asm_print_array
extern print_number             ; Provided by C code
; int asm_add(int a, int b)
asm_add:
    mov eax, edi
    add eax, esi
    ret
; void asm_print_array(int *arr, int len)
asm_print_array:
    push rbp
    mov rbp, rsp
    push rbx                    ; Callee-saved registers used below
    push r12
    push r13
    sub rsp, 8                  ; Pad so rsp is 16-byte aligned at the call
    mov rbx, rdi                ; arr
    movsxd r12, esi             ; len (sign-extend the 32-bit int)
    xor r13, r13                ; index
.loop:
    cmp r13, r12
    jge .done
    mov edi, [rbx + r13*4]
    call print_number           ; External C function
    inc r13
    jmp .loop
.done:
    add rsp, 8                  ; Remove the alignment padding
    pop r13
    pop r12
    pop rbx
    pop rbp
    ret
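The assembly above calls print_number but the chapter never defines it. A possible C definition to link against is sketched below. Both the body and the helper format_number are assumptions made for illustration; any function that takes a single int argument would work with the call site.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: format n followed by a newline into buf and return
   the number of characters written (excluding the terminating '\0'). */
int format_number(int n, char *buf, size_t size) {
    return snprintf(buf, size, "%d\n", n);
}

/* Hypothetical definition of the function asm_print_array calls.
   The value arrives in edi under the System V calling convention. */
void print_number(int n) {
    char buf[16];   /* large enough for any 32-bit int plus newline */
    format_number(n, buf, sizeof buf);
    fputs(buf, stdout);
}
```

Because print_number is an ordinary C function, it may overwrite any caller-saved register, which is why asm_print_array keeps its loop state in rbx, r12, and r13.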
Debugging Assembly Code
Using GDB
# Compile with debug symbols
# (-no-pie because the examples use absolute addresses such as [array + rcx*4])
nasm -f elf64 -g -F dwarf program.asm
gcc -g -no-pie -o program program.o
# Start GDB
gdb ./program
# GDB commands
(gdb) break main          # Set breakpoint
(gdb) run                 # Run program
(gdb) layout asm          # Show assembly view
(gdb) stepi               # Step one instruction
(gdb) info registers      # Show all registers
(gdb) x/10x $rsp          # Examine stack
(gdb) disassemble main    # Disassemble function
Common Debugging Techniques
; Add debug markers
debug_point:
nop ; Breakpoint location
nop
nop
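; Print register values (using C printf)
; Preserves rax, rdi, and rsi only; printf may clobber the other caller-saved registers
debug_print_rax:
    push rdi
    push rsi
    push rax                    ; Third push leaves rsp 16-byte aligned for the call
    mov rdi, debug_fmt
    mov rsi, rax
    xor eax, eax                ; No floating-point args
    call printf
    pop rax
    pop rsi
    pop rdi
    ret
section .data
debug_fmt: db "RAX = 0x%lx", 10, 0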
Key Concepts
- Arrays are accessed using base + index * element_size
- Strings are null-terminated character arrays
- Structures use fixed offsets for member access
- Stack frames manage local variables and function calls
- Inline assembly integrates assembly in C code
- Jump tables optimize switch statements
- Loop unrolling trades space for speed
- SIMD enables parallel data processing
- Port I/O accesses hardware devices (x86/x64)
Common Mistakes
- Wrong calling convention - Check ABI for your platform
- Stack misalignment - x64 ABIs require rsp to be 16-byte aligned at every call instruction
- Register clobbering - Save/restore callee-saved registers
- Incorrect offsets - Struct member offsets must match C
- Missing null checks - Dereference can crash
- Buffer overruns - Always bounds-check arrays
- Forgetting stack cleanup - Leads to stack corruption
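The stack alignment item in the list above comes down to simple arithmetic, sketched here in C. The helper names are hypothetical. The model assumes the x64 rule that rsp is a multiple of 16 immediately before a call, so it is 8 bytes off on function entry because the call pushed a return address.

```c
#include <stdint.h>

#define STACK_ALIGNMENT 16u

/* Round a local-variable size up to the next multiple of 16. */
uint64_t align_up(uint64_t size) {
    return (size + (STACK_ALIGNMENT - 1)) & ~(uint64_t)(STACK_ALIGNMENT - 1);
}

/* Hypothetical helper: given the bytes a function has pushed or allocated
   since entry, return the extra padding needed before it may issue a call.
   The constant 8 accounts for the return address already on the stack. */
uint64_t frame_padding(uint64_t bytes_used) {
    uint64_t offset = (8 + bytes_used) % STACK_ALIGNMENT;
    return (STACK_ALIGNMENT - offset) % STACK_ALIGNMENT;
}
```

For example, a function that has only pushed rbp (8 bytes) needs no padding, while one that has pushed four registers (32 bytes) needs 8 more bytes before calling anything.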
Debugging Tips
- Use debugger step-by-step - Don't guess, watch execution
- Check register values - Verify each computation
- Examine stack - Look for corruption or misalignment
- Add debug prints - Print intermediate values
- Compare with C version - Compile C to assembly and compare
- Test small - Verify each function independently
- Check flags - Conditional jumps depend on flags
Mini Exercises
- Implement strcmp in assembly (compare two strings)
- Write assembly function to reverse an array in-place
- Implement bubble sort in assembly
- Create a function that counts set bits (popcount)
- Write assembly to find min/max in an array
- Implement memcpy in assembly
- Create a factorial function (iterative and recursive)
- Write inline assembly to read CPU timestamp counter
- Implement a simple hash function
- Create assembly function to swap two integers using XOR
Review Questions
- How do you access the third element of an integer array in assembly?
- What's the difference between call and jmp?
- Why must you save callee-saved registers in a function?
- How does a jump table optimize switch statements?
- What's the purpose of function prologue and epilogue?
Reference Checklist
By the end of this chapter, you should be able to:
- Implement arrays and strings in assembly
- Access struct members correctly
- Write proper function prologues and epilogues
- Follow calling conventions for your architecture
- Use inline assembly in C
- Implement jump tables for switch statements
- Perform bitwise operations and flag manipulation
- Optimize loops with unrolling and SIMD
- Call C functions from assembly and vice versa
- Debug assembly code with GDB
Next Steps
With advanced assembly skills, you're ready to tackle memory management. The next chapter explores how operating systems manage memory: virtual memory, paging, segmentation, and the Memory Management Unit (MMU).
Key Takeaway: Advanced assembly programming requires understanding data structures, calling conventions, and optimization techniques. Combined with debugging skills and C integration, you can write and maintain low-level system code effectively.