Chapter 10: Advanced Assembly Programming
Introduction
Moving beyond basic assembly syntax, advanced assembly programming involves implementing complex algorithms, data structures, and interfacing with high-level code. This chapter explores real-world assembly programming techniques used in bootloaders, kernels, and performance-critical system components.
Why This Matters
While most code is written in C or higher-level languages, certain tasks require assembly: bootloader initialization, context switching, interrupt handlers, CPU-specific optimizations, and inline assembly in kernels. Understanding advanced assembly lets you work on the lowest-level system code.
How to Study This Chapter
- Practice incrementally - Start with simple programs, add complexity gradually
- Use debuggers - GDB and LLDB are essential for understanding assembly flow
- Compare with C - Compile C to assembly to see patterns
- Test thoroughly - Assembly bugs can be subtle and dangerous
Implementing Data Structures in Assembly
Arrays
Arrays in assembly are just consecutive memory locations.
x64 Example: Sum array elements
section .data
array dd 10, 20, 30, 40, 50 ; 5 integers
length equ 5
section .bss
sum resd 1 ; Reserve space for sum
section .text
global main
main:
push rbp
mov rbp, rsp
xor rax, rax ; sum = 0
xor rcx, rcx ; index = 0
.loop:
cmp rcx, length ; if (index >= length)
jge .done
add eax, [array + rcx*4] ; sum += array[index]
inc rcx ; index++
jmp .loop
.done:
mov [sum], eax ; Store result
pop rbp
ret
ARM Example: Array access
.data
array: .word 10, 20, 30, 40, 50
length: .word 5
.text
.global main
main:
push {lr}
ldr r1, =array @ r1 = array address
ldr r2, =length @ r2 = length address
ldr r2, [r2] @ r2 = length value
mov r0, #0 @ sum = 0
mov r3, #0 @ index = 0
loop:
cmp r3, r2 @ if (index >= length)
bge done
ldr r12, [r1, r3, lsl #2] @ r12 = array[index] (r12 is a scratch register; r4 is callee-saved and would have to be preserved)
add r0, r0, r12 @ sum += r12
add r3, r3, #1 @ index++
b loop
done:
pop {pc} @ Return
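Both listings compute each element's address as base + index * element_size. The same arithmetic can be written out by hand in C as a reference to test the assembly against. This is a minimal sketch, assuming 4-byte integers as in the listings; `sum_by_offset` is a hypothetical helper name, not part of the chapter's code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: sums `length` 32-bit integers, computing every
   element address explicitly, the way [array + rcx*4] does. */
int32_t sum_by_offset(const int32_t *array, size_t length) {
    const unsigned char *base = (const unsigned char *)array;
    int32_t sum = 0;
    for (size_t index = 0; index < length; index++) {
        int32_t element;
        /* address = base + index * element_size */
        memcpy(&element, base + index * sizeof(int32_t), sizeof element);
        sum += element;
    }
    return sum;
}
```

For the array used above (10, 20, 30, 40, 50) both the assembly and this helper should produce 150.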
Strings
Strings are null-terminated character arrays.
x64 Example: String length
section .text
global strlen
; size_t strlen(const char *str)
; Input: rdi = string pointer
; Output: rax = length
strlen:
xor rax, rax ; length = 0
.loop:
cmp byte [rdi + rax], 0 ; if (*str == '\0')
je .done
inc rax ; length++
jmp .loop
.done:
ret
String copy (strcpy)
global strcpy
; char* strcpy(char *dest, const char *src)
; Input: rdi = dest, rsi = src
; Output: rax = dest
strcpy:
push rbp
mov rbp, rsp
mov rax, rdi ; Save original dest
.loop:
mov cl, [rsi] ; Load byte from src
mov [rdi], cl ; Store byte to dest
inc rsi
inc rdi
test cl, cl ; Check if null
jnz .loop ; Continue if not null
pop rbp
ret
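When checking routines like these, it helps to have C versions that follow the same loop structure instruction by instruction. The sketch below is one way to write them; the `ref_` names are hypothetical and chosen only to avoid clashing with the C library.

```c
#include <stddef.h>

/* Reference for the strlen listing: count bytes until the terminator. */
size_t ref_strlen(const char *str) {
    size_t length = 0;
    while (str[length] != '\0')   /* cmp byte [rdi + rax], 0 */
        length++;                 /* inc rax */
    return length;
}

/* Reference for the strcpy listing: copy first, test afterwards, so the
   terminating null byte is copied as well. */
char *ref_strcpy(char *dest, const char *src) {
    char *original = dest;        /* mov rax, rdi */
    char c;
    do {
        c = *src++;               /* load byte from src, advance src */
        *dest++ = c;              /* store byte to dest, advance dest */
    } while (c != '\0');          /* test cl, cl / jnz .loop */
    return original;
}
```

Like the assembly, neither function checks the destination size; the caller must supply a buffer large enough for the string and its terminator.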
Structures (Structs)
Accessing struct members uses offsets.
C Structure:
struct Point {
int x; // offset 0
int y; // offset 4
};
x64 Assembly:
section .data
point1:
dd 10 ; x = 10
dd 20 ; y = 20
section .text
global get_x
; int get_x(struct Point *p)
get_x:
mov eax, [rdi] ; Return p->x (offset 0)
ret
global get_y
get_y:
mov eax, [rdi + 4] ; Return p->y (offset 4)
ret
global set_point
; void set_point(struct Point *p, int x, int y)
; rdi = p, esi = x, edx = y
set_point:
mov [rdi], esi ; p->x = x
mov [rdi + 4], edx ; p->y = y
ret
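Hard-coded offsets such as [rdi + 4] are only correct while they match the layout the C compiler actually uses. One way to guard against drift is to assert the offsets at compile time. The sketch below assumes a 4-byte int, as the listings do; `read_int_at` is a hypothetical helper that mimics a load from base + offset.

```c
#include <stddef.h>
#include <string.h>

struct Point {
    int x;
    int y;
};

/* The build fails if the layout differs from what the assembly assumes. */
_Static_assert(offsetof(struct Point, x) == 0, "get_x reads [rdi]");
_Static_assert(offsetof(struct Point, y) == 4, "get_y reads [rdi + 4]");
_Static_assert(sizeof(struct Point) == 8, "two 4-byte ints, no padding");

/* Hypothetical helper: read an int at a raw byte offset from a base pointer. */
int read_int_at(const void *base, size_t offset) {
    int value;
    memcpy(&value, (const unsigned char *)base + offset, sizeof value);
    return value;
}
```

Calling `read_int_at(&p, 4)` on a `struct Point p` returns the same value as `p.y`, which is exactly what get_y does.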
Stack Frames and Function Calls
Standard Function Prologue and Epilogue
x64:
function:
; Prologue
push rbp ; Save old frame pointer
mov rbp, rsp ; Set new frame pointer
sub rsp, 32 ; Allocate 32 bytes local space
; Function body
; [rbp - 8] = local variable 1
; [rbp - 16] = local variable 2
; Epilogue
mov rsp, rbp ; Restore stack pointer
pop rbp ; Restore frame pointer
ret
ARM:
function:
@ Prologue
push {fp, lr} @ Save frame pointer and return address
mov fp, sp @ Set frame pointer
sub sp, sp, #16 @ Allocate local space
@ Function body
@ Epilogue
mov sp, fp @ Restore stack pointer
pop {fp, pc} @ Restore and return
Calling Conventions in Practice
x64 System V (Linux/macOS):
; int add_three(int a, int b, int c)
add_three:
mov eax, edi ; a is in edi
add eax, esi ; b is in esi
add eax, edx ; c is in edx
ret
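; Calling add_three(5, 10, 15)
main:
sub rsp, 8 ; Realign: rsp must be 16-byte aligned at the call
mov edi, 5 ; First argument
mov esi, 10 ; Second argument
mov edx, 15 ; Third argument
call add_three
; Result in eax (30)
add rsp, 8 ; Undo the alignment adjustment
ret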
x64 Microsoft (Windows):
; int add_three(int a, int b, int c)
add_three:
mov eax, ecx ; a is in ecx
add eax, edx ; b is in edx
add eax, r8d ; c is in r8d
ret
; Calling requires 32 bytes of shadow space, plus 8 bytes to keep rsp 16-byte aligned
main:
sub rsp, 40 ; Shadow space (32) + alignment padding (8)
mov ecx, 5
mov edx, 10
mov r8d, 15
call add_three
add rsp, 40 ; Release shadow space and padding
ret
Inline Assembly in C
Inline assembly lets you drop into assembly for the few instructions C cannot express directly, without leaving the C source file.
x64 GCC/Clang Inline Assembly
Basic Syntax:
asm("assembly code"
: output operands
: input operands
: clobbered registers
);
Examples:
// Read CPU timestamp counter
static inline uint64_t rdtsc(void) {
uint32_t lo, hi;
asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
return ((uint64_t)hi << 32) | lo;
}
// Atomic increment
static inline void atomic_inc(int *ptr) {
asm volatile("lock incl %0"
: "+m"(*ptr)
:
: "memory");
}
// Get CPU ID
static inline void cpuid(uint32_t code, uint32_t *a, uint32_t *d) {
asm volatile("cpuid"
: "=a"(*a), "=d"(*d)
: "a"(code)
: "ebx", "ecx");
}
// Spinlock
static inline void spin_lock(volatile int *lock) {
asm volatile(
"1: mov $1, %%eax\n"
" xchg %%eax, %0\n"
" test %%eax, %%eax\n"
" jnz 1b\n"
: "+m"(*lock)
:
: "eax", "memory"
);
}
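The spinlock above is a test-and-set loop: atomically store 1, and retry while the previous value was already 1. The same protocol can be sketched in portable C11 atomics, which is useful for comparing behavior or for targets where the inline assembly does not apply. The chapter does not show an unlock routine, so `spin_unlock_c11` and `spin_trylock_c11` here are assumptions added for illustration, as is the `spinlock_t` typedef.

```c
#include <stdatomic.h>

/* Hypothetical portable counterparts to the inline-assembly spin_lock. */
typedef atomic_int spinlock_t;

/* Acquire: atomic_exchange stores 1 and returns the old value, like xchg.
   Keep retrying until the old value was 0 (the lock was free). */
static inline void spin_lock_c11(spinlock_t *lock) {
    while (atomic_exchange_explicit(lock, 1, memory_order_acquire) != 0) {
        /* spin */
    }
}

/* Single attempt: returns 1 if the lock was taken, 0 if it was already held. */
static inline int spin_trylock_c11(spinlock_t *lock) {
    return atomic_exchange_explicit(lock, 1, memory_order_acquire) == 0;
}

/* Release: a plain atomic store of 0 with release ordering. */
static inline void spin_unlock_c11(spinlock_t *lock) {
    atomic_store_explicit(lock, 0, memory_order_release);
}
```

The acquire and release orderings play the role of the "memory" clobber in the inline-assembly version: they keep accesses to the protected data from being moved outside the locked region.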
ARM Inline Assembly
// Get current program status register
static inline uint32_t get_cpsr(void) {
uint32_t cpsr;
asm volatile("mrs %0, cpsr" : "=r"(cpsr));
return cpsr;
}
// Memory barrier
static inline void memory_barrier(void) {
asm volatile("dmb" ::: "memory");
}
// Enable interrupts
static inline void enable_interrupts(void) {
asm volatile("cpsie i" ::: "memory"); // "memory" keeps the compiler from moving accesses across this point
}
Advanced Control Flow
Switch Statements (Jump Tables)
Efficient switch implementation using jump tables.
C Code:
int handle_command(int cmd) {
switch(cmd) {
case 0: return do_read();
case 1: return do_write();
case 2: return do_close();
default: return -1;
}
}
x64 Assembly with Jump Table:
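section .text
global handle_command
extern do_read, do_write, do_close ; Handlers defined elsewhere
handle_command:
; edi contains cmd
cmp edi, 2 ; Unsigned check: cmd > 2 (also catches negative cmd)
ja .default ; If yes, default case
mov edi, edi ; Zero-extend cmd: the upper half of rdi is undefined for an int
lea rax, [rel jump_table] ; Load jump table address
mov rax, [rax + rdi*8] ; Get handler address
jmp rax ; Jump to handler
.case0:
jmp do_read ; Tail call: do_read returns straight to our caller
.case1:
jmp do_write
.case2:
jmp do_close
.default:
mov eax, -1
ret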
section .rodata
jump_table:
dq handle_command.case0
dq handle_command.case1
dq handle_command.case2
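In C terms, a jump table is an array of code addresses indexed by the switch value, guarded by a single range check. The sketch below expresses that with function pointers. The real do_read, do_write, and do_close are not shown in the chapter, so the stubs here return made-up values purely so the dispatch can be observed.

```c
/* Stub handlers with arbitrary return values (assumptions for illustration). */
static int do_read(void)  { return 100; }
static int do_write(void) { return 200; }
static int do_close(void) { return 300; }

typedef int (*handler_fn)(void);

/* The table: one entry per case label, in case order. */
static const handler_fn jump_table[] = { do_read, do_write, do_close };

int handle_command(int cmd) {
    /* One unsigned comparison rejects both cmd > 2 and negative cmd,
       which is what the cmp/ja pair does in the assembly. */
    if ((unsigned)cmd >= sizeof jump_table / sizeof jump_table[0])
        return -1;
    return jump_table[cmd]();   /* indexed load, then indirect call */
}
```

The cost of dispatch is the same for every case, which is why compilers tend to choose this form for dense case ranges and fall back to compare-and-branch chains for sparse ones.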
Bitwise Operations and Flags
Setting, Clearing, Testing Bits
; Set bit n in register
; Input: eax = value, ecx = bit number
set_bit:
bts eax, ecx ; Bit test and set
ret
; Clear bit n
clear_bit:
btr eax, ecx ; Bit test and reset
ret
; Test bit n (result in CF flag)
test_bit:
bt eax, ecx ; Bit test
setc al ; Set al if CF=1
movzx eax, al
ret
; Count leading zeros
count_leading_zeros:
bsr ecx, eax ; Bit scan reverse
jz .zero
xor ecx, 31 ; Convert to leading zeros
mov eax, ecx
ret
.zero:
mov eax, 32
ret
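The routines above can be checked against plain C equivalents. The following is a sketch with hypothetical `ref_` names; the bit number is masked to 0..31 because bts, btr, and bt with a 32-bit register destination use the bit offset modulo 32.

```c
#include <stdint.h>

uint32_t ref_set_bit(uint32_t value, unsigned n) {
    return value | (UINT32_C(1) << (n & 31));    /* bts */
}

uint32_t ref_clear_bit(uint32_t value, unsigned n) {
    return value & ~(UINT32_C(1) << (n & 31));   /* btr */
}

int ref_test_bit(uint32_t value, unsigned n) {
    return (int)((value >> (n & 31)) & 1);       /* bt + setc */
}

/* Count leading zeros: find the highest set bit (what bsr returns),
   then convert the index to a count. Returns 32 for an input of 0. */
unsigned ref_clz32(uint32_t value) {
    if (value == 0)
        return 32;
    unsigned highest = 31;
    while (((value >> highest) & 1) == 0)
        highest--;
    return 31 - highest;   /* equal to highest ^ 31 for indices 0..31 */
}
```

The final line shows why the assembly can use xor ecx, 31 instead of a subtraction: for any index from 0 to 31, flipping the low five bits gives 31 minus the index.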
Working with Flags
; Save and restore flags
save_flags:
pushfq ; Push the 64-bit RFLAGS register onto the stack
pop rax ; Pop into rax
ret
restore_flags:
push rdi ; rdi contains flags
popfq ; Pop into RFLAGS
ret
; Conditional moves based on flags
conditional_max:
; max(eax, ebx) -> eax
cmp eax, ebx
cmovl eax, ebx ; if (eax < ebx) eax = ebx
ret
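A conditional move selects between two values without branching. The effect can be imitated in C with a mask, which makes the selection explicit. This is only an illustration of the idea, with a hypothetical function name; compilers typically emit cmov on their own for a simple ternary expression.

```c
#include <stdint.h>

/* Branch-free max: mask is all ones when a < b, all zeros otherwise,
   so exactly one of the two operands survives the AND. */
int32_t branchless_max(int32_t a, int32_t b) {
    int32_t mask = -(int32_t)(a < b);
    return (a & ~mask) | (b & mask);
}
```

As with cmovl, the comparison is signed, so `branchless_max(-1, -2)` yields -1.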
Loops and Optimization
Loop Unrolling
Trade code size for speed by reducing loop overhead.
Standard Loop:
; Sum array of 1000 elements
sum_array:
xor eax, eax
xor ecx, ecx
.loop:
add eax, [rdi + rcx*4]
inc rcx
cmp rcx, 1000
jl .loop
ret
Unrolled Loop (4x):
sum_array_unrolled:
xor eax, eax
xor ecx, ecx
.loop:
add eax, [rdi + rcx*4] ; Iteration 1
add eax, [rdi + rcx*4 + 4] ; Iteration 2
add eax, [rdi + rcx*4 + 8] ; Iteration 3
add eax, [rdi + rcx*4 + 12] ; Iteration 4
add rcx, 4
cmp rcx, 1000
jl .loop
ret
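The unrolled listing works because 1000 is a multiple of 4. For an arbitrary element count, an unrolled loop needs a cleanup loop for the leftover elements. The C sketch below shows that structure; `sum_unrolled` is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

int32_t sum_unrolled(const int32_t *data, size_t count) {
    int32_t sum = 0;
    size_t i = 0;
    size_t unrolled_end = count - (count % 4);   /* largest multiple of 4 <= count */

    /* Main loop: four elements per iteration, one compare and branch. */
    for (; i < unrolled_end; i += 4) {
        sum += data[i];
        sum += data[i + 1];
        sum += data[i + 2];
        sum += data[i + 3];
    }

    /* Cleanup loop: the remaining 0 to 3 elements. */
    for (; i < count; i++)
        sum += data[i];

    return sum;
}
```

In assembly the same split applies: run the 4x body while at least four elements remain, then finish with the standard loop.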
SIMD Operations
Using vector instructions for parallel processing.
x64 SSE Example:
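; Add two arrays of floats (4 at a time)
; rdi = array1, rsi = array2, rdx = result, rcx = count
; Assumes count is a multiple of 4 and all three arrays are 16-byte aligned
; (movaps faults on unaligned addresses; movups accepts any alignment)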
add_arrays_sse:
xor rax, rax
.loop:
cmp rax, rcx
jge .done
movaps xmm0, [rdi + rax*4] ; Load 4 floats from array1
movaps xmm1, [rsi + rax*4] ; Load 4 floats from array2
addps xmm0, xmm1 ; Add 4 floats in parallel
movaps [rdx + rax*4], xmm0 ; Store result
add rax, 4 ; Process 4 elements per iteration
jmp .loop
.done:
ret
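The same operation can be sketched in C with SSE intrinsics, which makes it easier to lift the restrictions of the assembly version: unaligned loads and stores remove the alignment requirement, and a scalar tail loop handles counts that are not a multiple of 4. The function name `add_arrays` is an assumption, and the intrinsics path is guarded so the code still builds on targets without SSE.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

void add_arrays(const float *a, const float *b, float *result, size_t count) {
    size_t i = 0;
#if defined(__SSE__)
    /* Vector loop: 4 floats per iteration, any alignment (movups). */
    for (; i + 4 <= count; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(result + i, _mm_add_ps(va, vb));   /* addps */
    }
#endif
    /* Scalar tail: leftover elements, or everything when SSE is unavailable. */
    for (; i < count; i++)
        result[i] = a[i] + b[i];
}
```

If the data is known to be 16-byte aligned, `_mm_load_ps` and `_mm_store_ps` correspond to the movaps instructions used in the listing.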
System Instructions
CPU Control Instructions
; Halt the CPU
halt:
hlt ; Halt until interrupt
ret
; No operation (delay/alignment)
nop_delay:
nop
nop
nop
ret
; Disable/enable interrupts
disable_interrupts:
cli ; Clear interrupt flag
ret
enable_interrupts:
sti ; Set interrupt flag
ret
; Read/write control registers (x64)
read_cr0:
mov rax, cr0
ret
write_cr0:
mov cr0, rdi
ret
Port I/O (x86/x64)
; Read byte from I/O port
; Input: edi = port
; Output: al = value
inb:
mov edx, edi
in al, dx
ret
; Write byte to I/O port
; Input: edi = port, esi = value
outb:
mov edx, edi
mov eax, esi
out dx, al
ret
Interfacing Assembly with C
Calling C from Assembly
x64:
extern printf ; Declare external C function
section .data
format db "Value: %d", 10, 0
section .text
global asm_print
asm_print:
push rbp
mov rbp, rsp
; Call printf(format, 42)
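lea rdi, [rel format] ; First arg: format string (RIP-relative, so it also works in PIE builds)
mov esi, 42 ; Second arg: value
xor eax, eax ; al = number of vector registers used for varargs (none)
call printf wrt ..plt ; Call through the PLT so the object also links as PIE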
pop rbp
ret
Calling Assembly from C
C code:
// Function declarations
extern int asm_add(int a, int b);
extern void asm_print_array(int *arr, int len);
int main() {
int result = asm_add(5, 10);
int numbers[] = {1, 2, 3, 4, 5};
asm_print_array(numbers, 5);
return 0;
}
Assembly implementation:
global asm_add
global asm_print_array
; int asm_add(int a, int b)
asm_add:
mov eax, edi
add eax, esi
ret
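; void asm_print_array(int *arr, int len)
extern print_number ; External C function, receives the value in edi
asm_print_array:
push rbp
mov rbp, rsp
push rbx
push r12
push r13
sub rsp, 8 ; Pad so rsp is 16-byte aligned at the call below
mov rbx, rdi ; arr
movsxd r12, esi ; len (sign-extend: the upper half of rsi is undefined for an int)
xor r13, r13 ; index
.loop:
cmp r13, r12
jge .done
mov edi, [rbx + r13*4]
call print_number
inc r13
jmp .loop
.done:
add rsp, 8 ; Remove alignment padding
pop r13
pop r12
pop rbx
pop rbp
ret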
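The listing calls print_number, which the chapter does not define. A possible C definition is sketched below, together with C reference versions of the two assembly routines for side-by-side testing. The signature `void print_number(int)` and the `ref_` names are assumptions.

```c
#include <stdio.h>

/* Hypothetical definition of the helper that asm_print_array calls. */
void print_number(int value) {
    printf("%d\n", value);
}

/* C reference for asm_add. */
int ref_add(int a, int b) {
    return a + b;
}

/* C reference for asm_print_array: same loop, same call per element. */
void ref_print_array(const int *arr, int len) {
    for (int index = 0; index < len; index++)
        print_number(arr[index]);
}
```

Linking the assembly object together with a C file containing `print_number` lets you run `asm_print_array` and `ref_print_array` on the same input and compare the output line by line.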
Debugging Assembly Code
Using GDB
# Compile with debug symbols
nasm -f elf64 -g -F dwarf program.asm
gcc -g -no-pie -o program program.o # -no-pie: listings that use absolute addresses such as [array + rcx*4] do not link as PIE
# Start GDB
gdb ./program
# GDB commands
(gdb) break main # Set breakpoint
(gdb) run # Run program
(gdb) layout asm # Show assembly view
(gdb) stepi # Step one instruction
(gdb) info registers # Show all registers
(gdb) x/10gx $rsp # Examine 10 quadwords on the stack
(gdb) disassemble main # Disassemble function
Common Debugging Techniques
; Add debug markers
debug_point:
nop ; Breakpoint location
nop
nop
; Print register values (using C printf)
; Caution: printf may clobber rcx, rdx, and r8-r11; save those too if the caller depends on them
debug_print_rax:
push rdi
push rsi
push rax
mov rdi, debug_fmt
mov rsi, rax
xor eax, eax
call printf
pop rax
pop rsi
pop rdi
ret
section .data
debug_fmt: db "RAX = 0x%lx", 10, 0
Key Concepts
- Arrays are accessed using base + index * element_size
- Strings are null-terminated character arrays
- Structures use fixed offsets for member access
- Stack frames manage local variables and function calls
- Inline assembly integrates assembly in C code
- Jump tables optimize switch statements
- Loop unrolling trades space for speed
- SIMD enables parallel data processing
- Port I/O accesses hardware devices (x86/x64)
Common Mistakes
- Wrong calling convention - Check ABI for your platform
- Stack misalignment - The x64 ABIs require rsp to be 16-byte aligned at every call instruction
- Register clobbering - Save/restore callee-saved registers
- Incorrect offsets - Struct member offsets must match C
- Missing null checks - Dereference can crash
- Buffer overruns - Always bounds-check arrays
- Forgetting stack cleanup - Leads to stack corruption
Debugging Tips
- Use debugger step-by-step - Don't guess, watch execution
- Check register values - Verify each computation
- Examine stack - Look for corruption or misalignment
- Add debug prints - Print intermediate values
- Compare with C version - Compile C to assembly and compare
- Test small - Verify each function independently
- Check flags - Conditional jumps depend on flags
Mini Exercises
- Implement strcmp in assembly (compare two strings)
- Write an assembly function to reverse an array in-place
- Implement bubble sort in assembly
- Create a function that counts set bits (popcount)
- Write assembly to find min/max in an array
- Implement memcpy in assembly
- Create a factorial function (iterative and recursive)
- Write inline assembly to read CPU timestamp counter
- Implement a simple hash function
- Create assembly function to swap two integers using XOR
Review Questions
- How do you access the third element of an integer array in assembly?
- What's the difference between call and jmp?
- Why must you save callee-saved registers in a function?
- How does a jump table optimize switch statements?
- What's the purpose of function prologue and epilogue?
Reference Checklist
By the end of this chapter, you should be able to:
- Implement arrays and strings in assembly
- Access struct members correctly
- Write proper function prologues and epilogues
- Follow calling conventions for your architecture
- Use inline assembly in C
- Implement jump tables for switch statements
- Perform bitwise operations and flag manipulation
- Optimize loops with unrolling and SIMD
- Call C functions from assembly and vice versa
- Debug assembly code with GDB
Next Steps
With advanced assembly skills, you're ready to tackle memory management. The next chapter explores how operating systems manage memory: virtual memory, paging, segmentation, and the Memory Management Unit (MMU).
Key Takeaway: Advanced assembly programming requires understanding data structures, calling conventions, and optimization techniques. Combined with debugging skills and C integration, you can write and maintain low-level system code effectively.