Chapter 10: Advanced Assembly Programming

Introduction

Moving beyond basic assembly syntax, advanced assembly programming involves implementing complex algorithms, data structures, and interfacing with high-level code. This chapter explores real-world assembly programming techniques used in bootloaders, kernels, and performance-critical system components.

Why This Matters

While most code is written in C or higher-level languages, certain tasks require assembly: bootloader initialization, context switching, interrupt handlers, CPU-specific optimizations, and inline assembly in kernels. Understanding advanced assembly makes you capable of working on the lowest-level system code.

How to Study This Chapter

  1. Practice incrementally - Start with simple programs, add complexity gradually
  2. Use debuggers - GDB and LLDB are essential for understanding assembly flow
  3. Compare with C - Compile C to assembly to see patterns
  4. Test thoroughly - Assembly bugs can be subtle and dangerous

Implementing Data Structures in Assembly

Arrays

Arrays in assembly are just consecutive memory locations: element i lives at base + i * element_size.

x64 Example: Sum array elements

section .data
    array dd 10, 20, 30, 40, 50    ; 5 integers
    length equ 5

section .bss
    sum resd 1                      ; Reserve space for sum

section .text
global main

main:
    push rbp
    mov rbp, rsp

    xor rax, rax                    ; sum = 0
    xor rcx, rcx                    ; index = 0

.loop:
    cmp rcx, length                 ; if (index >= length)
    jge .done

    add eax, [array + rcx*4]        ; sum += array[index]
    inc rcx                         ; index++
    jmp .loop

.done:
    mov [sum], eax                  ; Store result

    pop rbp
    ret

ARM Example: Array access

.data
    array: .word 10, 20, 30, 40, 50
    length: .word 5

.text
.global main

main:
    push {r4, lr}                   @ r4 is callee-saved; two registers keep sp 8-byte aligned

    ldr r1, =array                  @ r1 = array address
    ldr r2, =length                 @ r2 = length address
    ldr r2, [r2]                    @ r2 = length value

    mov r0, #0                      @ sum = 0
    mov r3, #0                      @ index = 0

loop:
    cmp r3, r2                      @ if (index >= length)
    bge done

    ldr r4, [r1, r3, lsl #2]        @ r4 = array[index]
    add r0, r0, r4                  @ sum += r4
    add r3, r3, #1                  @ index++
    b loop

done:
    pop {r4, pc}                    @ Restore r4 and return (sum in r0)
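
The base + index * element_size arithmetic in both listings can be cross-checked from C. The following is a minimal sketch, not part of the original listings: `sum_by_address` is a hypothetical helper name, and it computes each element address by hand instead of writing `array[index]`, mirroring `[array + rcx*4]` on x64 and `[r1, r3, lsl #2]` on ARM.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum `length` 32-bit integers, computing every element address explicitly
 * as base + index * element_size (hypothetical helper for illustration). */
int32_t sum_by_address(const int32_t *array, size_t length) {
    const unsigned char *base = (const unsigned char *)array;
    int32_t sum = 0;

    for (size_t index = 0; index < length; index++) {
        /* Same arithmetic the scaled addressing mode performs in hardware */
        const int32_t *element =
            (const int32_t *)(base + index * sizeof(int32_t));
        sum += *element;
    }
    return sum;
}
```

Compiling a function like this with optimizations disabled and reading the generated assembly is one way to see how a compiler expresses the same indexing.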

Strings

Strings are null-terminated character arrays.

x64 Example: String length

section .text
global strlen

; size_t strlen(const char *str)
; Input: rdi = string pointer
; Output: rax = length

strlen:
    xor rax, rax                    ; length = 0

.loop:
    cmp byte [rdi + rax], 0         ; if (str[length] == '\0')
    je .done
    inc rax                         ; length++
    jmp .loop

.done:
    ret

String copy (strcpy)

global strcpy

; char* strcpy(char *dest, const char *src)
; Input: rdi = dest, rsi = src
; Output: rax = dest

strcpy:
    push rbp
    mov rbp, rsp

    mov rax, rdi                    ; Save original dest

.loop:
    mov cl, [rsi]                   ; Load byte from src
    mov [rdi], cl                   ; Store byte to dest
    inc rsi
    inc rdi
    test cl, cl                     ; Check if null
    jnz .loop                       ; Continue if not null

    pop rbp
    ret
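
For comparison, the two string routines above can be written as C loops with the same shape. This is a reference sketch under assumed names (`ref_strlen` and `ref_strcpy` are hypothetical, chosen to avoid clashing with the C library); it is handy as an oracle when testing the assembly versions.

```c
#include <stddef.h>

/* Count bytes until the terminating '\0', like the strlen loop above. */
size_t ref_strlen(const char *str) {
    size_t length = 0;
    while (str[length] != '\0') {
        length++;
    }
    return length;
}

/* Copy bytes including the terminator, testing the byte after storing it,
 * which matches the load / store / test order of the assembly loop. */
char *ref_strcpy(char *dest, const char *src) {
    char *original = dest;   /* return value is the original dest */
    char c;
    do {
        c = *src++;
        *dest++ = c;
    } while (c != '\0');
    return original;
}
```

Note that, like the assembly, neither function checks the destination size; the caller must guarantee the buffer is large enough.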

Structures (Structs)

Accessing struct members uses offsets.

C Structure:

struct Point {
    int x;      // offset 0
    int y;      // offset 4
};

x64 Assembly:

section .data
    point1:
        dd 10           ; x = 10
        dd 20           ; y = 20

section .text
global get_x

; int get_x(struct Point *p)
get_x:
    mov eax, [rdi]      ; Return p->x (offset 0)
    ret

global get_y
get_y:
    mov eax, [rdi + 4]  ; Return p->y (offset 4)
    ret

global set_point
; void set_point(struct Point *p, int x, int y)
; rdi = p, esi = x, edx = y
set_point:
    mov [rdi], esi      ; p->x = x
    mov [rdi + 4], edx  ; p->y = y
    ret
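
The offsets hard-coded in the assembly must agree with what the C compiler actually lays out. A minimal sketch of how that could be checked at compile time, assuming a C11 compiler and a 4-byte `int` (`get_y_by_offset` is a hypothetical helper, not part of the listing above):

```c
#include <stddef.h>
#include <string.h>

struct Point {
    int x;
    int y;
};

/* Fail the build if the layout differs from what the assembly assumes. */
_Static_assert(offsetof(struct Point, x) == 0, "x expected at offset 0");
_Static_assert(offsetof(struct Point, y) == sizeof(int),
               "y expected immediately after x");

/* Read p->y through base + offset, the way `mov eax, [rdi + 4]` does. */
int get_y_by_offset(const struct Point *p) {
    int value;
    memcpy(&value,
           (const unsigned char *)p + offsetof(struct Point, y),
           sizeof value);
    return value;
}
```

For larger structs with mixed member sizes, padding makes hand-computed offsets error-prone, so checks like these are worth keeping next to the assembly.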

Stack Frames and Function Calls

Standard Function Prologue and Epilogue

x64:

function:
    ; Prologue
    push rbp                ; Save old frame pointer
    mov rbp, rsp            ; Set new frame pointer
    sub rsp, 32             ; Allocate 32 bytes local space

    ; Function body
    ; [rbp - 8] = local variable 1
    ; [rbp - 16] = local variable 2

    ; Epilogue
    mov rsp, rbp            ; Restore stack pointer
    pop rbp                 ; Restore frame pointer
    ret

ARM:

function:
    @ Prologue
    push {fp, lr}           @ Save frame pointer and return address
    mov fp, sp              @ Set frame pointer
    sub sp, sp, #16         @ Allocate local space

    @ Function body

    @ Epilogue
    mov sp, fp              @ Restore stack pointer
    pop {fp, pc}            @ Restore and return

Calling Conventions in Practice

x64 System V (Linux/macOS):

; int add_three(int a, int b, int c)
add_three:
    mov eax, edi            ; a is in edi
    add eax, esi            ; b is in esi
    add eax, edx            ; c is in edx
    ret

; Calling add_three(5, 10, 15)
main:
    mov edi, 5              ; First argument
    mov esi, 10             ; Second argument
    mov edx, 15             ; Third argument
    call add_three
    ; Result in eax (30)
    ret

x64 Microsoft (Windows):

; int add_three(int a, int b, int c)
add_three:
    mov eax, ecx            ; a is in ecx
    add eax, edx            ; b is in edx
    add eax, r8d            ; c is in r8d
    ret

; Calling requires shadow space
main:
    sub rsp, 40             ; 32 bytes shadow space + 8 to keep rsp 16-byte aligned
    mov ecx, 5
    mov edx, 10
    mov r8d, 15
    call add_three
    add rsp, 40             ; Release shadow space and padding
    ret

Inline Assembly in C

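Inline assembly embeds a few instructions directly in C code for operations the compiler cannot express itself, such as reading special registers or issuing atomic and barrier instructions.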

x64 GCC/Clang Inline Assembly

Basic Syntax:

asm("assembly code"
    : output operands
    : input operands
    : clobbered registers
);

Examples:

// Read CPU timestamp counter
static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

// Atomic increment
static inline void atomic_inc(int *ptr) {
    asm volatile("lock incl %0"
                 : "+m"(*ptr)
                 :
                 : "memory");
}

// Get CPU ID
static inline void cpuid(uint32_t code, uint32_t *a, uint32_t *d) {
    asm volatile("cpuid"
                 : "=a"(*a), "=d"(*d)
                 : "a"(code)
                 : "ebx", "ecx");
}

// Spinlock
static inline void spin_lock(volatile int *lock) {
    asm volatile(
        "1: mov $1, %%eax\n"
        "   xchg %%eax, %0\n"
        "   test %%eax, %%eax\n"
        "   jnz 1b\n"
        : "+m"(*lock)
        :
        : "eax", "memory"
    );
}
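
To make the three operand sections concrete, here is a small sketch that labels each one. It assumes GCC or Clang; `add_u32` is a hypothetical function used only for illustration, and the plain C fallback is there so the file still builds on other targets.

```c
#include <stdint.h>

/* Add two 32-bit values with one inline instruction where supported. */
static inline uint32_t add_u32(uint32_t a, uint32_t b) {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __asm__("addl %1, %0"   /* AT&T syntax: %0 += %1 */
            : "+r"(a)       /* output: a, read and written, any register */
            : "r"(b)        /* input: b, any register */
            : "cc");        /* clobber: the flags are modified */
    return a;
#elif defined(__GNUC__) && defined(__aarch64__)
    __asm__("add %w0, %w0, %w1"
            : "+r"(a)       /* output: a, read and written */
            : "r"(b));      /* input: b */
    return a;
#else
    return a + b;           /* portable fallback, no inline assembly */
#endif
}
```

The `"+r"` constraint tells the compiler the operand is both read and written, so it does not assume the old value of `a` survives the statement.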

ARM Inline Assembly

// Get current program status register
static inline uint32_t get_cpsr(void) {
    uint32_t cpsr;
    asm volatile("mrs %0, cpsr" : "=r"(cpsr));
    return cpsr;
}

// Memory barrier
static inline void memory_barrier(void) {
    asm volatile("dmb" ::: "memory");
}

// Enable interrupts
static inline void enable_interrupts(void) {
    asm volatile("cpsie i");
}

Advanced Control Flow

Switch Statements (Jump Tables)

Efficient switch implementation using jump tables.

C Code:

int handle_command(int cmd) {
    switch(cmd) {
        case 0: return do_read();
        case 1: return do_write();
        case 2: return do_close();
        default: return -1;
    }
}

x64 Assembly with Jump Table:

section .text
global handle_command
extern do_read, do_write, do_close

handle_command:
    ; edi contains cmd
    cmp edi, 2                  ; Unsigned compare: catches cmd > 2 and cmd < 0
    ja .default                 ; If out of range, default case

    mov edi, edi                ; Zero-extend cmd (upper half of rdi is undefined)
    lea rax, [rel jump_table]   ; Load jump table address
    mov rax, [rax + rdi*8]      ; Get handler address
    jmp rax                     ; Jump to handler

.case0:
    jmp do_read                 ; Tail call: keeps the stack aligned for the callee

.case1:
    jmp do_write

.case2:
    jmp do_close

.default:
    mov eax, -1
    ret

section .rodata
jump_table:
    dq handle_command.case0
    dq handle_command.case1
    dq handle_command.case2
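
The same dispatch can be sketched in C with an array of function pointers, which is close to what the assembly jump table encodes. The handler bodies below are placeholder stubs with made-up return values, since the originals are not shown in this chapter.

```c
/* Placeholder handlers; the return values are arbitrary for illustration. */
static int do_read(void)  { return 100; }
static int do_write(void) { return 200; }
static int do_close(void) { return 300; }

typedef int (*handler_fn)(void);

/* One entry per case label, indexed directly by cmd. */
static const handler_fn jump_table[] = { do_read, do_write, do_close };

int handle_command(int cmd) {
    /* A single unsigned comparison rejects both cmd > 2 and negative cmd,
     * the same trick as `cmp edi, 2` followed by `ja`. */
    if ((unsigned)cmd > 2u) {
        return -1;
    }
    return jump_table[cmd]();
}
```

Dispatch cost stays constant no matter how many cases there are, which is why compilers tend to emit tables like this for dense case ranges.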

Bitwise Operations and Flags

Setting, Clearing, Testing Bits

; Set bit n in register
; Input: eax = value, ecx = bit number
set_bit:
    bts eax, ecx                ; Bit test and set
    ret

; Clear bit n
clear_bit:
    btr eax, ecx                ; Bit test and reset
    ret

; Test bit n (result in CF flag)
test_bit:
    bt eax, ecx                 ; Bit test
    setc al                     ; Set al if CF=1
    movzx eax, al
    ret

; Count leading zeros
count_leading_zeros:
    bsr ecx, eax                ; Bit scan reverse
    jz .zero
    xor ecx, 31                 ; Convert to leading zeros
    mov eax, ecx
    ret
.zero:
    mov eax, 32
    ret
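
As a reference when testing the routines above, the same operations can be written in portable C. This is a sketch under the assumption that bit numbers wrap modulo 32, which is how `bts`, `btr`, and `bt` treat a register bit offset applied to a 32-bit register operand.

```c
#include <stdint.h>

/* Set bit n (n taken modulo 32, like bts on a 32-bit register). */
uint32_t set_bit(uint32_t value, unsigned n) {
    return value | (UINT32_C(1) << (n & 31u));
}

/* Clear bit n. */
uint32_t clear_bit(uint32_t value, unsigned n) {
    return value & ~(UINT32_C(1) << (n & 31u));
}

/* Return 1 if bit n is set, otherwise 0 (the value bt leaves in CF). */
int test_bit(uint32_t value, unsigned n) {
    return (int)((value >> (n & 31u)) & 1u);
}

/* Count leading zero bits; returns 32 for an input of 0. */
unsigned count_leading_zeros(uint32_t value) {
    unsigned count = 0;
    if (value == 0) {
        return 32;
    }
    while ((value & UINT32_C(0x80000000)) == 0) {
        value <<= 1;
        count++;
    }
    return count;
}
```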

Working with Flags

; Save and restore flags
save_flags:
    pushfq                      ; Push RFLAGS onto stack
    pop rax                     ; Pop into rax
    ret

restore_flags:
    push rdi                    ; rdi contains flags
    popfq                       ; Pop into RFLAGS
    ret

; Conditional moves based on flags
conditional_max:
    ; max(eax, ebx) -> eax
    cmp eax, ebx
    cmovl eax, ebx              ; if (eax < ebx) eax = ebx
    ret

Loops and Optimization

Loop Unrolling

Trade code size for speed by reducing loop overhead.

Standard Loop:

; Sum array of 1000 elements
sum_array:
    xor eax, eax
    xor ecx, ecx
.loop:
    add eax, [rdi + rcx*4]
    inc rcx
    cmp rcx, 1000
    jl .loop
    ret

Unrolled Loop (4x):

sum_array_unrolled:
    xor eax, eax
    xor ecx, ecx
.loop:
    add eax, [rdi + rcx*4]      ; Iteration 1
    add eax, [rdi + rcx*4 + 4]  ; Iteration 2
    add eax, [rdi + rcx*4 + 8]  ; Iteration 3
    add eax, [rdi + rcx*4 + 12] ; Iteration 4
    add rcx, 4
    cmp rcx, 1000
    jl .loop
    ret
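
The unrolled listing relies on the element count (1000) being a multiple of 4. A sketch of the general pattern in C, with a cleanup loop for leftover elements, might look like the following. The four separate accumulators are an addition not present in the assembly above; they are one common way to let independent additions overlap.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum `count` integers, four per iteration, then handle the remainder. */
int32_t sum_array_unrolled(const int32_t *data, size_t count) {
    /* Unsigned accumulators wrap on overflow, like `add eax, ...` does. */
    uint32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;

    /* Main unrolled body: runs while at least 4 elements remain. */
    for (; i + 4 <= count; i += 4) {
        s0 += (uint32_t)data[i];
        s1 += (uint32_t)data[i + 1];
        s2 += (uint32_t)data[i + 2];
        s3 += (uint32_t)data[i + 3];
    }

    uint32_t sum = s0 + s1 + s2 + s3;

    /* Cleanup loop: 0 to 3 leftover elements. */
    for (; i < count; i++) {
        sum += (uint32_t)data[i];
    }
    return (int32_t)sum;
}
```

Whether unrolling helps in practice depends on the CPU and on what the loop is bound by, so it is worth measuring before and after.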

SIMD Operations

Using vector instructions for parallel processing.

x64 SSE Example:

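; Add two arrays of floats (4 at a time)
; rdi = array1, rsi = array2, rdx = result, rcx = count
; Assumes all three arrays are 16-byte aligned (movaps faults otherwise)
; and that count is a multiple of 4 (otherwise the loop runs past the end)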
add_arrays_sse:
    xor rax, rax
.loop:
    cmp rax, rcx
    jge .done

    movaps xmm0, [rdi + rax*4]  ; Load 4 floats from array1
    movaps xmm1, [rsi + rax*4]  ; Load 4 floats from array2
    addps xmm0, xmm1            ; Add 4 floats in parallel
    movaps [rdx + rax*4], xmm0  ; Store result

    add rax, 4                  ; Process 4 elements per iteration
    jmp .loop

.done:
    ret
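
The same idea can be sketched from C with SSE intrinsics. This version is an illustration rather than a translation of the listing: it uses unaligned loads and stores (`_mm_loadu_ps`, `_mm_storeu_ps`) so it has no alignment requirement, adds a scalar loop for leftover elements, and falls back to plain C when SSE is not available. The function name `add_arrays` is hypothetical.

```c
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_add_ps, ... */
#endif

/* result[i] = a[i] + b[i] for i in [0, count). */
void add_arrays(const float *a, const float *b, float *result, size_t count) {
    size_t i = 0;

#if defined(__SSE__)
    /* Vector body: 4 floats per iteration. */
    for (; i + 4 <= count; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(result + i, _mm_add_ps(va, vb));
    }
#endif

    /* Scalar tail (or the whole array when SSE is unavailable). */
    for (; i < count; i++) {
        result[i] = a[i] + b[i];
    }
}
```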

System Instructions

CPU Control Instructions

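; Note: hlt, cli, sti, and control-register moves are privileged;
; they fault when executed in user mode.

; Halt the CPU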
halt:
    hlt                         ; Halt until interrupt
    ret

; No operation (delay/alignment)
nop_delay:
    nop
    nop
    nop
    ret

; Disable/enable interrupts
disable_interrupts:
    cli                         ; Clear interrupt flag
    ret

enable_interrupts:
    sti                         ; Set interrupt flag
    ret

; Read/write control registers (x64)
read_cr0:
    mov rax, cr0
    ret

write_cr0:
    mov cr0, rdi
    ret

Port I/O (x86/x64)

; Read byte from I/O port
; Input: edi = port
; Output: al = value
inb:
    mov edx, edi
    in al, dx
    ret

; Write byte to I/O port
; Input: edi = port, esi = value
outb:
    mov edx, edi
    mov eax, esi
    out dx, al
    ret

Interfacing Assembly with C

Calling C from Assembly

x64:

extern printf                   ; Declare external C function

section .data
    format db "Value: %d", 10, 0

section .text
global asm_print

asm_print:
    push rbp
    mov rbp, rsp

    ; Call printf(format, 42)
    mov rdi, format             ; First arg: format string
    mov esi, 42                 ; Second arg: value
    xor eax, eax                ; No floating-point args
    call printf

    pop rbp
    ret

Calling Assembly from C

C code:

// Function declarations
extern int asm_add(int a, int b);
extern void asm_print_array(int *arr, int len);

int main() {
    int result = asm_add(5, 10);

    int numbers[] = {1, 2, 3, 4, 5};
    asm_print_array(numbers, 5);

    return 0;
}

Assembly implementation:

global asm_add
global asm_print_array
extern print_number             ; External C function, assumed: void print_number(int)

; int asm_add(int a, int b)
asm_add:
    mov eax, edi
    add eax, esi
    ret

; void asm_print_array(int *arr, int len)
asm_print_array:
    push rbp
    mov rbp, rsp
    push rbx
    push r12
    push r13
    sub rsp, 8                  ; Pad so rsp is 16-byte aligned at the call

    mov rbx, rdi                ; arr
    movsxd r12, esi             ; len (sign-extend the 32-bit int)
    xor r13d, r13d              ; index

.loop:
    cmp r13, r12
    jge .done

    mov edi, [rbx + r13*4]
    call print_number
    inc r13
    jmp .loop

.done:
    add rsp, 8                  ; Remove alignment padding
    pop r13
    pop r12
    pop rbx
    pop rbp
    ret

Debugging Assembly Code

Using GDB

# Compile with debug symbols
nasm -f elf64 -g -F dwarf program.asm
gcc -g -no-pie -o program program.o   # -no-pie: the examples use absolute addresses

# Start GDB
gdb ./program

# GDB commands
(gdb) break main              # Set breakpoint
(gdb) run                     # Run program
(gdb) layout asm              # Show assembly view
(gdb) stepi                   # Step one instruction
(gdb) info registers          # Show all registers
(gdb) x/10gx $rsp             # Examine stack (ten 8-byte words)
(gdb) disassemble main        # Disassemble function

Common Debugging Techniques

; Add debug markers
debug_point:
    nop                         ; Breakpoint location
    nop
    nop

; Print register values (using C printf)
; Note: printf may also clobber rcx, rdx, and r8-r11; save those too if they are live
debug_print_rax:
    push rdi
    push rsi
    push rax

    mov rdi, debug_fmt
    mov rsi, rax
    xor eax, eax
    call printf

    pop rax
    pop rsi
    pop rdi
    ret

section .data
debug_fmt: db "RAX = 0x%lx", 10, 0

Key Concepts

  • Arrays are accessed using base + index * element_size
  • Strings are null-terminated character arrays
  • Structures use fixed offsets for member access
  • Stack frames manage local variables and function calls
  • Inline assembly integrates assembly in C code
  • Jump tables optimize switch statements
  • Loop unrolling trades space for speed
  • SIMD enables parallel data processing
  • Port I/O accesses hardware devices (x86/x64)

Common Mistakes

  1. Wrong calling convention - Check ABI for your platform
  2. Stack misalignment - The x64 ABIs require rsp to be 16-byte aligned at every call instruction
  3. Register clobbering - Save/restore callee-saved registers
  4. Incorrect offsets - Struct member offsets must match C
  5. Missing null checks - Dereference can crash
  6. Buffer overruns - Always bounds-check arrays
  7. Forgetting stack cleanup - Leads to stack corruption
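
The alignment rule in item 2 reduces to a little arithmetic. Under both x64 ABIs, rsp is 16-byte aligned just before a call, so on entry to a function it sits 8 bytes off (the return address was just pushed). The sketch below computes how much extra padding a function needs before making its own calls; `alignment_padding` is a hypothetical helper written only to show the calculation.

```c
#include <stddef.h>

/* Given the bytes a function has pushed or reserved since entry,
 * return the extra bytes to subtract from rsp so that rsp is a
 * multiple of 16 at the next call instruction.
 * Assumes rsp % 16 == 8 on entry (return address already pushed). */
size_t alignment_padding(size_t bytes_used) {
    return (16 - (8 + bytes_used) % 16) % 16;
}
```

For example, a function that pushes four 8-byte registers (32 bytes) needs 8 more bytes of padding, while one that pushes three (24 bytes) needs none. A function that only reserves 32 bytes of Windows shadow space also needs 8 more, for a total adjustment of 40.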

Debugging Tips

  • Use debugger step-by-step - Don't guess, watch execution
  • Check register values - Verify each computation
  • Examine stack - Look for corruption or misalignment
  • Add debug prints - Print intermediate values
  • Compare with C version - Compile C to assembly and compare
  • Test small - Verify each function independently
  • Check flags - Conditional jumps depend on flags

Mini Exercises

  1. Implement strcmp in assembly (compare two strings)
  2. Write assembly function to reverse an array in-place
  3. Implement bubble sort in assembly
  4. Create a function that counts set bits (popcount)
  5. Write assembly to find min/max in an array
  6. Implement memcpy in assembly
  7. Create a factorial function (iterative and recursive)
  8. Write inline assembly to read CPU timestamp counter
  9. Implement a simple hash function
  10. Create assembly function to swap two integers using XOR

Review Questions

  1. How do you access the third element of an integer array in assembly?
  2. What's the difference between call and jmp?
  3. Why must you save callee-saved registers in a function?
  4. How does a jump table optimize switch statements?
  5. What's the purpose of function prologue and epilogue?

Reference Checklist

By the end of this chapter, you should be able to:

  • Implement arrays and strings in assembly
  • Access struct members correctly
  • Write proper function prologues and epilogues
  • Follow calling conventions for your architecture
  • Use inline assembly in C
  • Implement jump tables for switch statements
  • Perform bitwise operations and flag manipulation
  • Optimize loops with unrolling and SIMD
  • Call C functions from assembly and vice versa
  • Debug assembly code with GDB

Next Steps

With advanced assembly skills, you're ready to tackle memory management. The next chapter explores how operating systems manage memory: virtual memory, paging, segmentation, and the Memory Management Unit (MMU).


Key Takeaway: Advanced assembly programming requires understanding data structures, calling conventions, and optimization techniques. Combined with debugging skills and C integration, you can write and maintain low-level system code effectively.