NANDHOO.

Advanced Assembly Programming

Chapter 10: Advanced Assembly Programming


Introduction


Moving beyond basic assembly syntax, advanced assembly programming involves implementing complex algorithms, data structures, and interfacing with high-level code. This chapter explores real-world assembly programming techniques used in bootloaders, kernels, and performance-critical system components.


Why This Matters


While most code is written in C or higher-level languages, certain tasks require assembly: bootloader initialization, context switching, interrupt handlers, CPU-specific optimizations, and inline assembly in kernels. Understanding advanced assembly equips you to work on the lowest-level system code.


How to Study This Chapter


  1. Practice incrementally - Start with simple programs, add complexity gradually
  2. Use debuggers - GDB and LLDB are essential for understanding assembly flow
  3. Compare with C - Compile C to assembly to see patterns
  4. Test thoroughly - Assembly bugs can be subtle and dangerous

Implementing Data Structures in Assembly


Arrays


Arrays in assembly are just consecutive memory locations.


x64 Example: Sum array elements

section .data
    array dd 10, 20, 30, 40, 50    ; 5 integers
    length equ 5

section .bss
    sum resd 1                      ; Reserve space for sum

section .text
global main

main:
    push rbp
    mov rbp, rsp

    xor rax, rax                    ; sum = 0
    xor rcx, rcx                    ; index = 0

.loop:
    cmp rcx, length                 ; if (index >= length)
    jge .done                       ;     goto done

    add eax, [array + rcx*4]        ; sum += array[index]
    inc rcx                         ; index++
    jmp .loop

.done:
    mov [sum], eax                  ; Store result

    pop rbp
    ret

ARM Example: Array access

.data
    array: .word 10, 20, 30, 40, 50
    length: .word 5

.text
.global main

main:
    push {r4, lr}                   @ Save r4 (callee-saved) and return address

    ldr r1, =array                  @ r1 = array address
    ldr r2, =length                 @ r2 = length address
    ldr r2, [r2]                    @ r2 = length value

    mov r0, #0                      @ sum = 0
    mov r3, #0                      @ index = 0

loop:
    cmp r3, r2                      @ if (index >= length)
    bge done                        @     goto done

    ldr r4, [r1, r3, lsl #2]        @ r4 = array[index]
    add r0, r0, r4                  @ sum += r4
    add r3, r3, #1                  @ index++
    b loop

done:
    pop {r4, pc}                    @ Restore r4 and return (sum in r0)
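
For comparison, the same address arithmetic can be written out in C. This is a minimal sketch, not part of the original listings: the function name sum_array_c is hypothetical, and it spells out base + index * element_size by hand only to mirror the x64 operand [array + rcx*4] and the ARM operand [r1, r3, lsl #2].

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sums `length` 32-bit integers, computing each
 * element address manually as base + index * element_size. */
int32_t sum_array_c(const int32_t *array, size_t length)
{
    const unsigned char *base = (const unsigned char *)array;
    int32_t sum = 0;

    for (size_t index = 0; index < length; index++) {
        /* Same address the x64 operand [array + rcx*4] refers to. */
        const int32_t *element =
            (const int32_t *)(base + index * sizeof(int32_t));
        sum += *element;
    }
    return sum;
}
```

In ordinary C you would simply write array[index]; the compiler generates the scaled addressing for you.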


Strings


Strings are null-terminated character arrays.


x64 Example: String length

section .text
global strlen

; size_t strlen(const char *str)
; Input: rdi = string pointer
; Output: rax = length

strlen:
    xor rax, rax                    ; length = 0

.loop:
    cmp byte [rdi + rax], 0         ; if (*str == '\0')
    je .done                        ;     goto done
    inc rax                         ; length++
    jmp .loop

.done:
    ret


String copy (strcpy)

global strcpy

; char* strcpy(char *dest, const char *src)
; Input: rdi = dest, rsi = src
; Output: rax = dest

strcpy:
    push rbp
    mov rbp, rsp

    mov rax, rdi                    ; Save original dest

.loop:
    mov cl, [rsi]                   ; Load byte from src
    mov [rdi], cl                   ; Store byte to dest
    inc rsi
    inc rdi
    test cl, cl                     ; Check if null
    jnz .loop                       ; Continue if not null

    pop rbp
    ret
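
As a reference point when testing the assembly routines, the two loops can be mirrored in C. The names my_strlen and my_strcpy are hypothetical (chosen to avoid clashing with the C library); this is a sketch of the same byte-by-byte logic, not a replacement for the library functions.

```c
#include <stddef.h>

/* Mirrors the strlen loop: count bytes until the terminating 0. */
size_t my_strlen(const char *str)
{
    size_t length = 0;
    while (str[length] != '\0')
        length++;
    return length;
}

/* Mirrors the strcpy loop: copy each byte, including the terminator,
 * and return the original destination pointer. */
char *my_strcpy(char *dest, const char *src)
{
    char *original = dest;
    char c;
    do {
        c = *src++;      /* load byte from src  */
        *dest++ = c;     /* store byte to dest  */
    } while (c != '\0'); /* stop after copying the null byte */
    return original;
}
```

Like the assembly version, my_strcpy performs no bounds checking, so the caller must ensure dest is large enough.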

Structures (Structs)


Accessing struct members uses offsets.


C Structure:

struct Point {
    int x;      // offset 0
    int y;      // offset 4
};

x64 Assembly:

section .data
    point1:
        dd 10           ; x = 10
        dd 20           ; y = 20

section .text
global get_x

; int get_x(struct Point *p)
get_x:
    mov eax, [rdi]          ; Return p->x (offset 0)
    ret

global get_y
get_y:
    mov eax, [rdi + 4]      ; Return p->y (offset 4)
    ret

global set_point
; void set_point(struct Point *p, int x, int y)
; rdi = p, esi = x, edx = y
set_point:
    mov [rdi], esi          ; p->x = x
    mov [rdi + 4], edx      ; p->y = y
    ret
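
Because the assembly hard-codes offsets 0 and 4, it can help to have the C side confirm that layout at compile time. The sketch below assumes a C11 compiler and a 4-byte int; get_y_c is a hypothetical helper that performs the same "pointer plus offset" access as get_y.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */

struct Point {
    int x;
    int y;
};

/* Fail the build if the layout differs from the offsets used in assembly. */
static_assert(offsetof(struct Point, x) == 0, "x must be at offset 0");
static_assert(offsetof(struct Point, y) == 4, "y must be at offset 4");
static_assert(sizeof(struct Point) == 8, "Point must be 8 bytes");

/* Hypothetical C equivalent of get_y: read the int stored 4 bytes past p. */
int get_y_c(const struct Point *p)
{
    return *(const int *)((const char *)p + offsetof(struct Point, y));
}
```

If a field is later added or reordered, the static assertions stop the build instead of letting the assembly read the wrong member.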


Stack Frames and Function Calls


Standard Function Prologue and Epilogue


x64:

function:
    ; Prologue
    push rbp                ; Save old frame pointer
    mov rbp, rsp            ; Set new frame pointer
    sub rsp, 32             ; Allocate 32 bytes local space

    ; Function body
    ; [rbp - 8] = local variable 1
    ; [rbp - 16] = local variable 2

    ; Epilogue
    mov rsp, rbp            ; Restore stack pointer
    pop rbp                 ; Restore frame pointer
    ret

ARM:

function:
    @ Prologue
    push {fp, lr}           @ Save frame pointer and return address
    mov fp, sp              @ Set frame pointer
    sub sp, sp, #16         @ Allocate local space

    @ Function body

    @ Epilogue
    mov sp, fp              @ Restore stack pointer
    pop {fp, pc}            @ Restore and return

Calling Conventions in Practice


x64 System V (Linux/macOS):

; int add_three(int a, int b, int c)
add_three:
    mov eax, edi            ; a is in edi
    add eax, esi            ; b is in esi
    add eax, edx            ; c is in edx
    ret

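; Calling add_three(5, 10, 15)
main:
    sub rsp, 8              ; Realign the stack to 16 bytes before the call
    mov edi, 5              ; First argument
    mov esi, 10             ; Second argument
    mov edx, 15             ; Third argument
    call add_three          ; Result in eax (30)
    add rsp, 8              ; Undo the alignment adjustment
    ret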


x64 Microsoft (Windows):

; int add_three(int a, int b, int c)
add_three:
    mov eax, ecx            ; a is in ecx
    add eax, edx            ; b is in edx
    add eax, r8d            ; c is in r8d
    ret

; Calling requires shadow space
main:
    sub rsp, 40             ; 32 bytes shadow space + 8 to keep rsp 16-byte aligned
    mov ecx, 5
    mov edx, 10
    mov r8d, 15
    call add_three
    add rsp, 40             ; Release shadow space and padding
    ret


Inline Assembly in C


Mixing C and assembly for critical sections.


x64 GCC/Clang Inline Assembly


Basic Syntax:

asm("assembly code"
    : output operands
    : input operands
    : clobbered registers
);

Examples:


// Read CPU timestamp counter
static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

// Atomic increment
static inline void atomic_inc(int *ptr) {
    asm volatile("lock incl %0" : "+m"(*ptr) : : "memory");
}

// Get CPU ID
static inline void cpuid(uint32_t code, uint32_t *a, uint32_t *d) {
    asm volatile("cpuid"
                 : "=a"(*a), "=d"(*d)
                 : "a"(code)
                 : "ebx", "ecx");
}

// Spinlock
static inline void spin_lock(volatile int *lock) {
    asm volatile(
        "1: mov $1, %%eax\n"
        "   xchg %%eax, %0\n"
        "   test %%eax, %%eax\n"
        "   jnz 1b\n"
        : "+m"(*lock)
        :
        : "eax", "memory"
    );
}
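
To make the output, input, and clobber lists concrete, here is a small sketch that can run in user mode. The function add_u32 is hypothetical and exists only to show the constraint syntax; it assumes GCC or Clang on x86-64 and falls back to plain C elsewhere. It uses the __asm__ spelling, which these compilers also accept in strict ISO C modes.

```c
#include <stdint.h>

/* Hypothetical example: add two 32-bit values with a single `addl`. */
static inline uint32_t add_u32(uint32_t a, uint32_t b)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    uint32_t result = a;
    __asm__("addl %1, %0"      /* AT&T order: result += b */
            : "+r"(result)     /* output operand, read and written, in a register */
            : "r"(b)           /* input operand in any general-purpose register */
            : "cc");           /* clobber: addl modifies the flags */
    return result;
#else
    return a + b;              /* fallback for other targets or compilers */
#endif
}
```

The "+r" constraint tells the compiler that the register is both read and written, so it loads a into it before the instruction runs.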


ARM Inline Assembly


// Get current program status register
static inline uint32_t get_cpsr(void) {
    uint32_t cpsr;
    asm volatile("mrs %0, cpsr" : "=r"(cpsr));
    return cpsr;
}

// Memory barrier
static inline void memory_barrier(void) {
    asm volatile("dmb" ::: "memory");
}

// Enable interrupts
static inline void enable_interrupts(void) {
    asm volatile("cpsie i");
}


Advanced Control Flow


Switch Statements (Jump Tables)


Efficient switch implementation using jump tables.


C Code:

int handle_command(int cmd) {
    switch(cmd) {
        case 0: return do_read();
        case 1: return do_write();
        case 2: return do_close();
        default: return -1;
    }
}

x64 Assembly with Jump Table:

section .text
global handle_command

extern do_read
extern do_write
extern do_close

handle_command:
    ; edi contains cmd
    cmp edi, 2                  ; Check if cmd > 2 (unsigned, so negatives fail too)
    ja .default                 ; If yes, default case

    mov edi, edi                ; Zero-extend cmd: upper half of rdi is undefined
    lea rax, [jump_table]       ; Load jump table address
    mov rax, [rax + rdi*8]      ; Get handler address
    jmp rax                     ; Jump to handler

.case0:
    jmp do_read                 ; Tail call: do_read returns to our caller

.case1:
    jmp do_write

.case2:
    jmp do_close

.default:
    mov eax, -1
    ret

section .rodata
jump_table:
    dq handle_command.case0
    dq handle_command.case1
    dq handle_command.case2
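
The same dispatch idea can be expressed in C as an array of function pointers. This is a sketch under assumptions: the three handlers are hypothetical stubs that return made-up values so the example is self-contained, standing in for the real do_read, do_write, and do_close.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the real handlers. */
static int do_read(void)  { return 100; }
static int do_write(void) { return 200; }
static int do_close(void) { return 300; }

typedef int (*handler_fn)(void);

/* One entry per case label, indexed directly by cmd. */
static const handler_fn jump_table[] = { do_read, do_write, do_close };

int handle_command(int cmd)
{
    size_t count = sizeof(jump_table) / sizeof(jump_table[0]);

    /* The unsigned cast sends negative cmd values to the default case,
     * just as the single unsigned `ja` does in the assembly. */
    if ((unsigned int)cmd >= count)
        return -1;

    return jump_table[cmd]();   /* indexed load, then indirect call */
}
```

The lookup costs one bounds check and one indexed load regardless of how many cases exist, which is where the advantage over a chain of comparisons comes from.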


Bitwise Operations and Flags


Setting, Clearing, Testing Bits


; Set bit n in register
; Input: eax = value, ecx = bit number
set_bit:
    bts eax, ecx                ; Bit test and set
    ret

; Clear bit n
clear_bit:
    btr eax, ecx                ; Bit test and reset
    ret

; Test bit n (result in CF flag)
test_bit:
    bt eax, ecx                 ; Bit test
    setc al                     ; Set al if CF=1
    movzx eax, al
    ret

; Count leading zeros
count_leading_zeros:
    bsr ecx, eax                ; Bit scan reverse
    jz .zero
    xor ecx, 31                 ; Convert to leading zeros
    mov eax, ecx
    ret
.zero:
    mov eax, 32
    ret
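
The bit routines above map onto simple shift-and-mask expressions in C, which can serve as a reference when checking the assembly results. The function names below are hypothetical. The bit number is masked with 31 because bts, btr, and bt with a 32-bit register operand use the bit offset modulo 32.

```c
#include <stdint.h>

/* Set bit n (like bts with a register operand). */
uint32_t set_bit_c(uint32_t value, unsigned n)
{
    return value | (UINT32_C(1) << (n & 31));
}

/* Clear bit n (like btr). */
uint32_t clear_bit_c(uint32_t value, unsigned n)
{
    return value & ~(UINT32_C(1) << (n & 31));
}

/* Return 1 if bit n is set, else 0 (like bt followed by setc). */
int test_bit_c(uint32_t value, unsigned n)
{
    return (int)((value >> (n & 31)) & 1u);
}

/* Count leading zeros; returns 32 for an input of 0, matching the
 * .zero branch of count_leading_zeros. */
unsigned clz32(uint32_t value)
{
    unsigned count = 0;
    if (value == 0)
        return 32;
    while ((value & UINT32_C(0x80000000)) == 0) {
        value <<= 1;
        count++;
    }
    return count;
}
```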


Working with Flags


; Save and restore flags
save_flags:
    pushf                       ; Push flags onto stack
    pop rax                     ; Pop into rax
    ret

restore_flags:
    push rdi                    ; rdi contains flags
    popf                        ; Pop into flags register
    ret

; Conditional moves based on flags
conditional_max:
    ; max(eax, ebx) -> eax
    cmp eax, ebx
    cmovl eax, ebx              ; if (eax < ebx) eax = ebx
    ret


Loops and Optimization


Loop Unrolling


Trade code size for speed by reducing loop overhead.


Standard Loop:

; Sum array of 1000 elements
sum_array:
    xor eax, eax
    xor ecx, ecx
.loop:
    add eax, [rdi + rcx*4]
    inc rcx
    cmp rcx, 1000
    jl .loop
    ret

Unrolled Loop (4x):

sum_array_unrolled:
    xor eax, eax
    xor ecx, ecx
.loop:
    add eax, [rdi + rcx*4]      ; Iteration 1
    add eax, [rdi + rcx*4 + 4]  ; Iteration 2
    add eax, [rdi + rcx*4 + 8]  ; Iteration 3
    add eax, [rdi + rcx*4 + 12] ; Iteration 4
    add rcx, 4
    cmp rcx, 1000
    jl .loop
    ret
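
The unrolled assembly loop works because 1000 is a multiple of 4. For an arbitrary element count, the usual approach is an unrolled main loop followed by a short cleanup loop. The following is a sketch in C; the function name sum_unrolled is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum `count` integers, four per iteration, then handle the leftovers. */
int32_t sum_unrolled(const int32_t *data, size_t count)
{
    /* Accumulate as unsigned so overflow wraps, like `add eax, ...`. */
    uint32_t sum = 0;
    size_t i = 0;

    /* Main loop: runs while at least four elements remain. */
    for (; i + 4 <= count; i += 4) {
        sum += (uint32_t)data[i];
        sum += (uint32_t)data[i + 1];
        sum += (uint32_t)data[i + 2];
        sum += (uint32_t)data[i + 3];
    }

    /* Cleanup loop: at most three remaining elements. */
    for (; i < count; i++)
        sum += (uint32_t)data[i];

    return (int32_t)sum;
}
```

Optimizing compilers often perform this transformation themselves, so it is worth comparing the compiler's output before unrolling by hand.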

SIMD Operations


Using vector instructions for parallel processing.


x64 SSE Example:

; Add two arrays of floats (4 at a time)
; rdi = array1, rsi = array2, rdx = result, rcx = count
; Assumes count is a multiple of 4 and all three arrays are 16-byte aligned
; (movaps faults on unaligned addresses; movups accepts any alignment)
add_arrays_sse:
    xor rax, rax
.loop:
    cmp rax, rcx
    jge .done

    movaps xmm0, [rdi + rax*4]  ; Load 4 floats from array1
    movaps xmm1, [rsi + rax*4]  ; Load 4 floats from array2
    addps xmm0, xmm1            ; Add 4 floats in parallel
    movaps [rdx + rax*4], xmm0  ; Store result

    add rax, 4                  ; Process 4 elements per iteration
    jmp .loop

.done:
    ret
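
From C, the same SSE instructions are reachable through compiler intrinsics. The sketch below is one possible equivalent, with some deliberate differences from the assembly: the function name add_arrays is hypothetical, it uses unaligned loads and stores (_mm_loadu_ps and _mm_storeu_ps) so the arrays need no particular alignment, and it adds a scalar loop for counts that are not a multiple of 4. On targets without SSE it falls back to the scalar loop alone.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_add_ps, ... */
#endif

/* result[i] = a[i] + b[i] for i in [0, count). */
void add_arrays(const float *a, const float *b, float *result, size_t count)
{
    size_t i = 0;

#if defined(__SSE__)
    /* Vector loop: four floats per iteration. */
    for (; i + 4 <= count; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);               /* load 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);               /* load 4 floats */
        _mm_storeu_ps(result + i, _mm_add_ps(va, vb)); /* add and store */
    }
#endif

    /* Scalar loop: leftover elements, or everything without SSE. */
    for (; i < count; i++)
        result[i] = a[i] + b[i];
}
```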


System Instructions


CPU Control Instructions


; Halt the CPU
halt:
    hlt                         ; Halt until interrupt
    ret

; No operation (delay/alignment)
nop_delay:
    nop
    nop
    nop
    ret

; Disable/enable interrupts
disable_interrupts:
    cli                         ; Clear interrupt flag
    ret

enable_interrupts:
    sti                         ; Set interrupt flag
    ret

; Read/write control registers (x64)
read_cr0:
    mov rax, cr0
    ret

write_cr0:
    mov cr0, rdi
    ret


Port I/O (x86/x64)


; Read byte from I/O port
; Input: edi = port
; Output: al = value
inb:
    mov edx, edi
    in al, dx
    ret

; Write byte to I/O port
; Input: edi = port, esi = value
outb:
    mov edx, edi
    mov eax, esi
    out dx, al
    ret


Interfacing Assembly with C


Calling C from Assembly


x64:

extern printf                   ; Declare external C function

section .data
    format db "Value: %d", 10, 0

section .text
global asm_print

asm_print:
    push rbp
    mov rbp, rsp

    ; Call printf(format, 42)
    mov rdi, format             ; First arg: format string
    mov esi, 42                 ; Second arg: value
    xor eax, eax                ; No floating-point args
    call printf

    pop rbp
    ret

Calling Assembly from C


C code:

// Function declarations
extern int asm_add(int a, int b);
extern void asm_print_array(int *arr, int len);

int main() {
    int result = asm_add(5, 10);

    int numbers[] = {1, 2, 3, 4, 5};
    asm_print_array(numbers, 5);

    return 0;
}


Assembly implementation:

global asm_add
global asm_print_array

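extern print_number             ; External C function (assumed to take one int)

; int asm_add(int a, int b)
asm_add:
    mov eax, edi
    add eax, esi
    ret

; void asm_print_array(int *arr, int len)
asm_print_array:
    push rbp
    mov rbp, rsp
    push rbx
    push r12
    push r13
    sub rsp, 8                  ; Keep rsp 16-byte aligned at the call below

    mov rbx, rdi                ; arr
    movsxd r12, esi             ; len (sign-extend the 32-bit int)
    xor r13, r13                ; index

.loop:
    cmp r13, r12
    jge .done

    mov edi, [rbx + r13*4]
    call print_number           ; External C function
    inc r13
    jmp .loop

.done:
    add rsp, 8                  ; Remove alignment padding
    pop r13
    pop r12
    pop rbx
    pop rbp
    ret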


Debugging Assembly Code


Using GDB


# Compile with debug symbols
nasm -f elf64 -g -F dwarf program.asm
gcc -g -o program program.o

# Start GDB
gdb ./program

# GDB commands
(gdb) break main            # Set breakpoint
(gdb) run                   # Run program
(gdb) layout asm            # Show assembly view
(gdb) stepi                 # Step one instruction
(gdb) info registers        # Show all registers
(gdb) x/10x $rsp            # Examine stack
(gdb) disassemble main      # Disassemble function


Common Debugging Techniques


; Add debug markers
debug_point:
    nop                         ; Breakpoint location
    nop
    nop

; Print register values (using C printf)
; Note: printf may also clobber rcx, rdx, and r8-r11, which are not saved here
debug_print_rax:
    push rdi
    push rsi
    push rax

    mov rdi, debug_fmt
    mov rsi, rax
    xor eax, eax
    call printf

    pop rax
    pop rsi
    pop rdi
    ret

section .data
debug_fmt: db "RAX = 0x%lx", 10, 0


Key Concepts


  • Arrays are accessed using base + index * element_size
  • Strings are null-terminated character arrays
  • Structures use fixed offsets for member access
  • Stack frames manage local variables and function calls
  • Inline assembly integrates assembly in C code
  • Jump tables optimize switch statements
  • Loop unrolling trades space for speed
  • SIMD enables parallel data processing
  • Port I/O accesses hardware devices (x86/x64)

Common Mistakes


  1. Wrong calling convention - Check ABI for your platform
  2. Stack misalignment - x64 ABIs require rsp to be 16-byte aligned at every call instruction
  3. Register clobbering - Save/restore callee-saved registers
  4. Incorrect offsets - Struct member offsets must match C
  5. Missing null checks - Dereferencing a null pointer can crash the program
  6. Buffer overruns - Always bounds-check arrays
  7. Forgetting stack cleanup - Leads to stack corruption

Debugging Tips


  • Use debugger step-by-step - Don't guess, watch execution
  • Check register values - Verify each computation
  • Examine stack - Look for corruption or misalignment
  • Add debug prints - Print intermediate values
  • Compare with C version - Compile C to assembly and compare
  • Test small - Verify each function independently
  • Check flags - Conditional jumps depend on flags

Mini Exercises


  1. Implement strcmp in assembly (compare two strings)
  2. Write assembly function to reverse an array in-place
  3. Implement bubble sort in assembly
  4. Create a function that counts set bits (popcount)
  5. Write assembly to find min/max in an array
  6. Implement memcpy in assembly
  7. Create a factorial function (iterative and recursive)
  8. Write inline assembly to read CPU timestamp counter
  9. Implement a simple hash function
  10. Create assembly function to swap two integers using XOR

Review Questions


  1. How do you access the third element of an integer array in assembly?
  2. What's the difference between call and jmp?
  3. Why must you save callee-saved registers in a function?
  4. How does a jump table optimize switch statements?
  5. What's the purpose of function prologue and epilogue?

Reference Checklist


By the end of this chapter, you should be able to:

  • Implement arrays and strings in assembly
  • Access struct members correctly
  • Write proper function prologues and epilogues
  • Follow calling conventions for your architecture
  • Use inline assembly in C
  • Implement jump tables for switch statements
  • Perform bitwise operations and flag manipulation
  • Optimize loops with unrolling and SIMD
  • Call C functions from assembly and vice versa
  • Debug assembly code with GDB

Next Steps


With advanced assembly skills, you're ready to tackle memory management. The next chapter explores how operating systems manage memory: virtual memory, paging, segmentation, and the Memory Management Unit (MMU).




Key Takeaway: Advanced assembly programming requires understanding data structures, calling conventions, and optimization techniques. Combined with debugging skills and C integration, you can write and maintain low-level system code effectively.