ARM Kernel Development

Chapter 16: ARM Kernel Development

Introduction

ARM processors power billions of devices: smartphones, tablets, embedded systems, and increasingly servers and desktops. ARM kernel development differs significantly from x86/x64 because of the RISC architecture, a different boot process, and a wide variety of hardware platforms. This chapter guides you through creating a kernel for the ARM architecture.

Why This Matters

ARM is everywhere. From Raspberry Pi to Apple M-series chips, from IoT devices to automotive systems, ARM dominates mobile and embedded computing. Understanding ARM kernel development opens opportunities in mobile OS development, embedded systems, and the growing ARM server market.

How to Study This Chapter

  1. Understand RISC principles - ARM is simpler than x86 in many ways
  2. Target specific hardware - ARM has many variants (Raspberry Pi, Versatile, etc.)
  3. Use device trees - ARM systems describe hardware via device trees
  4. Test in QEMU - Start with emulation before real hardware
  5. Read ARM manuals - ARMv7/ARMv8 architecture reference manuals

ARM Boot Process

ARM vs x86 Boot

Aspect          x86/x64             ARM
Firmware        BIOS/UEFI           U-Boot/Vendor bootloader
Entry Mode      16-bit real mode    32/64-bit mode (depends on variant)
Entry Point     0xFFFFFFF0          Platform-specific
Boot Standard   MBR/GPT             Platform-specific
Device Info     PCI enumeration     Device tree

Typical ARM Boot Sequence

1. Power On
   ↓
2. Boot ROM (SoC-specific, in silicon)
   ↓
3. First-stage bootloader (U-Boot SPL)
   ↓
4. Second-stage bootloader (U-Boot)
   ↓
5. Load kernel + device tree
   ↓
6. Jump to kernel entry (with parameters)
   ↓
7. Kernel initializes and runs

Project Setup for ARM

Directory Structure

arm-kernel/
├── boot/
│   └── boot.s            # ARM entry point
├── kernel/
│   ├── main.c            # Kernel main
│   ├── uart.c            # Serial driver
│   ├── mmu.c             # Memory management
│   └── interrupts.c      # Exception/interrupt handling
├── include/
│   └── types.h
├── linker.ld
└── Makefile

Cross-Compilation Toolchain

# Install ARM cross-compiler (Ubuntu/Debian)
sudo apt-get install gcc-arm-none-eabi gdb-multiarch

# Or for Linux userspace:
sudo apt-get install gcc-arm-linux-gnueabi

# Verify installation
arm-none-eabi-gcc --version

Makefile for ARM

# Makefile for ARM kernel (bare metal)

CC = arm-none-eabi-gcc
LD = arm-none-eabi-ld
OBJCOPY = arm-none-eabi-objcopy
QEMU = qemu-system-arm

# For Versatile PB (ARM926EJ-S)
CFLAGS = -mcpu=arm926ej-s -mfloat-abi=soft -nostdlib -ffreestanding \
         -Iinclude -Wall -Wextra -O2

LDFLAGS = -T linker.ld

SOURCES = boot/boot.o kernel/main.o kernel/uart.o kernel/mmu.o kernel/interrupts.o
TARGET = kernel.elf
BINARY = kernel.bin

all: $(BINARY)

boot/boot.o: boot/boot.s
	$(CC) $(CFLAGS) -c -o $@ $<

%.o: %.c
	$(CC) $(CFLAGS) -c -o $@ $<

$(TARGET): $(SOURCES) linker.ld
	$(LD) $(LDFLAGS) -o $@ $(SOURCES)

$(BINARY): $(TARGET)
	$(OBJCOPY) -O binary $< $@

run: $(BINARY)
	$(QEMU) -M versatilepb -m 128M -kernel $(TARGET) -nographic

debug: $(BINARY)
	$(QEMU) -M versatilepb -m 128M -kernel $(TARGET) -serial stdio -s -S &
	gdb-multiarch $(TARGET) \
		-ex "target remote :1234" \
		-ex "break kernel_main" \
		-ex "continue"

clean:
	rm -f boot/*.o kernel/*.o $(TARGET) $(BINARY)

.PHONY: all run debug clean

ARM Boot Code (ARMv7)

Linker Script

linker.ld:

ENTRY(_start)

SECTIONS
{
    . = 0x10000;  /* Kernel load address for Versatile */

    .text : {
        *(.text.boot)
        *(.text*)
    }

    .rodata : {
        *(.rodata*)   /* includes string literals in .rodata.str* */
    }

    .data : {
        *(.data*)
    }

    .bss : {
        __bss_start = .;
        *(.bss)
        *(COMMON)
        __bss_end = .;
    }

    . = ALIGN(8);
    . = . + 0x1000; /* 4KB stack */
    stack_top = .;
}

Boot Assembly (ARMv7)

boot/boot.s:

.section .text.boot
.global _start

_start:
    @ We enter in supervisor mode

    @ Set up stack pointer
    ldr sp, =stack_top

    @ Clear BSS section
    ldr r0, =__bss_start
    ldr r1, =__bss_end
    mov r2, #0
clear_bss:
    cmp r0, r1
    bhs clear_done      @ unsigned compare, since these are addresses
    str r2, [r0], #4
    b clear_bss

clear_done:
    @ Jump to C code
    bl kernel_main

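    @ Hang if kernel returns. The ARM926EJ-S (ARMv5TE) has no wfe
    @ instruction, so use the CP15 wait-for-interrupt operation.
hang:
    mov r0, #0
    mcr p15, 0, r0, c7, c0, 4
    b hang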

UART Driver (Serial Output)

ARM platforms use memory-mapped UART (not port I/O like x86).
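
One way to picture memory-mapped I/O is as a C struct laid over the device's register window. The sketch below is an illustration only: the names struct pl011_regs and pl011_putc are hypothetical, only the two registers used in this chapter are named, and the register block is passed in as a pointer so the same code can be exercised against ordinary memory on a host machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical overlay for the PL011 register window. Only the data and
 * flag registers are named; the gap between them is padding. */
struct pl011_regs {
    volatile uint32_t dr;           /* 0x00: data register */
    volatile uint32_t reserved[5];  /* 0x04-0x14: not used here */
    volatile uint32_t fr;           /* 0x18: flag register */
};

/* Catch layout mistakes at compile time instead of on hardware. */
static_assert(offsetof(struct pl011_regs, dr) == 0x00, "DR offset");
static_assert(offsetof(struct pl011_regs, fr) == 0x18, "FR offset");

#define PL011_FR_TXFF (1u << 5)     /* transmit FIFO full */

/* Write one byte through the register block supplied by the caller. */
static void pl011_putc(struct pl011_regs *uart, char c) {
    while (uart->fr & PL011_FR_TXFF) {
        /* spin until the FIFO has room */
    }
    uart->dr = (uint32_t)(unsigned char)c;
}
```

On the Versatile PB the caller would pass (struct pl011_regs *)0x101f1000. The macro-per-register style used in uart.c below is equivalent; the struct form mainly helps when a platform has several UARTs of the same type.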

kernel/uart.c:

#include "types.h"

// UART0 base address for Versatile PB
#define UART0_BASE 0x101f1000

#define UART0_DR   (*(volatile uint32_t *)(UART0_BASE + 0x00))  // Data register
#define UART0_FR   (*(volatile uint32_t *)(UART0_BASE + 0x18))  // Flag register

// Flag register bits
#define UART_FR_TXFF (1 << 5)  // Transmit FIFO full
#define UART_FR_RXFE (1 << 4)  // Receive FIFO empty

void uart_putc(char c) {
    // Wait until transmit FIFO not full
    while (UART0_FR & UART_FR_TXFF);

    UART0_DR = c;
}

void uart_puts(const char *str) {
    while (*str) {
        if (*str == '\n') {
            uart_putc('\r');  // Add carriage return
        }
        uart_putc(*str++);
    }
}

char uart_getc(void) {
    // Wait until data available
    while (UART0_FR & UART_FR_RXFE);

    return UART0_DR & 0xFF;
}

void uart_init(void) {
    // UART is already initialized by QEMU
    // On real hardware, you'd configure baud rate, etc.
}
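
On real hardware, configuring the baud rate of a PL011 means writing an integer and a fractional divisor (the UARTIBRD and UARTFBRD registers). The sketch below shows only the arithmetic, divisor = clock / (16 * baud) with a 6-bit fraction. The function name pl011_baud_divisors and the 24 MHz clock in the usage note are assumptions for illustration; the real UART clock depends on the board.

```c
#include <assert.h>
#include <stdint.h>

struct pl011_baud {
    uint32_t ibrd;  /* integer part of the divisor */
    uint32_t fbrd;  /* fractional part, in 1/64 units */
};

/* Compute the PL011 divisors for a given UART clock and baud rate. */
static struct pl011_baud pl011_baud_divisors(uint32_t uartclk_hz, uint32_t baud) {
    assert(baud != 0);
    /* divisor * 64 = clk * 64 / (16 * baud) = clk * 4 / baud, rounded
     * to the nearest integer by adding half the denominator first. */
    uint64_t div64 = ((uint64_t)uartclk_hz * 4u + baud / 2u) / baud;
    struct pl011_baud d;
    d.ibrd = (uint32_t)(div64 >> 6);
    d.fbrd = (uint32_t)(div64 & 0x3Fu);
    return d;
}
```

For an assumed 24 MHz UART clock and 115200 baud, this gives an integer divisor of 13 and a fractional divisor of 1.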

Kernel Main

kernel/main.c:

#include "types.h"

extern void uart_init(void);
extern void uart_puts(const char *);

void kernel_main(void) {
    uart_init();

    uart_puts("ARM Kernel Starting...\n");
    uart_puts("Hello from ARM!\n");

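    // Idle. The ARM926EJ-S (ARMv5TE) has no wfe instruction, so use
    // the CP15 wait-for-interrupt operation instead.
    while (1) {
        asm volatile("mcr p15, 0, %0, c7, c0, 4" : : "r"(0) : "memory");
    }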
}

include/types.h:

#ifndef TYPES_H
#define TYPES_H

typedef unsigned char      uint8_t;
typedef unsigned short     uint16_t;
typedef unsigned int       uint32_t;
typedef unsigned long long uint64_t;

typedef signed char        int8_t;
typedef signed short       int16_t;
typedef signed int         int32_t;
typedef signed long long   int64_t;

typedef uint32_t           size_t;
typedef uint8_t            bool;

#define true  1
#define false 0
#define NULL  ((void*)0)

#endif

Testing the Basic Kernel

make
make run

Expected output:

ARM Kernel Starting...
Hello from ARM!

ARM MMU (ARMv7)

Setting Up Page Tables

kernel/mmu.c:

#include "types.h"

extern void uart_puts(const char *);

// First-level page table (16KB aligned)
static uint32_t page_table[4096] __attribute__((aligned(16384)));

// Section descriptor bits
#define PT_SECTION     (1 << 1)
#define PT_B           (1 << 2)  // Bufferable
#define PT_C           (1 << 3)  // Cacheable
#define PT_AP_RW       (3 << 10) // Access: read/write
#define PT_DOMAIN(x)   ((x) << 5)
#define PT_XN          (1 << 4)  // Execute never

void mmu_section(uint32_t virt, uint32_t phys, uint32_t flags) {
    uint32_t idx = virt >> 20;  // 1 MB sections
    page_table[idx] = (phys & 0xFFF00000) | flags | PT_SECTION;
}

void mmu_init(void) {
    uart_puts("Initializing MMU...\n");

    // Clear page table
    for (int i = 0; i < 4096; i++) {
        page_table[i] = 0;
    }

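    // Identity map the first 128 MB (RAM) as cacheable, bufferable memory
    for (uint32_t addr = 0; addr < 0x8000000; addr += 0x100000) {
        mmu_section(addr, addr, PT_AP_RW | PT_DOMAIN(0) | PT_B | PT_C);
    }

    // Identity map the peripheral window at 0x10000000-0x101FFFFF (VIC,
    // timers, UART) uncached; without it the UART faults once the MMU is on
    for (uint32_t addr = 0x10000000; addr < 0x10200000; addr += 0x100000) {
        mmu_section(addr, addr, PT_AP_RW | PT_DOMAIN(0));
    }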

    // Set domain 0 to manager mode
    uint32_t dacr = 0x3;  // Domain 0: manager
    asm volatile("mcr p15, 0, %0, c3, c0, 0" : : "r"(dacr));

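    // Set translation table base
    asm volatile("mcr p15, 0, %0, c2, c0, 0" : : "r"(page_table));

    // Invalidate the TLBs so no stale translations survive
    asm volatile("mcr p15, 0, %0, c8, c7, 0" : : "r"(0));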

    // Enable MMU
    uint32_t sctlr;
    asm volatile("mrc p15, 0, %0, c1, c0, 0" : "=r"(sctlr));
    sctlr |= 0x1;  // Enable MMU (M bit)
    sctlr |= (1 << 12);  // Enable I-cache
    sctlr |= (1 << 2);   // Enable D-cache
    asm volatile("mcr p15, 0, %0, c1, c0, 0" : : "r"(sctlr));

    uart_puts("MMU enabled\n");
}
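
To make the section descriptor format concrete, the following sketch repeats the arithmetic of mmu_section as pure functions that can be checked on a host machine. It reuses the bit definitions from mmu.c; the helper names section_index and section_descriptor are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PT_SECTION   (1u << 1)
#define PT_B         (1u << 2)              /* bufferable */
#define PT_C         (1u << 3)              /* cacheable */
#define PT_AP_RW     (3u << 10)             /* access: read/write */
#define PT_DOMAIN(x) ((uint32_t)(x) << 5)

/* Each first-level entry covers 1 MB, so the index is the top 12 bits. */
static uint32_t section_index(uint32_t virt) {
    return virt >> 20;
}

/* Combine the 1 MB-aligned physical base with the attribute bits. */
static uint32_t section_descriptor(uint32_t phys, uint32_t flags) {
    assert((flags & 0xFFF00000u) == 0);     /* flags must not touch the base */
    return (phys & 0xFFF00000u) | flags | PT_SECTION;
}
```

For example, the UART at 0x101f1000 falls in entry 0x101, and an uncached read/write mapping of that section has the descriptor value 0x10100C02.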

ARM Exception Handling

Vector Table

boot/boot.s (updated):

.section .text.boot
.global _start

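_start:
    @ Exception vector table. Each entry loads pc from the address table
    @ that follows, so the offsets stay valid after the copy to 0x00000000.
    ldr pc, reset_addr
    ldr pc, undefined_addr
    ldr pc, swi_addr
    ldr pc, prefetch_abort_addr
    ldr pc, data_abort_addr
    nop  @ Reserved
    ldr pc, irq_addr
    ldr pc, fiq_addr

reset_addr:          .word reset_handler
undefined_addr:      .word undefined_handler
swi_addr:            .word swi_handler
prefetch_abort_addr: .word prefetch_abort_handler
data_abort_addr:     .word data_abort_handler
reserved_addr:       .word 0
irq_addr:            .word irq_handler
fiq_addr:            .word fiq_handler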

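reset_handler:
    @ IRQ mode has its own banked sp; give it the top 1 KB of the stack
    msr cpsr_c, #0xD2       @ IRQ mode, IRQ and FIQ masked
    ldr sp, =stack_top
    msr cpsr_c, #0xD3       @ Supervisor mode, IRQ and FIQ masked
    ldr r0, =stack_top
    sub sp, r0, #0x400      @ Supervisor stack sits below the IRQ stack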

    @ Copy vector table to 0x00000000
    ldr r0, =_start
    mov r1, #0x0000
    ldmia r0!, {r2-r9}
    stmia r1!, {r2-r9}
    ldmia r0!, {r2-r9}
    stmia r1!, {r2-r9}

    @ Clear BSS
    ldr r0, =__bss_start
    ldr r1, =__bss_end
    mov r2, #0
clear_bss:
    cmp r0, r1
    bhs clear_done      @ unsigned compare, since these are addresses
    str r2, [r0], #4
    b clear_bss

clear_done:
    @ Jump to C code
    bl kernel_main

hang:
    @ No wfe on the ARM926EJ-S (ARMv5TE); use CP15 wait-for-interrupt
    mov r0, #0
    mcr p15, 0, r0, c7, c0, 4
    b hang

@ Exception handlers
undefined_handler:
    b undefined_handler

swi_handler:
    @ System call handler; syscall_handler must be provided by the C code
    stmfd sp!, {r0-r12, lr}
    bl syscall_handler
    ldmfd sp!, {r0-r12, pc}^    @ ^ also restores CPSR from SPSR

prefetch_abort_handler:
    b prefetch_abort_handler

data_abort_handler:
    b data_abort_handler

irq_handler:
    push {r0-r3, r12, lr}
    bl irq_dispatcher
    pop {r0-r3, r12, lr}
    subs pc, lr, #4

fiq_handler:
    b fiq_handler
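
Each exception type has a fixed slot in the vector table and a fixed adjustment to the banked lr on return, which is why the IRQ handler above returns with subs pc, lr, #4. The sketch below tabulates both for ARM (not Thumb) state. The enum and function names are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Exceptions in vector-table order; each slot is one 4-byte instruction. */
enum arm_exception {
    ARM_EXC_RESET,
    ARM_EXC_UNDEFINED,
    ARM_EXC_SWI,
    ARM_EXC_PREFETCH_ABORT,
    ARM_EXC_DATA_ABORT,
    ARM_EXC_RESERVED,
    ARM_EXC_IRQ,
    ARM_EXC_FIQ
};

/* Bytes to subtract from the banked lr to get the return address. */
static const uint8_t lr_adjust[8] = {
    [ARM_EXC_RESET]          = 0,  /* reset does not return */
    [ARM_EXC_UNDEFINED]      = 0,  /* continue after the instruction */
    [ARM_EXC_SWI]            = 0,  /* continue after the swi */
    [ARM_EXC_PREFETCH_ABORT] = 4,  /* retry the aborted instruction */
    [ARM_EXC_DATA_ABORT]     = 8,  /* retry the aborted access */
    [ARM_EXC_RESERVED]       = 0,
    [ARM_EXC_IRQ]            = 4,
    [ARM_EXC_FIQ]            = 4
};

static uint32_t arm_vector_offset(enum arm_exception e) {
    assert((unsigned)e < 8u);
    return (uint32_t)e * 4u;
}

static uint32_t arm_return_address(enum arm_exception e, uint32_t lr) {
    assert((unsigned)e < 8u);
    return lr - lr_adjust[e];
}
```

A data abort handler that wants to retry the faulting access after fixing the mapping would therefore return with subs pc, lr, #8 rather than #4.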

Interrupt Controller

kernel/interrupts.c:

#include "types.h"

extern void uart_puts(const char *);

// Versatile Interrupt Controller
#define VIC_BASE 0x10140000
#define VIC_INTENABLE  (*(volatile uint32_t *)(VIC_BASE + 0x10))
#define VIC_INTDISABLE (*(volatile uint32_t *)(VIC_BASE + 0x14))

// Timer base address
#define TIMER0_BASE 0x101E2000
#define TIMER_LOAD    (*(volatile uint32_t *)(TIMER0_BASE + 0x00))
#define TIMER_VALUE   (*(volatile uint32_t *)(TIMER0_BASE + 0x04))
#define TIMER_CONTROL (*(volatile uint32_t *)(TIMER0_BASE + 0x08))
#define TIMER_INTCLR  (*(volatile uint32_t *)(TIMER0_BASE + 0x0C))

#define TIMER_EN      (1 << 7)
#define TIMER_PERIODIC (1 << 6)
#define TIMER_INTEN   (1 << 5)
#define TIMER_32BIT   (1 << 1)

static uint32_t tick_count = 0;

void irq_dispatcher(void) {
    // For simplicity, assume timer interrupt
    tick_count++;

    if (tick_count % 100 == 0) {
        uart_puts("Tick\n");
    }

    // Clear timer interrupt
    TIMER_INTCLR = 1;
}

// Stub for the swi_handler entry in boot.s; replace with real
// system call decoding
void syscall_handler(void) {
    uart_puts("SWI\n");
}

void timer_init(void) {
    uart_puts("Initializing timer...\n");

    // Set timer to fire every 10ms (assuming 1MHz clock)
    TIMER_LOAD = 10000;

    // Enable timer (periodic, 32-bit, interrupts enabled)
    TIMER_CONTROL = TIMER_EN | TIMER_PERIODIC | TIMER_INTEN | TIMER_32BIT;

    // Enable timer interrupt in VIC (IRQ 4 for timer 0/1)
    VIC_INTENABLE = (1 << 4);

    // Enable IRQs in CPU
    uint32_t cpsr;
    asm volatile("mrs %0, cpsr" : "=r"(cpsr));
    cpsr &= ~(1 << 7);  // Clear I bit (enable IRQ)
    asm volatile("msr cpsr_c, %0" : : "r"(cpsr));

    uart_puts("Timer enabled\n");
}
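
The dispatcher above assumes every IRQ comes from the timer. Once a second device raises interrupts, one common approach is to read the controller's status word and call a handler for each pending line. The sketch below takes the status word as a parameter so it can be tested on a host; on the Versatile PB it would be read from the VIC IRQ status register at VIC_BASE + 0x00. All names here (irq_register, irq_dispatch_status, timer_tick) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IRQ_LINES 32u

typedef void (*irq_handler_fn)(void);

static irq_handler_fn irq_table[IRQ_LINES];
static volatile uint32_t timer_ticks;

/* Example handler: count timer ticks. */
static void timer_tick(void) {
    timer_ticks++;
}

/* Attach a handler to an interrupt line; returns -1 for a bad line. */
static int irq_register(unsigned line, irq_handler_fn fn) {
    if (line >= IRQ_LINES) {
        return -1;
    }
    irq_table[line] = fn;
    return 0;
}

/* Call the handler of every pending line; returns how many ran. */
static unsigned irq_dispatch_status(uint32_t status) {
    unsigned handled = 0;
    for (unsigned line = 0; line < IRQ_LINES; line++) {
        if ((status & (1u << line)) && irq_table[line] != NULL) {
            irq_table[line]();
            handled++;
        }
    }
    return handled;
}
```

With irq_register(4, timer_tick) in place, a status word with bit 4 set runs the timer handler once; lines without a registered handler are skipped, which a real kernel would probably want to log.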

AArch64 (64-bit ARM) Differences

Boot Code (AArch64)

.section .text.boot
.global _start

_start:
    // Check processor ID (multi-core systems)
    mrs x0, mpidr_el1
    and x0, x0, #0xFF
    cbz x0, primary_cpu
    b hang

primary_cpu:
    // Set up stack
    ldr x0, =stack_top
    mov sp, x0

    // Clear BSS
    ldr x0, =__bss_start
    ldr x1, =__bss_end
    mov x2, #0
clear_bss:
    cmp x0, x1
    b.hs clear_done     // unsigned compare, since these are addresses
    str x2, [x0], #8
    b clear_bss

clear_done:
    // Jump to kernel main
    bl kernel_main

hang:
    wfe
    b hang

AArch64 MMU

// 4KB granule, 48-bit virtual address
#define PT_PAGE      (3 << 0)   // Page descriptor
#define PT_BLOCK     (1 << 0)   // Block descriptor
#define PT_TABLE     (3 << 0)   // Table descriptor
#define PT_VALID     (1 << 0)
#define PT_AF        (1 << 10)  // Access flag
#define PT_SH_INNER  (3 << 8)   // Inner shareable
#define PT_ATTR(x)   ((x) << 2) // Memory attributes

void mmu_init_aarch64(void) {
    // Set up page tables (simplified)
    // Real implementation would set up 4-level paging

    // Configure MAIR_EL1 (Memory Attribute Indirection Register)
    uint64_t mair = 0xFF;  // Normal memory
    asm volatile("msr mair_el1, %0" : : "r"(mair));

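    // Configure TCR_EL1 (Translation Control Register)
    uint64_t tcr = 0;
    tcr |= (16ULL << 0);  // T0SZ = 16: 48-bit address space
    tcr |= (1ULL << 8);   // IRGN0: inner write-back, write-allocate cacheable
    tcr |= (1ULL << 10);  // ORGN0: outer write-back, write-allocate cacheable
    tcr |= (3ULL << 12);  // SH0: inner shareable
    tcr |= (0ULL << 14);  // TG0: 4KB granule
    asm volatile("msr tcr_el1, %0" : : "r"(tcr));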

    // Set TTBR0_EL1 (page table base). This must point at a valid
    // translation table before the M bit below is set.
    // asm volatile("msr ttbr0_el1, %0" : : "r"(page_table));

    // Enable MMU
    uint64_t sctlr;
    asm volatile("mrs %0, sctlr_el1" : "=r"(sctlr));
    sctlr |= (1 << 0);  // M bit (MMU enable)
    sctlr |= (1 << 2);  // C bit (data cache)
    sctlr |= (1 << 12); // I bit (instruction cache)
    asm volatile("msr sctlr_el1, %0" : : "r"(sctlr));
    asm volatile("isb");
}
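
With a 4KB granule and 48-bit virtual addresses, each of the four table levels is indexed by 9 bits of the address and the low 12 bits are the offset within the page. The sketch below shows that split; struct va_indices and split_va are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12u                        /* 4KB pages */
#define LEVEL_BITS 9u                         /* 512 entries per table */
#define LEVEL_MASK ((1u << LEVEL_BITS) - 1u)

static_assert(4 * LEVEL_BITS + PAGE_SHIFT == 48, "four levels cover 48 bits");

struct va_indices {
    unsigned l0, l1, l2, l3;  /* table index at each level */
    unsigned offset;          /* byte offset within the 4KB page */
};

/* Split a virtual address into its per-level table indices. */
static struct va_indices split_va(uint64_t va) {
    struct va_indices ix;
    ix.l0 = (unsigned)((va >> 39) & LEVEL_MASK);
    ix.l1 = (unsigned)((va >> 30) & LEVEL_MASK);
    ix.l2 = (unsigned)((va >> 21) & LEVEL_MASK);
    ix.l3 = (unsigned)((va >> 12) & LEVEL_MASK);
    ix.offset = (unsigned)(va & ((1u << PAGE_SHIFT) - 1u));
    return ix;
}
```

A level 1 entry therefore spans 1 GB and a level 2 entry spans 2 MB, which is why simple kernels often map memory with block descriptors at those levels instead of building full level 3 tables.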

Device Tree

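Most ARM peripherals cannot be discovered by probing a bus, so ARM systems use a device tree, a data structure the bootloader passes to the kernel, to describe the hardware.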

Example device tree snippet:

/ {
    compatible = "arm,versatile-pb";
    model = "ARM Versatile PB";

    memory {
        device_type = "memory";
        reg = <0x00000000 0x08000000>;  // 128 MB at 0x0
    };

    uart0: serial@101f1000 {
        compatible = "arm,pl011", "arm,primecell";
        reg = <0x101f1000 0x1000>;
        interrupts = <12>;
    };

    timer0: timer@101e2000 {
        compatible = "arm,sp804", "arm,primecell";
        reg = <0x101e2000 0x1000>;
        interrupts = <4>;
    };
};

Parsing device tree (simplified):

struct fdt_header {
    uint32_t magic;
    uint32_t totalsize;
    // ... more fields
} __attribute__((packed));

void parse_device_tree(void *fdt) {
    struct fdt_header *header = (struct fdt_header *)fdt;

    if (__builtin_bswap32(header->magic) != 0xd00dfeed) {  // FDT fields are big-endian; swap on a little-endian CPU
        uart_puts("Invalid device tree\n");
        return;
    }

    uart_puts("Device tree found\n");
    // Parse nodes and properties...
}
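
Every field in a flattened device tree is stored big-endian, so a little-endian kernel has to convert each value it reads. The sketch below reads a few header fields byte by byte, which avoids both the endianness and the alignment question. The field offsets follow the published header layout (magic, totalsize, off_dt_struct, off_dt_strings, off_mem_rsvmap, version); struct fdt_info and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FDT_MAGIC 0xd00dfeedu

/* Assemble a big-endian 32-bit value from four bytes. */
static uint32_t be32_read(const uint8_t *p) {
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* The subset of header fields this sketch extracts. */
struct fdt_info {
    uint32_t totalsize;
    uint32_t off_dt_struct;
    uint32_t off_dt_strings;
    uint32_t version;
};

/* Returns 0 and fills *out on success, -1 if the magic does not match. */
static int fdt_read_header(const void *blob, struct fdt_info *out) {
    assert(blob != NULL && out != NULL);
    const uint8_t *p = (const uint8_t *)blob;
    if (be32_read(p + 0) != FDT_MAGIC) {
        return -1;
    }
    out->totalsize      = be32_read(p + 4);
    out->off_dt_struct  = be32_read(p + 8);
    out->off_dt_strings = be32_read(p + 12);
    out->version        = be32_read(p + 20);
    return 0;
}
```

The structure block at off_dt_struct and the strings block at off_dt_strings are what a full parser would walk next to find nodes such as serial@101f1000.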

Raspberry Pi Specific

Raspberry Pi 3 Boot

The Raspberry Pi is booted by its GPU rather than by the ARM cores:

1. GPU loads bootcode.bin
2. GPU loads start.elf (GPU firmware)
3. GPU loads kernel8.img (64-bit kernel)
4. GPU starts ARM cores
5. Kernel runs

config.txt for bare metal:

kernel=kernel8.img
arm_64bit=1

Raspberry Pi UART

// BCM2837 (Raspberry Pi 3) Mini UART
#define AUX_ENABLES     (*(volatile uint32_t *)(0x3F215004))
#define AUX_MU_IO_REG   (*(volatile uint32_t *)(0x3F215040))
#define AUX_MU_LSR_REG  (*(volatile uint32_t *)(0x3F215054))

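void rpi_uart_init(void) {
    AUX_ENABLES = 1;  // Enable mini UART
    // A full init also sets the baud rate and 8-bit mode and routes
    // GPIO 14/15 to the mini UART (alternate function 5)
}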

void rpi_uart_putc(char c) {
    while (!(AUX_MU_LSR_REG & 0x20));  // Wait for TX ready
    AUX_MU_IO_REG = c;
}
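
The mini UART derives its baud rate from the core clock as baud = clock / (8 * (reg + 1)), where reg is the value written to the baud rate register (AUX_MU_BAUD_REG, at 0x3F215068 on this SoC). The sketch below only inverts that formula. The function name is hypothetical, and the 250 MHz core clock in the usage note is an assumption that depends on firmware settings.

```c
#include <assert.h>
#include <stdint.h>

/* Compute the mini UART baud register value for a given core clock. */
static uint32_t mini_uart_baud_reg(uint32_t core_clock_hz, uint32_t baud) {
    /* The clock must be at least 8x the baud rate or the result underflows. */
    assert(baud != 0 && core_clock_hz >= 8u * baud);
    return core_clock_hz / (8u * baud) - 1u;
}
```

For an assumed 250 MHz core clock and 115200 baud the register value is 270. Because the result depends on the core clock, a firmware setting that changes that clock also changes the effective baud rate.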

Key Concepts

  • ARM boot starts in supervisor mode (ARMv7) or EL2/EL1 (AArch64)
  • UART is memory-mapped, not port-based
  • MMU uses different page table format than x86
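  • Exception vectors sit at 0x00000000 or 0xFFFF0000 on classic cores; ARMv7-A cores with the Security Extensions can relocate them with VBAR
  • VIC (Vectored Interrupt Controller) manages interrupts on the Versatile PB; newer platforms typically use a GIC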
  • Device tree describes platform hardware
  • AArch64 uses 4-level page tables (with a 4KB granule and 48-bit addresses), similar to x64
  • No BIOS - bootloader responsibilities differ

Common Mistakes

  1. Wrong base addresses - Each platform has different peripheral addresses
  2. Endianness confusion - ARM can be little or big endian
  3. Cache coherency - Not invalidating caches and TLBs before enabling the MMU
  4. Alignment - Unaligned accesses can fault or return rotated data, depending on the core and its configuration
  5. Missing memory barriers - ARM has relaxed memory model
  6. Wrong exception return - Use subs pc, lr, #4 for IRQ
  7. Forgetting device tree - Real hardware needs proper device enumeration

Debugging Tips

  • Use UART early - First thing to get working
  • QEMU is your friend - Test before real hardware
  • GDB multiarch - Use gdb-multiarch for ARM
  • Check alignment - Unaligned accesses fault on some cores and configurations
  • Memory barriers - Use dmb, dsb, isb appropriately
  • Read manuals - ARM Architecture Reference Manual is essential
  • Start with QEMU - Versatile PB is well-supported

Mini Exercises

  1. Create a basic ARM kernel that prints to UART
  2. Implement simple printf for UART
  3. Set up MMU with identity mapping
  4. Create exception handlers for all vectors
  5. Initialize timer interrupt
  6. Implement basic keyboard/UART input
  7. Parse device tree to find UART address
  8. Port kernel to Raspberry Pi
  9. Implement AArch64 boot code
  10. Add multi-core support (boot secondary cores)

Review Questions

  1. How does ARM boot process differ from x86?
  2. What is a device tree and why is it used?
  3. How do you enable the MMU on ARMv7?
  4. What are the ARM exception vectors?
  5. How does UART differ between ARM and x86?

Reference Checklist

By the end of this chapter, you should be able to:

  • Set up ARM cross-compilation toolchain
  • Write ARM boot assembly code
  • Initialize UART for serial output
  • Set up ARM MMU (ARMv7)
  • Handle ARM exceptions and interrupts
  • Initialize interrupt controller (VIC)
  • Set up timer interrupts
  • Understand device trees
  • Port kernel between ARM platforms
  • Use QEMU for ARM kernel testing

Next Steps

With both x86/x64 and ARM kernel experience behind you, you are ready for the next chapter, which explores Unix, Linux, and shell scripting. You'll learn Linux system programming, shell scripting for automation, and how to interact with the Linux kernel from user space.


Key Takeaway: ARM kernel development differs from x86 in boot process, memory management, and peripheral access. Understanding these differences and using device trees enables you to write kernels for the vast ARM ecosystem.