Chapter 16: ARM Kernel Development
Introduction
ARM processors power billions of devices: smartphones, tablets, embedded systems, and increasingly servers and desktops. ARM kernel development differs significantly from x86/x64 because of the RISC architecture, a different boot process, and widely varied hardware platforms. This chapter guides you through creating a kernel for the ARM architecture.
Why This Matters
ARM is everywhere. From Raspberry Pi to Apple M-series chips, from IoT devices to automotive systems, ARM dominates mobile and embedded computing. Understanding ARM kernel development opens opportunities in mobile OS development, embedded systems, and the growing ARM server market.
How to Study This Chapter
- Understand RISC principles - ARM is simpler than x86 in many ways
- Target specific hardware - ARM has many variants (Raspberry Pi, Versatile, etc.)
- Use device trees - ARM systems describe hardware via device trees
- Test in QEMU - Start with emulation before real hardware
- Read ARM manuals - ARMv7/ARMv8 architecture reference manuals
ARM Boot Process
ARM vs x86 Boot
| Aspect | x86/x64 | ARM |
|---|---|---|
| Firmware | BIOS/UEFI | U-Boot/Vendor bootloader |
| Entry Mode | 16-bit real mode | 32/64-bit mode (depends on variant) |
| Entry Point | 0xFFFFFFF0 | Platform-specific |
| Boot Standard | MBR/GPT | Platform-specific |
| Device Info | ACPI tables, PCI enumeration | Device tree |
Typical ARM Boot Sequence
1. Power On
↓
2. Boot ROM (SoC-specific, in silicon)
↓
3. First-stage bootloader (U-Boot SPL)
↓
4. Second-stage bootloader (U-Boot)
↓
5. Load kernel + device tree
↓
6. Jump to kernel entry (with parameters)
↓
7. Kernel initializes and runs
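Step 6 mentions parameters. As a rough sketch, bootloaders that follow the 32-bit ARM Linux boot convention pass r0 = 0, r1 = a machine type number, and r2 = the address of either a device tree blob or an ATAGS list. The helper below inspects the memory r2 points at to tell the two apart. The names `boot_info_kind` and `classify_boot_info` are made up for illustration, and whether your bootloader follows this convention is an assumption to verify.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical names for illustration; not part of the chapter's kernel. */
enum boot_info_kind { BOOT_INFO_NONE, BOOT_INFO_DTB, BOOT_INFO_ATAGS };

#define FDT_MAGIC 0xd00dfeedu /* Stored big-endian at offset 0 of a DTB */
#define ATAG_CORE 0x54410001u /* Tag word of the first entry in an ATAGS list */

/* Assemble a big-endian 32-bit value byte by byte (no alignment assumptions). */
static uint32_t load_be32(const uint8_t *p) {
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Assemble a little-endian 32-bit value byte by byte. */
static uint32_t load_le32(const uint8_t *p) {
    return ((uint32_t)p[3] << 24) | ((uint32_t)p[2] << 16) |
           ((uint32_t)p[1] << 8) | (uint32_t)p[0];
}

/* Decide what the pointer handed over in r2 refers to. */
enum boot_info_kind classify_boot_info(const void *r2) {
    const uint8_t *p = (const uint8_t *)r2;
    if (p == NULL) {
        return BOOT_INFO_NONE;
    }
    if (load_be32(p) == FDT_MAGIC) {
        return BOOT_INFO_DTB;
    }
    /* Each ATAG starts with a size word followed by a tag word. */
    if (load_le32(p + 4) == ATAG_CORE) {
        return BOOT_INFO_ATAGS;
    }
    return BOOT_INFO_NONE;
}
```

To use something like this, the entry code would have to preserve r2 and pass it to `kernel_main` as an argument before anything else overwrites it.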
Project Setup for ARM
Directory Structure
arm-kernel/
├── boot/
│ └── boot.s # ARM entry point
├── kernel/
│ ├── main.c # Kernel main
│ ├── uart.c # Serial driver
│ ├── mmu.c # Memory management
│ └── interrupts.c # Exception/interrupt handling
├── include/
│ └── types.h
├── linker.ld
└── Makefile
Cross-Compilation Toolchain
# Install ARM cross-compiler (Ubuntu/Debian)
sudo apt-get install gcc-arm-none-eabi gdb-multiarch
# Or for Linux userspace:
sudo apt-get install gcc-arm-linux-gnueabi
# Verify installation
arm-none-eabi-gcc --version
Makefile for ARM
# Makefile for ARM kernel (bare metal)
CC = arm-none-eabi-gcc
LD = arm-none-eabi-ld
OBJCOPY = arm-none-eabi-objcopy
QEMU = qemu-system-arm
# For Versatile PB (ARM926EJ-S, an ARMv5TEJ core)
CFLAGS = -mcpu=arm926ej-s -mfloat-abi=soft -nostdlib -ffreestanding \
	-Iinclude -Wall -Wextra -O2
LDFLAGS = -T linker.ld
# Object files to link
OBJECTS = boot/boot.o kernel/main.o kernel/uart.o kernel/mmu.o kernel/interrupts.o
TARGET = kernel.elf
BINARY = kernel.bin
all: $(BINARY)
boot/boot.o: boot/boot.s
	$(CC) $(CFLAGS) -c -o $@ $<
%.o: %.c
	$(CC) $(CFLAGS) -c -o $@ $<
$(TARGET): $(OBJECTS) linker.ld
	$(LD) $(LDFLAGS) -o $@ $(OBJECTS)
$(BINARY): $(TARGET)
	$(OBJCOPY) -O binary $< $@
# -nographic already routes the serial port to the terminal; adding
# -serial stdio as well makes QEMU complain that stdio is used twice
run: $(BINARY)
	$(QEMU) -M versatilepb -m 128M -kernel $(TARGET) -nographic
debug: $(BINARY)
	$(QEMU) -M versatilepb -m 128M -kernel $(TARGET) -serial stdio -s -S &
	gdb-multiarch $(TARGET) \
		-ex "target remote :1234" \
		-ex "break kernel_main" \
		-ex "continue"
clean:
	rm -f boot/*.o kernel/*.o $(TARGET) $(BINARY)
.PHONY: all run debug clean
ARM Boot Code (ARMv7)
Linker Script
linker.ld:
ENTRY(_start)
SECTIONS
{
    . = 0x10000; /* Kernel load address for Versatile */
    .text : {
        *(.text.boot)
        *(.text*)
    }
    .rodata : {
        *(.rodata*) /* Also catches string literal subsections */
    }
    .data : {
        *(.data*)
    }
    .bss : {
        . = ALIGN(4); /* The boot code clears BSS one word at a time */
        __bss_start = .;
        *(.bss*)
        *(COMMON)
        . = ALIGN(4);
        __bss_end = .;
    }
    . = ALIGN(8);
    . = . + 0x1000; /* 4KB stack */
    stack_top = .;
}
Boot Assembly (ARMv7)
boot/boot.s:
.section .text.boot
.global _start
_start:
    @ We enter in supervisor mode
    @ Set up stack pointer
    ldr sp, =stack_top
    @ Clear BSS section
    ldr r0, =__bss_start
    ldr r1, =__bss_end
    mov r2, #0
clear_bss:
    cmp r0, r1
    bhs clear_done          @ Unsigned compare: addresses are not signed
    str r2, [r0], #4
    b clear_bss
clear_done:
    @ Jump to C code
    bl kernel_main
    @ Hang if kernel returns
hang:
    @ The ARM926EJ-S (ARMv5TE) has no wfe instruction, so just spin
    b hang
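For readers more comfortable with C, the BSS-clearing loop above can be sketched as follows. The function name `clear_words` is hypothetical. In a kernel it would be called with the addresses of the linker symbols `__bss_start` and `__bss_end`, and only after the stack pointer is valid, which is why the chapter does this step in assembly.

```c
#include <stdint.h>

/* Zero every 32-bit word in [start, end).
 * The bounds are compared as unsigned addresses, matching what the
 * assembly loop needs to do for memory in the upper half of the map. */
void clear_words(uint32_t *start, uint32_t *end) {
    while ((uintptr_t)start < (uintptr_t)end) {
        *start++ = 0;
    }
}
```

One caveat, stated as an assumption about compiler behavior rather than a guarantee: at higher optimization levels GCC may turn a loop like this into a call to `memset`, which a freestanding kernel would then have to provide itself.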
UART Driver (Serial Output)
ARM platforms use memory-mapped UART (not port I/O like x86).
kernel/uart.c:
#include "types.h"
// UART0 base address for Versatile PB
#define UART0_BASE 0x101f1000
#define UART0_DR (*(volatile uint32_t *)(UART0_BASE + 0x00)) // Data register
#define UART0_FR (*(volatile uint32_t *)(UART0_BASE + 0x18)) // Flag register
// Flag register bits
#define UART_FR_TXFF (1 << 5) // Transmit FIFO full
#define UART_FR_RXFE (1 << 4) // Receive FIFO empty
void uart_putc(char c) {
    // Wait until transmit FIFO not full
    while (UART0_FR & UART_FR_TXFF);
    UART0_DR = c;
}
void uart_puts(const char *str) {
    while (*str) {
        if (*str == '\n') {
            uart_putc('\r'); // Add carriage return
        }
        uart_putc(*str++);
    }
}
char uart_getc(void) {
    // Wait until data available
    while (UART0_FR & UART_FR_RXFE);
    return UART0_DR & 0xFF;
}
void uart_init(void) {
    // UART is already initialized by QEMU
    // On real hardware, you'd configure baud rate, etc.
}
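Once `uart_puts` works, printing register values and addresses is usually the next need. The sketch below formats a 32-bit value as hexadecimal into a caller-supplied buffer, so it has no dependency on a C library. The name `format_hex32` is a hypothetical helper, not something defined elsewhere in this chapter.

```c
#include <stdint.h>

/* Write "0x", eight uppercase hex digits, and a terminating NUL into out.
 * The caller must supply a buffer of at least 11 bytes. */
void format_hex32(uint32_t value, char out[11]) {
    static const char digits[] = "0123456789ABCDEF";
    out[0] = '0';
    out[1] = 'x';
    for (int i = 0; i < 8; i++) {
        /* Extract nibbles from most significant to least significant. */
        out[2 + i] = digits[(value >> (28 - 4 * i)) & 0xF];
    }
    out[10] = '\0';
}
```

In the kernel this would be used as `char buf[11]; format_hex32(addr, buf); uart_puts(buf);`. Keeping the formatting separate from the UART makes it easy to check on a development machine before it runs on the target.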
Kernel Main
kernel/main.c:
#include "types.h"
extern void uart_init(void);
extern void uart_puts(const char *);
void kernel_main(void) {
    uart_init();
    uart_puts("ARM Kernel Starting...\n");
    uart_puts("Hello from ARM!\n");
    // Hang. The ARM926EJ-S (ARMv5TE) has no wfe instruction, so spin here;
    // on ARMv6K and later cores, asm volatile("wfe") could idle the core instead.
    while (1) {
    }
}
include/types.h:
#ifndef TYPES_H
#define TYPES_H
typedef unsigned char uint8_t;
typedef unsigned short uint16_t;
typedef unsigned int uint32_t;
typedef unsigned long long uint64_t;
typedef signed char int8_t;
typedef signed short int16_t;
typedef signed int int32_t;
typedef signed long long int64_t;
typedef uint32_t size_t;
typedef uint8_t bool;
#define true 1
#define false 0
#define NULL ((void*)0)
#endif
Testing the Basic Kernel
make
make run
Expected output:
ARM Kernel Starting...
Hello from ARM!
ARM MMU (ARMv7)
Setting Up Page Tables
kernel/mmu.c:
#include "types.h"
extern void uart_puts(const char *);
// First-level page table (4096 entries, 16KB aligned)
static uint32_t page_table[4096] __attribute__((aligned(16384)));
// Section descriptor bits
#define PT_SECTION (1 << 1)
#define PT_B (1 << 2) // Bufferable
#define PT_C (1 << 3) // Cacheable
#define PT_AP_RW (3 << 10) // Access: read/write
#define PT_DOMAIN(x) ((x) << 5)
#define PT_XN (1 << 4) // Execute never (ARMv6 and later)
void mmu_section(uint32_t virt, uint32_t phys, uint32_t flags) {
    uint32_t idx = virt >> 20; // 1 MB sections
    page_table[idx] = (phys & 0xFFF00000) | flags | PT_SECTION;
}
void mmu_init(void) {
    uart_puts("Initializing MMU...\n");
    // Clear page table (every entry becomes a translation fault)
    for (int i = 0; i < 4096; i++) {
        page_table[i] = 0;
    }
    // Identity map the first 128 MB (RAM) as cacheable, bufferable memory
    for (uint32_t addr = 0; addr < 0x8000000; addr += 0x100000) {
        mmu_section(addr, addr, PT_AP_RW | PT_DOMAIN(0) | PT_B | PT_C);
    }
    // Identity map the peripherals (UART, timer, VIC at 0x101xxxxx) as
    // uncached, unbuffered memory; without this the next uart_puts aborts
    for (uint32_t addr = 0x10000000; addr < 0x10200000; addr += 0x100000) {
        mmu_section(addr, addr, PT_AP_RW | PT_DOMAIN(0));
    }
    // Set domain 0 to manager mode (access permissions are not checked)
    uint32_t dacr = 0x3; // Domain 0: manager
    asm volatile("mcr p15, 0, %0, c3, c0, 0" : : "r"(dacr));
    // Set translation table base
    asm volatile("mcr p15, 0, %0, c2, c0, 0" : : "r"(page_table));
    // Invalidate stale TLB entries before turning translation on
    asm volatile("mcr p15, 0, %0, c8, c7, 0" : : "r"(0));
    // Enable MMU
    uint32_t sctlr;
    asm volatile("mrc p15, 0, %0, c1, c0, 0" : "=r"(sctlr));
    sctlr |= 0x1; // Enable MMU (M bit)
    sctlr |= (1 << 12); // Enable I-cache
    sctlr |= (1 << 2); // Enable D-cache
    asm volatile("mcr p15, 0, %0, c1, c0, 0" : : "r"(sctlr));
    uart_puts("MMU enabled\n");
}
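To make the descriptor arithmetic in `mmu_section` concrete, here is a small host-testable sketch that computes the table index and descriptor value for a given address. The function names `section_index` and `section_descriptor` are hypothetical; the bit definitions repeat the ones from the listing above.

```c
#include <stdint.h>

#define PT_SECTION   (1u << 1)
#define PT_B         (1u << 2)           /* Bufferable */
#define PT_C         (1u << 3)           /* Cacheable */
#define PT_DOMAIN(x) ((uint32_t)(x) << 5)
#define PT_AP_RW     (3u << 10)          /* Access: read/write */

/* Index of the first-level entry that covers a virtual address.
 * Each entry maps 1 MB, so the index is the top 12 address bits. */
uint32_t section_index(uint32_t virt) {
    return virt >> 20;
}

/* Descriptor value for a 1 MB section starting at phys.
 * The low 20 bits of phys are dropped because sections are 1 MB aligned. */
uint32_t section_descriptor(uint32_t phys, uint32_t flags) {
    return (phys & 0xFFF00000u) | flags | PT_SECTION;
}
```

Worked through by hand: the UART at 0x101f1000 falls in entry 0x101, and an uncached read/write mapping for it has the descriptor 0x10100C02 (base 0x10100000, AP bits 0xC00, section bit 0x2). A cached RAM section at address 0 comes out as 0x00000C0E.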
ARM Exception Handling
Vector Table
boot/boot.s (updated):
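.section .text.boot
.global _start
_start:
    @ Exception vector table. Each entry loads pc from the address table
    @ that follows, so the vectors still work after being copied to 0x0.
    ldr pc, reset_addr
    ldr pc, undefined_addr
    ldr pc, swi_addr
    ldr pc, prefetch_abort_addr
    ldr pc, data_abort_addr
    nop                     @ Reserved
    ldr pc, irq_addr
    ldr pc, fiq_addr
reset_addr:          .word reset_handler
undefined_addr:      .word undefined_handler
swi_addr:            .word swi_handler
prefetch_abort_addr: .word prefetch_abort_handler
data_abort_addr:     .word data_abort_handler
                     .word 0            @ Padding: 16 words in total
irq_addr:            .word irq_handler
fiq_addr:            .word fiq_handler
reset_handler:
    @ IRQ mode has its own banked sp: give it the top 1 KB of the stack area
    msr cpsr_c, #0xD2       @ IRQ mode, IRQ and FIQ masked
    ldr sp, =stack_top
    msr cpsr_c, #0xD3       @ Back to supervisor mode, IRQ and FIQ masked
    ldr r0, =stack_top
    sub sp, r0, #0x400      @ Supervisor stack sits below the IRQ stack
    @ Copy vectors plus address table (16 words) to 0x00000000
    ldr r0, =_start
    mov r1, #0x0000
    ldmia r0!, {r2-r9}
    stmia r1!, {r2-r9}
    ldmia r0!, {r2-r9}
    stmia r1!, {r2-r9}
    @ Clear BSS
    ldr r0, =__bss_start
    ldr r1, =__bss_end
    mov r2, #0
clear_bss:
    cmp r0, r1
    bhs clear_done          @ Unsigned compare for addresses
    str r2, [r0], #4
    b clear_bss
clear_done:
    @ Jump to C code
    bl kernel_main
hang:
    @ No wfe on the ARM926EJ-S (ARMv5TE), so spin
    b hang
@ Exception handlers
undefined_handler:
    b undefined_handler
swi_handler:
    @ System call handler (syscall_handler is a C function the kernel must define)
    push {r0-r12, lr}
    bl syscall_handler
    ldmfd sp!, {r0-r12, pc}^    @ Restore registers and CPSR from SPSR
prefetch_abort_handler:
    b prefetch_abort_handler
data_abort_handler:
    b data_abort_handler
irq_handler:
    push {r0-r3, r12, lr}
    bl irq_dispatcher
    pop {r0-r3, r12, lr}
    subs pc, lr, #4
fiq_handler:
    b fiq_handler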
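The vector layout and the `subs pc, lr, #4` return in `irq_handler` follow fixed rules of the classic 32-bit ARM exception model. The sketch below records them as data so they can be checked in one place: the byte offset of each vector, and how much to subtract from the banked `lr` to resume in ARM state. The enum and function names are hypothetical, introduced only for this illustration.

```c
#include <stdint.h>

/* Exceptions in vector-table order. */
enum arm_exception {
    EXC_RESET = 0,
    EXC_UNDEFINED,
    EXC_SWI,
    EXC_PREFETCH_ABORT,
    EXC_DATA_ABORT,
    EXC_RESERVED,
    EXC_IRQ,
    EXC_FIQ
};

/* Byte offset of a vector from the vector base: one 4-byte slot each. */
uint32_t vector_offset(enum arm_exception e) {
    return (uint32_t)e * 4u;
}

/* Bytes to subtract from lr when returning (ARM state).
 * Returns -1 for entries that have no meaningful return. */
int return_adjust(enum arm_exception e) {
    switch (e) {
    case EXC_UNDEFINED:      return 0; /* Continue after the instruction */
    case EXC_SWI:            return 0; /* Continue after the swi */
    case EXC_PREFETCH_ABORT: return 4; /* Retry the faulting fetch */
    case EXC_DATA_ABORT:     return 8; /* Retry the faulting access */
    case EXC_IRQ:            return 4; /* Resume the interrupted instruction */
    case EXC_FIQ:            return 4;
    default:                 return -1;
    }
}
```

So the IRQ vector lives at offset 0x18 and its handler returns with an adjustment of 4, while a data abort handler that wants to retry the access would use `subs pc, lr, #8`.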
Interrupt Controller
kernel/interrupts.c:
#include "types.h"
extern void uart_puts(const char *);
// Versatile Interrupt Controller
#define VIC_BASE 0x10140000
#define VIC_INTENABLE (*(volatile uint32_t *)(VIC_BASE + 0x10))
#define VIC_INTDISABLE (*(volatile uint32_t *)(VIC_BASE + 0x14))
// Timer base address
#define TIMER0_BASE 0x101E2000
#define TIMER_LOAD (*(volatile uint32_t *)(TIMER0_BASE + 0x00))
#define TIMER_VALUE (*(volatile uint32_t *)(TIMER0_BASE + 0x04))
#define TIMER_CONTROL (*(volatile uint32_t *)(TIMER0_BASE + 0x08))
#define TIMER_INTCLR (*(volatile uint32_t *)(TIMER0_BASE + 0x0C))
#define TIMER_EN (1 << 7)
#define TIMER_PERIODIC (1 << 6)
#define TIMER_INTEN (1 << 5)
#define TIMER_32BIT (1 << 1)
static uint32_t tick_count = 0;
void irq_dispatcher(void) {
    // For simplicity, assume timer interrupt
    tick_count++;
    if (tick_count % 100 == 0) {
        uart_puts("Tick\n");
    }
    // Clear timer interrupt
    TIMER_INTCLR = 1;
}
void timer_init(void) {
    uart_puts("Initializing timer...\n");
    // Set timer to fire every 10ms (assuming 1MHz clock)
    TIMER_LOAD = 10000;
    // Enable timer (periodic, 32-bit, interrupts enabled)
    TIMER_CONTROL = TIMER_EN | TIMER_PERIODIC | TIMER_INTEN | TIMER_32BIT;
    // Enable timer interrupt in VIC (IRQ 4 for timer 0/1)
    VIC_INTENABLE = (1 << 4);
    // Enable IRQs in CPU
    uint32_t cpsr;
    asm volatile("mrs %0, cpsr" : "=r"(cpsr));
    cpsr &= ~(1 << 7); // Clear I bit (enable IRQ)
    asm volatile("msr cpsr_c, %0" : : "r"(cpsr));
    uart_puts("Timer enabled\n");
}
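The dispatcher above assumes every interrupt comes from the timer. Once a second source is enabled, the handler has to ask the controller which lines are active. A minimal sketch of that idea follows: a table of handlers indexed by IRQ number, and a dispatch function that walks the bits of a status word. All names here (`irq_register`, `irq_dispatch_status`, `timer_tick`) are hypothetical, and the status word is passed in as a parameter so the logic can be tested without hardware.

```c
#include <stdint.h>
#include <stddef.h>

#define NUM_IRQS 32

typedef void (*irq_handler_fn)(void);

/* One slot per interrupt line; NULL means no handler is installed. */
static irq_handler_fn irq_table[NUM_IRQS];

/* Install a handler for one interrupt line. Out-of-range lines are ignored. */
void irq_register(unsigned irq, irq_handler_fn fn) {
    if (irq < NUM_IRQS) {
        irq_table[irq] = fn;
    }
}

/* Call the handler for every bit set in status; return how many ran. */
unsigned irq_dispatch_status(uint32_t status) {
    unsigned handled = 0;
    for (unsigned irq = 0; irq < NUM_IRQS; irq++) {
        if ((status & (1u << irq)) != 0 && irq_table[irq] != NULL) {
            irq_table[irq]();
            handled++;
        }
    }
    return handled;
}

/* Example handler: counts timer ticks. */
static unsigned timer_ticks;
static void timer_tick(void) {
    timer_ticks++;
}
```

On the Versatile PB, the status word would come from the VIC's IRQ status register, which to my understanding of the PL190 register map sits at offset 0x00 from `VIC_BASE`. Treat that offset as something to confirm against the controller's documentation. The timer handler would also still need to write `TIMER_INTCLR` to acknowledge the interrupt.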
AArch64 (64-bit ARM) Differences
Boot Code (AArch64)
.section .text.boot
.global _start
_start:
    // Check processor ID (multi-core systems)
    mrs x0, mpidr_el1
    and x0, x0, #0xFF
    cbz x0, primary_cpu
    b hang                  // Park secondary cores
primary_cpu:
    // Set up stack
    ldr x0, =stack_top
    mov sp, x0
    // Clear BSS
    ldr x0, =__bss_start
    ldr x1, =__bss_end
    mov x2, #0
clear_bss:
    cmp x0, x1
    b.hs clear_done         // Unsigned compare for addresses
    str x2, [x0], #8
    b clear_bss
clear_done:
    // Jump to kernel main
    bl kernel_main
hang:
    wfe
    b hang
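Two system registers matter early in AArch64 boot: `MPIDR_EL1`, which the code above masks to find the core number, and `CurrentEL`, which tells the kernel which exception level the firmware left it in. The sketch below decodes raw values of both. The function names are hypothetical, and the values would be read with `mrs` on the target.

```c
#include <stdint.h>

/* Aff0, bits [7:0] of MPIDR_EL1: the core number within its cluster.
 * This is the same mask the boot code applies with "and x0, x0, #0xFF". */
unsigned mpidr_core_id(uint64_t mpidr) {
    return (unsigned)(mpidr & 0xFF);
}

/* CurrentEL holds the exception level in bits [3:2]; other bits are zero. */
unsigned current_el(uint64_t currentel) {
    return (unsigned)((currentel >> 2) & 0x3);
}
```

A kernel that finds itself at EL2 typically has to drop to EL1 before the EL1 registers used in the next section take effect for its own code. On systems with more than one cluster, checking Aff0 alone is not enough to identify a unique primary core, so the higher affinity fields would need to be compared as well.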
AArch64 MMU
// 4KB granule, 48-bit virtual address
#define PT_PAGE (3 << 0) // Page descriptor
#define PT_BLOCK (1 << 0) // Block descriptor
#define PT_TABLE (3 << 0) // Table descriptor
#define PT_VALID (1 << 0)
#define PT_AF (1 << 10) // Access flag
#define PT_SH_INNER (3 << 8) // Inner shareable
#define PT_ATTR(x) ((x) << 2) // Memory attributes (index into MAIR_EL1)
void mmu_init_aarch64(void) {
    // Set up page tables (simplified)
    // Real implementation would set up 4-level paging
    // Configure MAIR_EL1 (Memory Attribute Indirection Register)
    uint64_t mair = 0xFF; // Attr0: normal memory, write-back cacheable
    asm volatile("msr mair_el1, %0" : : "r"(mair));
    // Configure TCR_EL1 (Translation Control Register)
    uint64_t tcr = 0;
    tcr |= (16 << 0); // T0SZ = 16: 64 - 16 = 48-bit address space
    tcr |= (1 << 8); // IRGN0: inner write-back, write-allocate cacheable walks
    tcr |= (1 << 10); // ORGN0: outer write-back, write-allocate cacheable walks
    tcr |= (3 << 12); // SH0: inner shareable
    tcr |= (0 << 14); // TG0: 4KB granule
    asm volatile("msr tcr_el1, %0" : : "r"(tcr));
    // Set TTBR0_EL1 (page table base)
    // This must point at a populated table before the MMU is enabled below;
    // with it left commented out, the first fetch after the enable faults
    // asm volatile("msr ttbr0_el1, %0" : : "r"(page_table));
    // Enable MMU
    uint64_t sctlr;
    asm volatile("mrs %0, sctlr_el1" : "=r"(sctlr));
    sctlr |= (1 << 0); // M bit (MMU enable)
    sctlr |= (1 << 2); // C bit (data cache)
    sctlr |= (1 << 12); // I bit (instruction cache)
    asm volatile("msr sctlr_el1, %0" : : "r"(sctlr));
    asm volatile("isb");
}
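With a 4KB granule and 48-bit addresses, a virtual address splits into four 9-bit table indices plus a 12-bit page offset. The sketch below shows that split and the T0SZ calculation used when filling in `TCR_EL1`. The function names `pt_index` and `tcr_tsz` are hypothetical.

```c
#include <stdint.h>

/* Index into the translation table at the given level (0 to 3).
 * Level 0 uses bits [47:39], level 1 [38:30], level 2 [29:21],
 * and level 3 [20:12]; bits [11:0] are the offset within the page. */
unsigned pt_index(uint64_t va, unsigned level) {
    unsigned shift = 39 - 9 * level; /* 39, 30, 21, 12 */
    return (unsigned)((va >> shift) & 0x1FF);
}

/* T0SZ and T1SZ encode the region size as 64 minus the VA width in bits. */
unsigned tcr_tsz(unsigned va_bits) {
    return 64 - va_bits;
}
```

For example, `tcr_tsz(48)` gives the 16 written into T0SZ in the listing above, and a 39-bit address space would use 25 and start its walk at level 1 instead of level 0.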
Device Tree
ARM systems use device trees to describe hardware.
Example device tree snippet:
/ {
    compatible = "arm,versatile-pb";
    model = "ARM Versatile PB";
    memory {
        device_type = "memory";
        reg = <0x00000000 0x08000000>; // 128 MB at 0x0
    };
    uart0: serial@101f1000 {
        compatible = "arm,pl011", "arm,primecell";
        reg = <0x101f1000 0x1000>;
        interrupts = <12>;
    };
    timer0: timer@101e2000 {
        compatible = "arm,sp804", "arm,primecell";
        reg = <0x101e2000 0x1000>;
        interrupts = <4>;
    };
};
Parsing device tree (simplified):
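extern void uart_puts(const char *);
struct fdt_header {
    uint32_t magic;
    uint32_t totalsize;
    // ... more fields
} __attribute__((packed));
#define FDT_MAGIC 0xd00dfeed
// All FDT header fields are stored big-endian, so swap the bytes
// before comparing on a little-endian ARM core
static uint32_t fdt32_to_cpu(uint32_t x) {
    return ((x & 0x000000FF) << 24) | ((x & 0x0000FF00) << 8) |
           ((x & 0x00FF0000) >> 8) | ((x & 0xFF000000) >> 24);
}
void parse_device_tree(void *fdt) {
    struct fdt_header *header = (struct fdt_header *)fdt;
    if (fdt32_to_cpu(header->magic) != FDT_MAGIC) {
        uart_puts("Invalid device tree\n");
        return;
    }
    uart_puts("Device tree found\n");
    // Parse nodes and properties...
}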
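Before walking any nodes, it helps to validate more of the header than the magic number. The sketch below reads fields byte by byte, which avoids both endianness and alignment problems, and checks that the structure and strings blocks lie inside the blob. The field offsets follow the flattened device tree header layout (magic at 0, totalsize at 4, off_dt_struct at 8, off_dt_strings at 12); the function and constant names are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

#define FDT_MAGIC 0xd00dfeedu
#define FDT_HEADER_SIZE 40u /* Ten 32-bit fields */

/* Byte offsets of the header fields used here. */
enum {
    FDT_OFF_MAGIC = 0,
    FDT_OFF_TOTALSIZE = 4,
    FDT_OFF_DT_STRUCT = 8,
    FDT_OFF_DT_STRINGS = 12
};

/* Read a big-endian 32-bit field at a byte offset inside the blob. */
static uint32_t fdt_be32(const uint8_t *blob, size_t offset) {
    const uint8_t *p = blob + offset;
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Return the blob's total size in bytes, or 0 if the header looks invalid. */
uint32_t fdt_check_header(const void *fdt) {
    const uint8_t *blob = (const uint8_t *)fdt;
    if (blob == NULL || fdt_be32(blob, FDT_OFF_MAGIC) != FDT_MAGIC) {
        return 0;
    }
    uint32_t total = fdt_be32(blob, FDT_OFF_TOTALSIZE);
    uint32_t off_struct = fdt_be32(blob, FDT_OFF_DT_STRUCT);
    uint32_t off_strings = fdt_be32(blob, FDT_OFF_DT_STRINGS);
    /* Both blocks must start inside the blob, after the header. */
    if (total < FDT_HEADER_SIZE || off_struct < FDT_HEADER_SIZE ||
        off_struct >= total || off_strings >= total) {
        return 0;
    }
    return total;
}
```

A parser built on this would go on to walk the structure block token by token, looking up property names in the strings block.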
Raspberry Pi Specific
Raspberry Pi 3 Boot
The Raspberry Pi boots through its GPU:
1. GPU loads bootcode.bin
2. GPU loads start.elf (GPU firmware)
3. GPU loads kernel8.img (64-bit kernel)
4. GPU starts ARM cores
5. Kernel runs
config.txt for bare metal:
kernel=kernel8.img
arm_64bit=1
Raspberry Pi UART
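// BCM2837 (Raspberry Pi 3) Mini UART
#define AUX_ENABLES (*(volatile uint32_t *)(0x3F215004))
#define AUX_MU_IO_REG (*(volatile uint32_t *)(0x3F215040))
#define AUX_MU_LSR_REG (*(volatile uint32_t *)(0x3F215054))
void rpi_uart_init(void) {
    AUX_ENABLES |= 1; // Enable mini UART (bit 0), leaving the SPI enable bits alone
    // Real hardware also needs the baud rate, 8-bit mode, and GPIO 14/15
    // switched to their mini UART alternate function before output appears
}
void rpi_uart_putc(char c) {
    while (!(AUX_MU_LSR_REG & 0x20)); // Wait until the transmitter can accept a byte
    AUX_MU_IO_REG = c;
}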
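The init function above does not program a baud rate. On the mini UART the rate derives from the core clock as baud = clock / (8 * (reg + 1)), where reg is the value written to the baud register. The sketch below solves that for reg. The function name is hypothetical, and the 250 MHz core clock used in the example is an assumption about the firmware's default that should be checked for your board and config.txt settings.

```c
#include <stdint.h>

/* Compute the mini UART baud register value for a target baud rate.
 * From baud = clock / (8 * (reg + 1)) it follows that
 * reg = clock / (8 * baud) - 1, using integer division.
 * Returns 0 if the inputs cannot produce a valid divisor. */
uint32_t mini_uart_baud_reg(uint32_t core_clock_hz, uint32_t baud) {
    if (baud == 0 || core_clock_hz < 8u * baud) {
        return 0;
    }
    return core_clock_hz / (8u * baud) - 1u;
}
```

With a 250 MHz core clock and 115200 baud this yields 270. The result would be written to the mini UART's baud register, which I believe is at 0x3F215068 on the BCM2837, though that address is not given in this chapter and should be confirmed against the peripheral documentation.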
Key Concepts
- Kernel entry is typically in supervisor mode on 32-bit ARM and at EL2 or EL1 on AArch64, depending on the firmware
- UART is memory-mapped, not port-based
- MMU uses different page table format than x86
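- On classic 32-bit ARM cores, exception vectors live at 0x00000000 or 0xFFFF0000 (high vectors); cores with the Security Extensions can relocate them through VBAR
- VIC (Vectored Interrupt Controller) manages interrupts on the Versatile PB; most newer platforms use a GIC instead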
- Device tree describes platform hardware
- AArch64 uses 4-level page tables similar to x64
- No BIOS - bootloader responsibilities differ
Common Mistakes
- Wrong base addresses - Each platform has different peripheral addresses
- Endianness confusion - ARM can be little or big endian
- Cache and TLB maintenance - Not invalidating caches and TLBs before enabling the MMU and caches
- Alignment - Unaligned accesses can fault, especially on older cores and on device memory
- Missing memory barriers - ARM has relaxed memory model
- Wrong exception return - Use `subs pc, lr, #4` for IRQ
- Forgetting device tree - Real hardware needs proper device enumeration
Debugging Tips
- Use UART early - First thing to get working
- QEMU is your friend - Test before real hardware
- GDB multiarch - Use `gdb-multiarch` for ARM
- Check alignment - ARM faults on unaligned access
- Memory barriers - Use `dmb`, `dsb`, `isb` appropriately
- Read manuals - ARM Architecture Reference Manual is essential
- Start with QEMU - Versatile PB is well-supported
Mini Exercises
- Create a basic ARM kernel that prints to UART
- Implement simple printf for UART
- Set up MMU with identity mapping
- Create exception handlers for all vectors
- Initialize timer interrupt
- Implement basic keyboard/UART input
- Parse device tree to find UART address
- Port kernel to Raspberry Pi
- Implement AArch64 boot code
- Add multi-core support (boot secondary cores)
Review Questions
- How does the ARM boot process differ from x86?
- What is a device tree and why is it used?
- How do you enable the MMU on ARMv7?
- What are the ARM exception vectors?
- How does UART differ between ARM and x86?
Reference Checklist
By the end of this chapter, you should be able to:
- Set up ARM cross-compilation toolchain
- Write ARM boot assembly code
- Initialize UART for serial output
- Set up ARM MMU (ARMv7)
- Handle ARM exceptions and interrupts
- Initialize interrupt controller (VIC)
- Set up timer interrupts
- Understand device trees
- Port kernel between ARM platforms
- Use QEMU for ARM kernel testing
Next Steps
With both x86/x64 and ARM kernel experience, the next chapter explores Unix, Linux, and shell scripting. You'll learn Linux system programming, shell scripting for automation, and how to interact with the Linux kernel from user space.
Key Takeaway: ARM kernel development differs from x86 in boot process, memory management, and peripheral access. Understanding these differences and using device trees enables you to write kernels for the vast ARM ecosystem.