Chapter 10: Booting A Guest Kernel

The previous chapter left the vCPU in reset state: halted, expecting firmware to run first. In a conventional VM that firmware is SeaBIOS or OVMF. It runs POST, scans PCI, builds an E820 memory map, and loads GRUB, which reads a config file and loads a kernel image, which decompresses itself and eventually calls start_kernel. The whole sequence takes hundreds of milliseconds and adds two software layers that the guest can never directly observe or control. For a microVM that is designed to start in under 125 ms and whose kernel image is known at build time, firmware is pure overhead. This chapter is about eliminating it entirely.

The question the chapter answers is precise: given a KVM file descriptor, a guest physical address space, and a kernel image on disk, what is the exact machine state the VMM must establish before issuing KVM_RUN so that the kernel's first instruction executes correctly with no BIOS, bootloader, or decompressor?

The answer splits into two branches depending on the kernel image format and build configuration. Both branches end at the same place — the kernel's start_kernel in init/main.c — but they take different paths through the x86_64 architecture's mode hierarchy to get there. Understanding them requires knowing what the kernel expects the CPU and memory to look like at its entry point, which means reading the boot protocol directly.

The Two Boot Paths

Before the Linux boot protocol existed, every bootloader assumed a different entry state and every kernel shipped a compatibility shim that tried to detect which one it was running under. The protocol, formalized starting with version 0x0200 in Linux 1.3.73, replaced that chaos with a contract: the VMM or bootloader fills in a structured record in guest memory, places its guest-physical address in a well-known register, and jumps to a well-known offset in the kernel image. The kernel reads the record at startup and learns everything it cannot detect directly — where its initrd lives, what the command line says, what physical memory ranges are available.

There is a second contract, newer and narrower in scope: the PVH boot ABI, which originated in the Xen project and entered mainline Linux in version 5.0. PVH makes a different tradeoff. It sacrifices the 64-bit paging setup that the Linux protocol requires the loader to perform, in exchange for a simpler CPU entry state — 32-bit protected mode with paging disabled — and substitutes a smaller data structure, hvm_start_info, for the Linux boot_params. The kernel's PVH entry point then transitions itself to 64-bit long mode. Any x86_64 microVM VMM has to know both paths, because which one applies depends on which ELF notes the kernel image contains.

Firecracker on x86_64 implements both. It inspects the loaded ELF kernel for a XEN_ELFNOTE_PHYS32_ENTRY note (type 18, name "Xen"); if the note is present, Firecracker uses PVH as the preferred path. If the note is absent, it falls back to the 64-bit Linux protocol. The choice is made per image at every boot. The rest of this chapter covers the Linux protocol in depth first, then the PVH path as a variation, because most existing documentation — and most existing kernels — leads with the Linux protocol.

vmlinux vs. bzImage

The image format determines how the VMM extracts both the entry point and the setup data.

vmlinux is the raw ELF produced by the kernel build. It is statically linked, 64-bit, uncompressed, and carries the full ELF header that any standard ELF loader can parse. The VMM loads its PT_LOAD segments directly into guest RAM and reads the ELF e_entry field to find startup_64, the first kernel instruction, in arch/x86/kernel/head_64.S. No decompression step runs, because there is nothing to decompress.

bzImage is a different animal. It is a self-extracting archive: the first (setup_sects + 1) × 512 bytes are a real-mode setup blob in 16-bit code, and the bytes after that are the protected-mode kernel compressed according to whatever CONFIG_KERNEL_* option was selected, such as gzip, LZMA, or zstd. A bootloader that ingests a bzImage has to extract the setup_header from file offset 0x01F1, potentially run the real-mode stub to set up the environment, and let the embedded decompressor unpack the kernel before any kernel code runs. Stefano Garzarella (Red Hat) measured approximately 78 ms for the compressed bzImage path versus approximately 10 ms for PVH entry via vmlinux, both in QEMU 4.0: a 7.8x difference, primarily attributable to the in-guest decompression and real-mode phases.
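The split between the setup blob and the compressed payload can be computed from a single byte. The following is a minimal sketch, assuming the image is already in memory as a byte slice; the function name protected_mode_offset is hypothetical and not part of linux-loader. It also applies the boot protocol's backward-compatibility rule that a stored setup_sects of 0 means 4.

```rust
/// Byte offset of the `setup_sects` field, the first byte of the setup header.
const SETUP_SECTS_OFFSET: usize = 0x01F1;
/// Setup sectors are counted in 512-byte units.
const SECTOR_SIZE: usize = 512;

/// Returns the file offset at which the protected-mode payload begins,
/// or `None` if the image is too short to contain `setup_sects`.
/// (Hypothetical helper for illustration only.)
pub fn protected_mode_offset(image: &[u8]) -> Option<usize> {
    let raw = *image.get(SETUP_SECTS_OFFSET)?;
    // Per the boot protocol, a stored value of 0 means 4 setup sectors.
    let setup_sects = if raw == 0 { 4 } else { usize::from(raw) };
    // One extra sector accounts for the legacy boot sector itself.
    Some((setup_sects + 1) * SECTOR_SIZE)
}
```

A loader that supports bzImage would copy everything from that offset onward to the protected-mode load address; a vmlinux-only VMM never needs the calculation.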

Firecracker requires an uncompressed ELF vmlinux on x86_64. It does not support bzImage. Build the guest kernel with make vmlinux. On aarch64, Firecracker instead requires the PE-format Image file produced by make Image.

The rust-vmm linux-loader crate, which Firecracker uses internally, supports three formats: raw ELF on x86_64 (for both the Linux protocol and PVH), bzImage on x86_64, and PE Image on aarch64/riscv64. For an ELF kernel, linux-loader returns only the kernel entry address. For bzImage it additionally extracts the setup_header from file offset 0x01F1 and returns it alongside the entry address. In both cases the VMM is responsible for constructing struct boot_params from that header and from its own knowledge of the memory layout.

The Setup Header and the Zero Page

The boot protocol's central data structure is struct boot_params, a 4096-byte, packed C struct that the Linux kernel defines in arch/x86/include/uapi/asm/bootparam.h. The kernel documentation calls the page it occupies the zero page, because early boot code expects to find it at the start of the setup segment — historically at physical address 0. In a direct-boot microVM, the VMM places it at whatever guest-physical address it chooses, as long as it communicates that address to the kernel via RSI at entry.

Embedded within boot_params at offset 0x01F1 is struct setup_header, roughly 144 bytes of protocol fields. For a bzImage, the setup header is copied verbatim from the image file at the same byte offset. For an ELF vmlinux, the VMM fills in a minimal synthetic header, because the fields the kernel cares about at direct-boot time are only those the VMM itself controls — loader type, command-line address, initrd address, and memory alignment.

The VMM identifies a valid kernel image by checking two magic numbers that arch/x86/boot/header.S writes at link time:

boot_flag: .word 0xAA55
header:    .ascii "HdrS"        # 0x5372_6448 little-endian

boot_flag at image offset 0x01FE must equal 0xAA55 — a repurposed MBR signature. header at offset 0x0202 must equal 0x5372_6448 (ASCII HdrS). The rust-vmm bzImage loader rejects any image where boot_header.header != 0x5372_6448. The protocol version is a two-byte little-endian value at offset 0x0206: (major << 8) | minor. Current kernels report 0x020F (protocol 2.15), introduced in Linux 5.5.
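As an illustration of those three checks, here is a small sketch that probes a byte slice for the two magic values and decodes the version word. The type ProtocolVersion and the function probe_protocol are hypothetical names chosen for this example; only the offsets and constants come from the text above.

```rust
/// Decoded boot protocol version, e.g. 2.15 for 0x020F.
#[derive(Debug, PartialEq, Eq)]
pub struct ProtocolVersion {
    pub major: u8,
    pub minor: u8,
}

fn read_u16(buf: &[u8], off: usize) -> Option<u16> {
    let b = buf.get(off..off + 2)?;
    Some(u16::from_le_bytes([b[0], b[1]]))
}

fn read_u32(buf: &[u8], off: usize) -> Option<u32> {
    let b = buf.get(off..off + 4)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

/// Returns the protocol version if both magic numbers match, else `None`.
/// (Hypothetical helper; not the linux-loader API.)
pub fn probe_protocol(image: &[u8]) -> Option<ProtocolVersion> {
    // boot_flag at 0x01FE must be 0xAA55.
    if read_u16(image, 0x01FE)? != 0xAA55 {
        return None;
    }
    // header at 0x0202 must be "HdrS", i.e. 0x5372_6448 little-endian.
    if read_u32(image, 0x0202)? != 0x5372_6448 {
        return None;
    }
    // version at 0x0206 is (major << 8) | minor.
    let v = read_u16(image, 0x0206)?;
    Some(ProtocolVersion { major: (v >> 8) as u8, minor: (v & 0xFF) as u8 })
}
```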

Protocol versions matter because they gate individual fields. A VMM that writes cmd_line_ptr assumes at least protocol 2.02 (Linux 2.4.0-test3-pre3). The cmdline_size field at 0x0238 requires protocol 2.06 (Linux 2.6.22). The setup_data linked-list pointer at 0x0250 — used to pass ACPI tables, DTB blobs, and EFI memory maps without extending boot_params itself — requires protocol 2.09 (Linux 2.6.26). The xloadflags field at 0x0236, whose bit 0 (XLF_KERNEL_64) indicates a valid 64-bit entry point at load_addr + 0x200, requires protocol 2.12 (Linux 3.8). Any modern kernel the microVM stack would care about supports at least 2.12; protocol 2.15 kernels are the current baseline.
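The gating rule reduces to a comparison against a per-field minimum. The sketch below encodes only the four thresholds named in the paragraph above; the enum Field and both function names are hypothetical, introduced for illustration.

```rust
/// The protocol-gated fields discussed in the text (illustrative subset).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Field {
    CmdLinePtr,
    CmdlineSize,
    SetupData,
    Xloadflags,
}

/// Minimum protocol version, encoded as (major << 8) | minor.
pub fn min_protocol(field: Field) -> u16 {
    match field {
        Field::CmdLinePtr => 0x0202,  // protocol 2.02
        Field::CmdlineSize => 0x0206, // protocol 2.06
        Field::SetupData => 0x0209,   // protocol 2.09
        Field::Xloadflags => 0x020C,  // protocol 2.12
    }
}

/// True if a kernel reporting `version` defines `field`.
pub fn field_available(version: u16, field: Field) -> bool {
    version >= min_protocol(field)
}
```

Because the version word is ordered as major then minor, a plain integer comparison is enough; no separate major/minor logic is needed.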

The fields a VMM must fill in setup_header for a direct 64-bit boot are a small subset of the full spec:

| Offset | Field | Required value | Meaning |
| --- | --- | --- | --- |
| 0x0210 | type_of_loader | 0xFF | No registered bootloader ID |
| 0x01FE | boot_flag | 0xAA55 | Sanity sentinel |
| 0x0202 | header | 0x5372_6448 | Protocol magic |
| 0x0228 | cmd_line_ptr | 32-bit GPA | Guest-physical address of command line |
| 0x0238 | cmdline_size | byte count | Length of command line excluding null |
| 0x0230 | kernel_alignment | 0x0100_0000 | 16 MiB alignment (Firecracker value) |
| 0x0218 | ramdisk_image | 32-bit GPA | Initrd start; zero if none |
| 0x021C | ramdisk_size | byte count | Initrd size in bytes; zero if none |

Firecracker sets type_of_loader to 0xFF, the catch-all value KERNEL_LOADER_OTHER defined by the boot protocol for VMMs and loaders without registered IDs.
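To make the offsets concrete, the following sketch writes those eight fields into a raw 4096-byte page. It is an illustration under stated assumptions, not Firecracker's code: Firecracker works through a typed boot_params struct, whereas the hypothetical fill_setup_header below pokes little-endian bytes at the documented offsets.

```rust
pub const ZERO_PAGE_SIZE: usize = 4096;

/// Copies `bytes` into the page at byte offset `off`.
fn put(page: &mut [u8; ZERO_PAGE_SIZE], off: usize, bytes: &[u8]) {
    page[off..off + bytes.len()].copy_from_slice(bytes);
}

/// Fills the setup_header fields a direct 64-bit boot needs.
/// `initrd` is `(guest_physical_address, size_in_bytes)` when present.
/// (Hypothetical helper for illustration only.)
pub fn fill_setup_header(
    page: &mut [u8; ZERO_PAGE_SIZE],
    cmdline_gpa: u32,
    cmdline_len: u32,
    initrd: Option<(u32, u32)>,
) {
    put(page, 0x01FE, &0xAA55u16.to_le_bytes()); // boot_flag
    put(page, 0x0202, &0x5372_6448u32.to_le_bytes()); // header = "HdrS"
    page[0x0210] = 0xFF; // type_of_loader: no registered ID
    let (rd_addr, rd_size) = initrd.unwrap_or((0, 0));
    put(page, 0x0218, &rd_addr.to_le_bytes()); // ramdisk_image
    put(page, 0x021C, &rd_size.to_le_bytes()); // ramdisk_size
    put(page, 0x0228, &cmdline_gpa.to_le_bytes()); // cmd_line_ptr
    put(page, 0x0230, &0x0100_0000u32.to_le_bytes()); // kernel_alignment
    put(page, 0x0238, &cmdline_len.to_le_bytes()); // cmdline_size
}
```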

Populating the Zero Page

The full boot_params struct is 4096 bytes. Most of it is zeroed. Beyond the setup_header section, two fields carry information the kernel cannot derive independently: the ACPI RSDP address and the physical memory map.

Firecracker places boot_params at guest-physical address 0x7000 (ZERO_PAGE_START). It fills acpi_rsdp_addr at boot_params[0x070] with the address 0x000E_0000, where the ACPI tables were built. The memory map lives at boot_params[0x2D0] as an array of up to 128 boot_e820_entry structs — each 20 bytes: a u64 start address, a u64 size, and a u32 type. The count of valid entries goes at boot_params[0x1E8] as a u8. Type 1 is E820_RAM (usable); type 2 is E820_RESERVED.

Firecracker's configure_64bit_boot() in src/vmm/src/arch/x86_64/mod.rs builds the table with four classes of entry:

  1. [0x0000_0000, 0x9FC00) → type 1: usable RAM below the Extended BIOS Data Area.
  2. [0x9FC00, 0x9FC00 + 0x40400) → type 2: reserved region covering the EBDA, the MP table, and the ACPI area.
  3. [PCI_MMCONFIG_START, PCI_MMCONFIG_START + 256 MiB) → type 2: the PCIe ECAM window.
  4. [max(0x10_0000, region.start), region.end) → type 1 per DRAM region: usable RAM from 1 MiB upward.

That table is what the kernel reads during early boot to construct its own memory model. It replaces the E820 query that a real BIOS would answer.
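The four classes of entry can be sketched as a small builder. This is an illustration, not Firecracker's configure_64bit_boot(): the names E820Entry and build_e820 are hypothetical, the DRAM regions are passed as (start, end) pairs, and the PCIe window base is a parameter because its numeric value is not given here.

```rust
pub const E820_RAM: u32 = 1;
pub const E820_RESERVED: u32 = 2;

/// One memory-map entry: start address, size, and type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct E820Entry {
    pub addr: u64,
    pub size: u64,
    pub kind: u32,
}

impl E820Entry {
    /// Serializes to the 20-byte packed layout used in boot_params.
    pub fn to_bytes(self) -> [u8; 20] {
        let mut out = [0u8; 20];
        out[0..8].copy_from_slice(&self.addr.to_le_bytes());
        out[8..16].copy_from_slice(&self.size.to_le_bytes());
        out[16..20].copy_from_slice(&self.kind.to_le_bytes());
        out
    }
}

/// Builds the table from half-open DRAM regions `(start, end)`.
/// `mmconfig_start` is supplied by the caller (assumption: any value
/// used in an example is a placeholder, not Firecracker's constant).
pub fn build_e820(mmconfig_start: u64, dram: &[(u64, u64)]) -> Vec<E820Entry> {
    const EBDA_START: u64 = 0x9FC00;
    const SYSTEM_MEM_SIZE: u64 = 0x40400;
    const HIMEM_START: u64 = 0x10_0000;
    const MMCONFIG_SIZE: u64 = 256 << 20; // 256 MiB

    let mut table = vec![
        E820Entry { addr: 0, size: EBDA_START, kind: E820_RAM },
        E820Entry { addr: EBDA_START, size: SYSTEM_MEM_SIZE, kind: E820_RESERVED },
        E820Entry { addr: mmconfig_start, size: MMCONFIG_SIZE, kind: E820_RESERVED },
    ];
    for &(start, end) in dram {
        // Usable RAM is only reported from 1 MiB upward.
        let lo = start.max(HIMEM_START);
        if end > lo {
            table.push(E820Entry { addr: lo, size: end - lo, kind: E820_RAM });
        }
    }
    table
}
```

For a single 128 MiB region starting at 0, the builder yields four entries; the count goes into e820_entries and each entry's to_bytes() output is written consecutively from boot_params[0x2D0].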

The complete boot_params assembly that Firecracker writes looks like this:

params.acpi_rsdp_addr       = 0x000E_0000
params.hdr.type_of_loader   = 0xFF
params.hdr.boot_flag        = 0xAA55
params.hdr.header           = 0x5372_6448
params.hdr.cmd_line_ptr     = 0x0002_0000   // CMDLINE_START
params.hdr.cmdline_size     = <command-line length, excluding the null terminator>
params.hdr.kernel_alignment = 0x0100_0000   // 16 MiB
params.hdr.ramdisk_image    = <initrd GPA>  // if initrd present
params.hdr.ramdisk_size     = <initrd size> // if initrd present
params.e820_entries         = <number of valid entries>
params.e820_table[..]       = <the entries listed above>

Once written, this page is the only document the kernel has to understand its environment. Everything that BIOS POST and GRUB would have discovered or constructed over hundreds of milliseconds is expressed in 4096 bytes assembled by the VMM in microseconds.

Placing the Kernel, Initrd, and Command Line

The address constants for x86_64 in Firecracker are defined in src/vmm/src/arch/x86_64/layout.rs. A tour of the low guest-physical address space shows how tightly packed the boot data is:

flowchart TB
    A["0x0500: GDT (32 bytes)"]
    B["0x0520: IDT (8 bytes, single null entry)"]
    C["0x6000: hvm_start_info (PVH path)"]
    D["0x7000: struct boot_params / zero page"]
    E["0x8FF0: boot stack (RSP, RBP)"]
    F["0x9000: PML4 page table"]
    G["0xA000: PDPTE"]
    H["0xB000: PDE (512 entries, 2 MiB pages)"]
    J["0x2_0000: kernel command line"]
    I["0x9FC00: system reserved region begins"]
    K["0x10_0000: kernel ELF load base (HIMEM_START)"]
    L["top of DRAM: initrd (page-aligned down)"]
    A --> B --> C --> D --> E --> F --> G --> H --> J --> I
    K -.-> D
    L -.-> D

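Kernel segments go at or above HIMEM_START = 0x0010_0000 (1 MiB), the floor of the load region. linux-loader copies each ELF PT_LOAD segment to the physical address its program header requests, so the actual base comes from the image rather than from the VMM. The kernel_alignment of 0x0100_0000 (16 MiB) that the header declares is the boundary a relocatable kernel must sit on; a kernel linked at the default CONFIG_PHYSICAL_START of 0x0100_0000 already meets it.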

The command line is a null-terminated C string at CMDLINE_START = 0x0002_0000 (128 KiB), with a maximum length of 2048 bytes including the null terminator. setup_header.cmd_line_ptr receives 0x0002_0000 as a 32-bit guest-physical address. setup_header.cmdline_size receives the length of the specific command line string the VMM is providing, excluding the null terminator. This is distinct from the cmdline_size field in the kernel image's own setup header (protocol 2.06+), which declares the maximum command line length the kernel will accept; the VMM writes the actual string length, bounded by that capacity.
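The length rules are easy to get off by one, so a short sketch may help. It assumes the 2048-byte limit counts the terminator, as stated above; encode_cmdline is a hypothetical helper, not a linux-loader function.

```rust
/// Maximum size of the command-line buffer, including the null terminator.
pub const CMDLINE_MAX_SIZE: usize = 2048;

/// Returns the bytes to write at CMDLINE_START (null-terminated) and the
/// value for setup_header.cmdline_size (length without the terminator).
/// Rejects strings that contain an interior null or exceed the buffer.
pub fn encode_cmdline(cmdline: &str) -> Option<(Vec<u8>, u32)> {
    let bytes = cmdline.as_bytes();
    if bytes.contains(&0) || bytes.len() + 1 > CMDLINE_MAX_SIZE {
        return None;
    }
    let mut buf = Vec::with_capacity(bytes.len() + 1);
    buf.extend_from_slice(bytes);
    buf.push(0); // terminator is written to guest memory but not counted
    Some((buf, bytes.len() as u32))
}
```

For the string console=ttyS0 the helper produces a 14-byte buffer and a cmdline_size of 13.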

The initrd, if present, is placed at the top of the first DRAM region, page-aligned downward: align_down(lowmem_end - initrd_size, PAGE_SIZE). This keeps the initrd as high as possible to avoid colliding with the kernel's own data. setup_header.ramdisk_image receives the resulting 32-bit guest-physical address; setup_header.ramdisk_size receives the byte count. The protocol defines initrd_addr_max (offset 0x022C, added in protocol 2.03) as the highest address the initrd's last byte may occupy; the default in arch/x86/boot/header.S is 0x7FFF_FFFF (just below 2 GiB). The kernel publishes this field for the loader to read, so a VMM that parses a real setup header should keep the initrd at or below that limit.
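The placement formula is a one-liner, shown here as a sketch with a worked value. The function names are illustrative, and the 4 KiB page size is an assumption that holds for x86_64 guests.

```rust
/// Guest page size assumed for alignment (4 KiB on x86_64).
pub const PAGE_SIZE: u64 = 4096;

/// Rounds `addr` down to a multiple of `align` (a power of two).
pub fn align_down(addr: u64, align: u64) -> u64 {
    addr & !(align - 1)
}

/// Returns the initrd load address, or `None` if the initrd does not fit
/// below `lowmem_end`.
pub fn initrd_load_addr(lowmem_end: u64, initrd_size: u64) -> Option<u64> {
    let unaligned = lowmem_end.checked_sub(initrd_size)?;
    Some(align_down(unaligned, PAGE_SIZE))
}
```

With 128 MiB of low memory (lowmem_end = 0x0800_0000) and an initrd of 0x12_3456 bytes, the unaligned start is 0x07ED_CBAA and the page-aligned load address is 0x07ED_C000.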

KVM_TSS_ADDRESS = 0xFFFB_D000 is a special case. Before any vCPU can run, the VMM must issue KVM_SET_TSS_ADDR with this guest-physical address on Intel VMX hosts. KVM uses a hidden 3-page TSS region at this address to support its internal real-mode emulation, even though Firecracker never enters real mode. The companion call KVM_SET_IDENTITY_MAP_ADDR places a one-page identity-map page adjacent to the TSS. Both are VM-level ioctls. KVM_SET_IDENTITY_MAP_ADDR must be called before any KVM_CREATE_VCPU; the KVM API docs explicitly state it fails if a vCPU already exists. KVM_SET_TSS_ADDR has no documented ordering constraint relative to vCPU creation.

Privileges required. Opening /dev/kvm typically requires membership in the kvm group (or root). Both KVM_SET_TSS_ADDR and KVM_SET_IDENTITY_MAP_ADDR reserve ranges in the guest-physical address space. Running them on a production host requires the same permissions as any other KVM operation.

CPU State at the 64-bit Entry Point

The Linux 64-bit boot protocol specifies an entry state that is already in long mode with paging active. This is the fundamental difference from a traditional bootloader, which enters the kernel in 32-bit protected mode and leaves mode switching to the kernel's own startup code. The 64-bit direct-boot protocol hands off in the mode the kernel will run in, making the kernel's startup_64 path a simpler target.

The VMM programs this state with three ioctls on the vCPU file descriptor: KVM_SET_REGS for general-purpose registers, KVM_SET_SREGS for segment registers and control registers, and KVM_SET_FPU for floating-point state. All three must be issued before the first KVM_RUN on the vCPU.

General-Purpose Registers

flowchart LR
    VMM["VMM: KVM_SET_REGS"]
    RIP["RIP = ELF e_entry\n(startup_64)"]
    RSI["RSI = 0x7000\n(boot_params GPA)"]
    RSP["RSP = 0x8FF0"]
    RBP["RBP = 0x8FF0"]
    RFLAGS["RFLAGS = 0x0002\n(IF=0, bit 1 set)"]
    REST["all other GPRs = 0"]
    VMM --> RIP & RSI & RSP & RBP & RFLAGS & REST

RIP is the ELF e_entry field, which resolves to the startup_64 symbol in arch/x86/kernel/head_64.S. This value is not a fixed numeric address — it depends on kernel configuration and, when KASLR is active, on randomization. The VMM reads it from the ELF header; it does not hardcode it.

RSI is the ABI contract: it must contain the guest-physical address of struct boot_params, 0x7000 in Firecracker's layout. startup_64 reads RSI as its first act. Any other value is undefined behavior from the kernel's point of view.

RFLAGS is 0x0000_0000_0000_0002. Bit 1 is always 1 by architectural definition; IF (bit 9) is 0, leaving interrupts disabled until the kernel enables them during start_kernel. All other GPRs are zero.
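Collected in one place, the general-purpose state looks like the sketch below. BootRegs is a hypothetical stand-in for the subset of struct kvm_regs that matters here; a real VMM would fill the KVM structure and pass it to KVM_SET_REGS.

```rust
/// Subset of the general-purpose state passed to KVM_SET_REGS.
/// Fields not listed here are assumed to be zero.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct BootRegs {
    pub rip: u64,
    pub rsi: u64,
    pub rsp: u64,
    pub rbp: u64,
    pub rbx: u64,
    pub rflags: u64,
}

pub const ZERO_PAGE_START: u64 = 0x7000;
pub const BOOT_STACK_POINTER: u64 = 0x8FF0;
/// RFLAGS bit 1 is reserved and always reads as 1.
const RFLAGS_RESERVED_BIT1: u64 = 1 << 1;
/// Interrupt-enable flag; left clear at entry.
pub const RFLAGS_IF: u64 = 1 << 9;

/// Builds the entry state for the 64-bit Linux protocol.
/// `elf_entry` is the e_entry value read from the kernel's ELF header.
pub fn linux64_boot_regs(elf_entry: u64) -> BootRegs {
    BootRegs {
        rip: elf_entry,
        rsi: ZERO_PAGE_START,
        rsp: BOOT_STACK_POINTER,
        rbp: BOOT_STACK_POINTER,
        rflags: RFLAGS_RESERVED_BIT1,
        ..BootRegs::default()
    }
}
```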

Control Registers and Segment State

The VMM programs control registers and segment descriptors via KVM_SET_SREGS. The required values are:

| Register | Value | Meaning |
| --- | --- | --- |
| CR0 | PE \| ET \| PG | Protected mode, extension type, paging enabled |
| CR3 | 0x9000 | Guest-physical address of the boot PML4 |
| CR4 | existing \| PAE (0x20) | Physical Address Extension for 4-level paging |
| EFER | existing \| LME (0x100) \| LMA (0x400) | Long Mode Enable and Long Mode Active |

PE is bit 0 of CR0 (0x1), ET is bit 4 (0x10), and PG is bit 31 (0x8000_0000). PG depends on PE: setting PG while PE is clear raises a general-protection fault. PAE is CR4 bit 5 (0x20); 64-bit paging requires it. LME (bit 8) and LMA (bit 10) together in EFER signal that long mode is enabled and active; the L bit in the CS descriptor then selects 64-bit rather than compatibility mode. KVM validates these relationships and will refuse to enter the guest if they are inconsistent.
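The bit arithmetic and the dependencies between the registers can be written down directly. The sketch below uses a hypothetical ControlRegs struct rather than struct kvm_sregs, and is_consistent covers only the architectural rules mentioned here (PG needs PE, LMA equals LME and PG, long mode needs PAE); it is not a reproduction of KVM's full validation.

```rust
pub const CR0_PE: u64 = 1 << 0;
pub const CR0_ET: u64 = 1 << 4;
pub const CR0_PG: u64 = 1 << 31;
pub const CR4_PAE: u64 = 1 << 5;
pub const EFER_LME: u64 = 1 << 8;
pub const EFER_LMA: u64 = 1 << 10;

/// Control-register subset relevant to long-mode entry.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ControlRegs {
    pub cr0: u64,
    pub cr3: u64,
    pub cr4: u64,
    pub efer: u64,
}

/// ORs the long-mode bits into whatever state already exists.
pub fn long_mode_state(existing: ControlRegs, pml4_gpa: u64) -> ControlRegs {
    ControlRegs {
        cr0: existing.cr0 | CR0_PE | CR0_ET | CR0_PG,
        cr3: pml4_gpa,
        cr4: existing.cr4 | CR4_PAE,
        efer: existing.efer | EFER_LME | EFER_LMA,
    }
}

/// Checks a subset of the architectural consistency rules.
pub fn is_consistent(c: &ControlRegs) -> bool {
    let pe = (c.cr0 & CR0_PE) != 0;
    let pg = (c.cr0 & CR0_PG) != 0;
    let pae = (c.cr4 & CR4_PAE) != 0;
    let lme = (c.efer & EFER_LME) != 0;
    let lma = (c.efer & EFER_LMA) != 0;
    if pg && !pe {
        return false; // paging requires protected mode
    }
    if lma != (lme && pg) {
        return false; // LMA mirrors LME with paging enabled
    }
    if lma && !pae {
        return false; // long mode requires PAE paging
    }
    true
}
```

Starting from all-zero registers, long_mode_state yields CR0 = 0x8000_0011, CR4 = 0x20, and EFER = 0x500.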

The GDT lives at guest-physical address 0x500, with a 32-byte limit. Firecracker writes four entries in src/vmm/src/arch/x86_64/gdt.rs:

| Index | Selector | Flags | Purpose |
| --- | --- | --- | --- |
| 0 | 0x0000 | | NULL descriptor (mandatory first entry) |
| 1 | 0x08 | 0xA09B | 64-bit code: L=1, G=1, P=1, DPL=0, type=0xB |
| 2 | 0x10 | 0xC093 | Data: G=1, DB=1, P=1, DPL=0, type=0x3 |
| 3 | 0x18 | 0x808B | TSS: G=1, P=1, type=0xB (busy TSS) |

sregs.cs uses selector 0x08 (index 1, 64-bit code). sregs.ds, es, fs, gs, and ss all use selector 0x10 (index 2, data). sregs.tr uses selector 0x18 (index 3, TSS). The IDT at 0x520 is a single null 8-byte entry with sregs.idt.limit = 7; the kernel will install its own interrupt handlers early in start_kernel.
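The Flags column is a compact form of the descriptor's access byte and granularity nibble. A sketch of how such a value, a base, and a 20-bit limit pack into the 8-byte descriptor format follows; it is written from the x86 descriptor layout, and its resemblance to any helper in gdt.rs is an assumption rather than a quotation.

```rust
/// Packs a segment descriptor.
/// `flags` bits 0..8 are the access byte (type, S, DPL, P) and
/// bits 12..16 are AVL, L, D/B, G. `limit` is 20 bits wide.
pub fn gdt_entry(flags: u16, base: u32, limit: u32) -> u64 {
    ((u64::from(base) & 0xFF00_0000) << 32)      // base[31:24] -> bits 56..64
        | ((u64::from(flags) & 0x0000_F0FF) << 40) // access + flags nibble
        | ((u64::from(limit) & 0x000F_0000) << 32) // limit[19:16] -> bits 48..52
        | ((u64::from(base) & 0x00FF_FFFF) << 16)  // base[23:0] -> bits 16..40
        | (u64::from(limit) & 0x0000_FFFF)         // limit[15:0] -> bits 0..16
}

/// Selector for a GDT index with RPL 0 and TI 0.
pub fn selector(index: u16) -> u16 {
    index << 3
}
```

Packing 0xA09B with base 0 and limit 0xF_FFFF gives 0x00AF_9B00_0000_FFFF, the familiar flat 64-bit code descriptor, and selector(1) gives the 0x08 loaded into CS.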

The Linux boot protocol documentation names these selectors __BOOT_CS (0x10) and __BOOT_DS (0x18), referring to positions 2 and 3 in the protocol's own GDT layout. Firecracker uses positions 1 and 2 (selectors 0x08 and 0x10) instead — a harmless divergence, because KVM validates the descriptor attributes (64-bit code segment, flat data segment) rather than the selector values themselves.

Boot-Time Page Tables

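The boot page tables at PML4_START = 0x9000 create a minimal identity map covering guest-virtual [0, 1 GiB) using 2 MiB pages. Three levels suffice: the first PML4 entry at 0x9000 points to the PDPTE page at 0xA000, whose first entry points to the PDE page at 0xB000, whose 512 entries each map one 2 MiB page.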
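The three-level layout can be sketched as follows. This is an illustration, not setup_page_tables() itself: it builds the three tables as in-memory arrays (a VMM would write them to guest memory at 0x9000, 0xA000, and 0xB000), and the translate helper is a simplified walk that relies on there being exactly one table per level.

```rust
pub const PML4_START: u64 = 0x9000;
pub const PDPTE_START: u64 = 0xA000;
pub const PDE_START: u64 = 0xB000;

const PRESENT_RW: u64 = 0x03; // present + writable
const PAGE_SIZE_2M: u64 = 0x80; // PS bit: entry maps a 2 MiB page

/// The three boot tables, one 4 KiB page (512 entries) each.
pub struct BootPageTables {
    pub pml4: [u64; 512],
    pub pdpte: [u64; 512],
    pub pde: [u64; 512],
}

/// Builds an identity map of [0, 1 GiB) with 2 MiB pages.
pub fn build_boot_page_tables() -> BootPageTables {
    let mut pml4 = [0u64; 512];
    let mut pdpte = [0u64; 512];
    let mut pde = [0u64; 512];
    pml4[0] = PDPTE_START | PRESENT_RW;
    pdpte[0] = PDE_START | PRESENT_RW;
    for (i, entry) in pde.iter_mut().enumerate() {
        *entry = ((i as u64) << 21) | PRESENT_RW | PAGE_SIZE_2M;
    }
    BootPageTables { pml4, pdpte, pde }
}

/// Simplified walk: valid only because each level has a single table.
pub fn translate(t: &BootPageTables, va: u64) -> Option<u64> {
    let pml4e = t.pml4[((va >> 39) & 0x1FF) as usize];
    let pdpte = t.pdpte[((va >> 30) & 0x1FF) as usize];
    let pde = t.pde[((va >> 21) & 0x1FF) as usize];
    if pml4e & 1 == 0 || pdpte & 1 == 0 || pde & 1 == 0 {
        return None; // some level is not present
    }
    Some((pde & !0x1F_FFFF) + (va & 0x1F_FFFF))
}
```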

The 512 PDE entries map guest-virtual [0, 1 GiB) onto the same guest-physical range, with the last 2 MiB page starting at 1 GiB − 2 MiB. The boot protocol requires the kernel's load range and the zero page both to be identity-mapped at entry, so that startup_64 can dereference RSI without a translation fault before it builds its own permanent page tables. Because 1 GiB of coverage includes 0x7000, 0x2_0000, 0x10_0000, and the initial stack at 0x8FF0, the three-level map is sufficient for boot.

Firecracker builds these tables in setup_page_tables() in src/vmm/src/arch/x86_64/regs.rs. The kernel's first task after startup_64 validates boot_params is to build its own permanent page tables, at which point the VMM's boot tables are no longer referenced.

FPU State

KVM_SET_FPU initializes floating-point state to the architectural defaults: fcw = 0x037F (x87 control word with all exception masks set and double-extended precision, PC = 11b) and mxcsr = 0x1F80 (MXCSR with all SSE exception masks set and round-to-nearest). The kernel overwrites these during fpu__init_cpu(), but KVM requires a valid initial state before entry.
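Both constants can be decoded with a few shifts, which makes the bit fields easy to verify by hand. The helper names below are illustrative; the field positions follow the x87 control word and MXCSR layouts.

```rust
pub const FCW_DEFAULT: u16 = 0x037F;
pub const MXCSR_DEFAULT: u32 = 0x1F80;

/// x87 exception mask bits (IM, DM, ZM, OM, UM, PM), bits 0..6.
pub fn fcw_exception_masks(fcw: u16) -> u16 {
    fcw & 0x3F
}

/// x87 precision control, bits 8..10 (0b11 = double-extended).
pub fn fcw_precision_control(fcw: u16) -> u16 {
    (fcw >> 8) & 0x3
}

/// x87 rounding control, bits 10..12 (0b00 = round to nearest).
pub fn fcw_rounding_control(fcw: u16) -> u16 {
    (fcw >> 10) & 0x3
}

/// SSE exception mask bits, MXCSR bits 7..13.
pub fn mxcsr_exception_masks(mxcsr: u32) -> u32 {
    (mxcsr >> 7) & 0x3F
}

/// SSE rounding control, MXCSR bits 13..15 (0b00 = round to nearest).
pub fn mxcsr_rounding_control(mxcsr: u32) -> u32 {
    (mxcsr >> 13) & 0x3
}
```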

The Full Boot Sequence

With those pieces in place, the boot sequence from KVM_RUN to start_kernel is a straight line with no firmware detour:

sequenceDiagram
    participant VMM as VMM process
    participant KVM as KVM kernel module
    participant CPU as Guest vCPU
    participant K64 as "startup_64 (head_64.S)"
    participant SK as "start_kernel (main.c)"

    VMM->>KVM: KVM_SET_TSS_ADDR, KVM_SET_IDENTITY_MAP_ADDR
    VMM->>KVM: KVM_CREATE_VCPU
    VMM->>VMM: load ELF segments at 0x10_0000
    VMM->>VMM: write boot_params at 0x7000
    VMM->>VMM: write cmdline at 0x2_0000
    VMM->>VMM: write initrd at top of DRAM
    VMM->>VMM: write GDT/IDT at 0x500/0x520
    VMM->>VMM: write page tables at 0x9000-0xB000
    VMM->>KVM: KVM_SET_REGS (RIP=e_entry, RSI=0x7000, ...)
    VMM->>KVM: KVM_SET_SREGS (CR0, CR3, CR4, EFER, GDT, ...)
    VMM->>KVM: KVM_SET_FPU
    VMM->>KVM: KVM_RUN
    KVM->>CPU: VMLAUNCH (Intel VMX) / VMRUN (AMD SVM)
    CPU->>K64: first instruction at RIP
    Note over K64: reads RSI → boot_params<br/>validates magic numbers<br/>reads e820_table, cmd_line_ptr
    K64->>K64: builds permanent page tables
    K64->>SK: x86_64_start_kernel() → start_kernel()

No BIOS runs. No UEFI runs. No GRUB runs. No decompressor runs. The path from KVM_RUN to startup_64 is a single VM-entry, and the path from startup_64 to start_kernel is kernel code running on a machine that the VMM fully configured before handing over control.

The PVH Path

PVH — "Para-Virtualised Hardware," formally the x86/HVM direct boot ABI — offers an alternative that removes the 64-bit paging requirement from the VMM. The kernel's PVH entry point accepts a 32-bit protected-mode state with paging disabled and transitions to long mode itself. That is a simpler entry contract for the VMM, but it means the kernel has more work to do before it can access 64-bit memory.

The ABI is signaled by an ELF PT_NOTE segment in the kernel image, with note name "Xen" (four bytes with null terminator) and note type XEN_ELFNOTE_PHYS32_ENTRY = 18. The note value is the 32-bit physical address of the PVH entry point, pvh_start_xen in arch/x86/platform/pvh/head.S. Linux gained CONFIG_PVH in version 5.0; any kernel built with that option exposes this entry point to any hypervisor, not just Xen. Firecracker merged PVH support in release v1.12.0 (PR #5048).
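Detecting the note amounts to walking the standard ELF note records (namesz, descsz, type, then the padded name and descriptor). The sketch below assumes the caller has already extracted the raw bytes of a PT_NOTE segment from a little-endian image; find_pvh_entry is a hypothetical helper, not the linux-loader API. It accepts a 4-byte or 8-byte descriptor, since the stored pointer width can depend on how the kernel was built.

```rust
pub const XEN_ELFNOTE_PHYS32_ENTRY: u32 = 18;

fn le_u32(buf: &[u8], off: usize) -> Option<u32> {
    let b = buf.get(off..off.checked_add(4)?)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

/// Note names and descriptors are padded to 4-byte boundaries.
fn align4(n: usize) -> usize {
    (n + 3) & !3
}

/// Scans the raw bytes of a PT_NOTE segment for the PVH entry note.
/// Returns the entry address, or `None` if the note is absent or malformed.
pub fn find_pvh_entry(notes: &[u8]) -> Option<u64> {
    let mut off = 0usize;
    while off + 12 <= notes.len() {
        let namesz = le_u32(notes, off)? as usize;
        let descsz = le_u32(notes, off + 4)? as usize;
        let ntype = le_u32(notes, off + 8)?;
        let name_off = off + 12;
        let desc_off = name_off.checked_add(align4(namesz))?;
        let next = desc_off.checked_add(align4(descsz))?;
        let name = notes.get(name_off..name_off.checked_add(namesz)?)?;
        let desc = notes.get(desc_off..desc_off.checked_add(descsz)?)?;
        if ntype == XEN_ELFNOTE_PHYS32_ENTRY && name == &b"Xen\0"[..] {
            if descsz != 4 && descsz != 8 {
                return None; // unexpected descriptor width
            }
            let mut raw = [0u8; 8];
            raw[..descsz].copy_from_slice(desc);
            return Some(u64::from_le_bytes(raw));
        }
        off = next;
    }
    None
}
```

A VMM would call this on each PT_NOTE segment and choose the PVH path when any call returns Some, falling back to the 64-bit Linux protocol otherwise.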

The data structure at entry is hvm_start_info, placed at guest-physical address PVH_INFO_START = 0x6000 in Firecracker's layout:

#define XEN_HVM_START_MAGIC_VALUE 0x336ec578

struct hvm_start_info {
    uint32_t magic;          /* must == 0x336ec578          */
    uint32_t version;        /* 0 = v0, 1 = v1              */
    uint32_t flags;          /* SIF_xxx flags                */
    uint32_t nr_modules;     /* count of modules             */
    uint64_t modlist_paddr;  /* phys addr of hvm_modlist_entry[] */
    uint64_t cmdline_paddr;  /* phys addr of command line    */
    uint64_t rsdp_paddr;     /* phys addr of ACPI RSDP      */
    /* v1 additions: */
    uint64_t memmap_paddr;   /* phys addr of memory map      */
    uint32_t memmap_entries; /* entry count (0 = no map)     */
    uint32_t reserved;       /* must be zero                 */
};

The magic 0x336ec578 is the ASCII string "xEn3" with the high bit of 'E' set — a Xen convention that predates the ABI's hypervisor-agnostic rebranding. The initrd rides in an hvm_modlist_entry (32 bytes: paddr u64, size u64, cmdline_paddr u64, reserved u64) pointed to by modlist_paddr.
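A Rust mirror of the two structures makes the sizes and the magic's byte pattern checkable. This is a sketch using #[repr(C)] and hypothetical type names; the start_info helper and its module-list address parameter are illustrative, while the command-line and RSDP addresses in the usage note are the ones this chapter uses.

```rust
pub const XEN_HVM_START_MAGIC_VALUE: u32 = 0x336e_c578;

/// Mirror of struct hvm_start_info (version 1 layout, 56 bytes).
#[repr(C)]
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct HvmStartInfo {
    pub magic: u32,
    pub version: u32,
    pub flags: u32,
    pub nr_modules: u32,
    pub modlist_paddr: u64,
    pub cmdline_paddr: u64,
    pub rsdp_paddr: u64,
    pub memmap_paddr: u64,
    pub memmap_entries: u32,
    pub reserved: u32,
}

/// Mirror of struct hvm_modlist_entry (32 bytes); carries the initrd.
#[repr(C)]
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct HvmModlistEntry {
    pub paddr: u64,
    pub size: u64,
    pub cmdline_paddr: u64,
    pub reserved: u64,
}

/// Builds a version-1 start info. `modlist_paddr` is `Some(address)` when
/// a single module (the initrd) is described there.
pub fn start_info(cmdline_paddr: u64, rsdp_paddr: u64, modlist_paddr: Option<u64>) -> HvmStartInfo {
    HvmStartInfo {
        magic: XEN_HVM_START_MAGIC_VALUE,
        version: 1,
        nr_modules: if modlist_paddr.is_some() { 1 } else { 0 },
        modlist_paddr: modlist_paddr.unwrap_or(0),
        cmdline_paddr,
        rsdp_paddr,
        ..HvmStartInfo::default()
    }
}
```

Calling start_info(0x2_0000, 0xE_0000, None) describes a guest with a command line and ACPI tables but no initrd; the memory-map fields would be filled in separately.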

The CPU state the PVH ABI mandates differs from the Linux 64-bit protocol in two critical ways: there is no paging (CR0.PG = 0), and the pointer to the info struct goes in EBX rather than RSI. The full required state:

| Register / State | Required value |
| --- | --- |
| CPU mode | 32-bit protected mode |
| CR0 | PE (bit 0) set; PG (bit 31) cleared |
| CR4 | All bits cleared |
| CS | 32-bit read/execute, base 0, limit 0xFFFF_FFFF |
| DS, ES, SS | 32-bit read/write, base 0, limit 0xFFFF_FFFF |
| TR | 32-bit TSS, base 0, limit 0x67 |
| EFLAGS | VM (bit 17), IF (bit 9), TF (bit 8) all cleared |
| EBX | Guest-physical address of hvm_start_info |

In Firecracker's PVH path, RBX = PVH_INFO_START = 0x6000. The GDT entries use 32-bit descriptors (0xC09B for code, 0xC093 for data, limit 0xFFFF_FFFF) instead of the 64-bit flags the Linux protocol uses. CR0 has only PE set, with no PG. The kernel's PVH entry code in arch/x86/platform/pvh/head.S builds its own page tables and switches to long mode, calling into arch/x86/platform/pvh/enlighten.c to translate hvm_start_info into boot_params, at which point execution joins the same head_64.S path that the 64-bit Linux protocol takes.

The practical advantage of PVH is not primarily the simpler VMM entry state — that difference is a few dozen lines of code either way. It is that PVH is naturally extensible to non-Linux kernels. FreeBSD and NetBSD both implement the HVM ABI. Firecracker's PVH support therefore works with any kernel that carries the XEN_ELFNOTE_PHYS32_ENTRY note, regardless of whether it is Linux.

What Gets Eliminated

A conventional VM boots through at minimum: BIOS POST, PCI device enumeration, an option ROM for each device, a boot block read from a disk, a second-stage bootloader such as GRUB, a kernel image decompressed in guest memory, and then kernel initialization. Each layer was designed for a world where the boot configuration was unknown at build time and had to be discovered at runtime from the hardware. A microVM VMM knows the configuration statically. The kernel image is pinned at deployment time. The memory layout is fixed. The hardware is synthetic and fully controlled. There is nothing to discover.

Sources And Further Reading