Chapter 5: The KVM API

You have a CPU that supports VMX non-root mode and a kernel module, kvm.ko, that can put it there. The question now is how userspace talks to that module. The kernel could expose KVM through a sysfs hierarchy, a netlink socket, a purpose-built syscall, or a device file. It chose the last: a single character device at /dev/kvm whose entire surface is made of ioctl calls. That choice has consequences for how a VMM is structured — and for how you read one. The reference VMM throughout is Firecracker; the canonical teaching program is the LWN article by Josh Triplett, published September 29, 2015.

Three File Descriptor Scopes

The KVM ioctl surface divides cleanly into three scopes, each represented by a different file descriptor. Mixing them is not silently wrong — the kernel returns ENOTTY, which is the right error for "this ioctl does not apply to this file descriptor." The structure forces the VMM to build its object graph in a fixed order: a system fd produces a VM fd, which produces vCPU fds. None of the objects can be created out of sequence.

flowchart LR
    dev["/dev/kvm (system fd)"]
    vmfd["VM fd"]
    vcpufd["vCPU fd"]
    dev -- "KVM_CREATE_VM" --> vmfd
    vmfd -- "KVM_CREATE_VCPU" --> vcpufd

Every KVM ioctl number is constructed using the _IO, _IOR, or _IOW macros with the magic byte KVMIO = 0xAE. That byte is the namespace; the low byte is the command. KVM_GET_API_VERSION is _IO(0xAE, 0x00), KVM_CREATE_VM is _IO(0xAE, 0x01), KVM_RUN is _IO(0xAE, 0x80). The full reference table appears at the end of this chapter.
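As a worked example of that encoding, the sketch below rebuilds a few of the numbers by hand. It assumes the asm-generic ioctl layout used on x86 and ARM (command in bits 0-7, magic byte in bits 8-15, argument size in bits 16-29, direction in bits 30-31). The helper kvm_ioc and the DIR_* names are invented for illustration; real code should take the constants from <linux/kvm.h>.

```c
#include <assert.h>
#include <stdint.h>

/* Direction field values in the asm-generic layout (an assumption of this
 * sketch; a few architectures use different values). */
enum { DIR_NONE = 0, DIR_WRITE = 1, DIR_READ = 2 };

/* Hypothetical helper: compose a KVM ioctl number from its fields. */
static uint32_t kvm_ioc(uint32_t dir, uint32_t nr, uint32_t size)
{
    const uint32_t kvmio = 0xAE;            /* KVM's magic byte */

    assert(dir <= 3 && nr <= 0xFF && size < (1u << 14));
    return (dir << 30) | (size << 16) | (kvmio << 8) | nr;
}
```

With these definitions, kvm_ioc(DIR_NONE, 0x80, 0) evaluates to 0xAE80, the value of KVM_RUN, and a write-direction ioctl with command 0x46 and a 32-byte argument evaluates to 0x4020AE46.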

The system fd covers VM lifecycle and capability queries. The VM fd covers memory configuration, interrupt-controller setup, and event fd wiring. The vCPU fd covers execution and register access. Two threading rules matter: VM ioctls must be issued from the same process address space used to create the VM, and vCPU ioctls should be issued from the same OS thread that created the vCPU — Firecracker enforces the latter by running each vCPU in its own dedicated OS thread named fc_vcpu {index}.

The API Version Handshake

Before issuing any other ioctl, a VMM opens /dev/kvm with O_RDWR | O_CLOEXEC and checks the API version:

int kvmfd = open("/dev/kvm", O_RDWR | O_CLOEXEC);
int api_version = ioctl(kvmfd, KVM_GET_API_VERSION, NULL);

KVM_GET_API_VERSION returns the integer 12. It has returned 12 since the API was stabilized in Linux 2.6.22, and the kernel documentation states it is not expected to change. The KVM documentation says applications must refuse to run if they receive any other value. Firecracker checks this immediately at startup — kvm_fd.get_api_version() != 12 triggers an early abort — because the rest of the initialization sequence assumes version 12 semantics throughout.

Once version 12 is confirmed, all ioctls documented with the capability tag "basic" are guaranteed available without further negotiation. Capabilities beyond "basic" require explicit KVM_CHECK_EXTENSION queries.

Capability Negotiation

KVM_CHECK_EXTENSION is _IO(0xAE, 0x03). It accepts a KVM_CAP_* integer and returns 0 if the capability is absent or a positive integer if it is present — the exact positive value is capability-defined; for some it is just 1, for others it carries useful information like a count or size. A subset of capabilities can also be queried on the VM fd, if KVM_CAP_CHECK_EXTENSION_VM is supported, giving per-VM answers rather than per-system ones.

Firecracker on x86_64 checks exactly 14 capabilities at startup, referred to internally as DEFAULT_CAPABILITIES: KVM_CAP_IRQCHIP, KVM_CAP_IOEVENTFD, KVM_CAP_IRQFD, KVM_CAP_USER_MEMORY, KVM_CAP_SET_TSS_ADDR, KVM_CAP_PIT2, KVM_CAP_PIT_STATE2, KVM_CAP_ADJUST_CLOCK, KVM_CAP_DEBUGREGS, KVM_CAP_MP_STATE, KVM_CAP_VCPU_EVENTS, KVM_CAP_XCRS, KVM_CAP_XSAVE, and KVM_CAP_EXT_CPUID. It checks KVM_CAP_XSAVE2 separately at runtime to decide whether to use the dynamically-sized XSAVE buffer path. A missing required capability is a fatal error; Firecracker never falls back to working without it.

The capabilities used directly in this chapter are KVM_CAP_USER_MEMORY (value 3), which gates KVM_SET_USER_MEMORY_REGION, and KVM_CAP_NR_MEMSLOTS (value 10), which returns the maximum number of memory slots the kernel supports. Query KVM_CAP_NR_MEMSLOTS at runtime; do not hardcode a slot limit.
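A startup check along these lines can be sketched as a loop over a required list. The function names are hypothetical and the list is supplied by the caller. Treating a failed ioctl the same as an absent capability is a choice of this sketch, not something the KVM documentation mandates.

```c
#include <assert.h>
#include <linux/kvm.h>
#include <stddef.h>
#include <sys/ioctl.h>

/* Return the index of the first capability that KVM does not report,
 * or -1 if every entry in caps[] is present. */
static int first_missing_cap(int fd, const int *caps, size_t n)
{
    assert(caps != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        /* 0 means absent; an ioctl error (-1) is treated the same way. */
        if (ioctl(fd, KVM_CHECK_EXTENSION, (unsigned long)caps[i]) <= 0)
            return (int)i;
    }
    return -1;
}

/* Query the memory-slot limit at runtime instead of hardcoding it.
 * Returns -1 if the kernel does not report a positive limit. */
static int max_memslots(int fd)
{
    int n = ioctl(fd, KVM_CHECK_EXTENSION, (unsigned long)KVM_CAP_NR_MEMSLOTS);
    return n > 0 ? n : -1;
}
```

A VMM that follows the fatal-on-missing policy described above would abort as soon as first_missing_cap returns a non-negative index, and would use the index to name the missing capability in its error message.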

Creating a VM

KVM_CREATE_VM is _IO(0xAE, 0x01), issued on the system fd. It takes a machine-type argument; pass 0 for the default x86 machine type. It returns a new VM file descriptor.

int vmfd = ioctl(kvmfd, KVM_CREATE_VM, 0);

The returned VM fd has no vCPUs and no memory. Neither defaults to anything useful; both must be configured before the first KVM_RUN call. An empty VM is not an error — it is a valid but uninitialized state.

One non-obvious failure mode: on a heavily loaded machine, KVM_CREATE_VM can return -1 with errno = EINTR. This is intentional. The kernel path calls mm_take_all_locks(), which is CPU-intensive and checks for pending signals. Firecracker handles this by making up to five attempts, sleeping between them with an exponential backoff that starts at 1 µs and doubles each time (1 µs, 2 µs, 4 µs, 8 µs). A VMM that treats EINTR from KVM_CREATE_VM as a fatal error will occasionally fail on healthy hosts under load.
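The retry pattern is short enough to sketch in C. This is an illustration of the idea rather than a port of Firecracker's Rust: the names backoff_us and create_vm_retry are invented, and the attempt limit is a parameter. The function sleeps only between attempts, and any error other than EINTR is returned to the caller at once.

```c
#include <assert.h>
#include <errno.h>
#include <linux/kvm.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <time.h>

/* Delay before retry number `attempt` (0-based): 1, 2, 4, ... microseconds. */
static long backoff_us(unsigned attempt)
{
    assert(attempt < 20);                   /* keeps the nanosecond value below 1 s */
    return 1L << attempt;
}

/* Hypothetical helper: retry KVM_CREATE_VM while it fails with EINTR.
 * Returns the VM fd, or -1 with errno set by the last failed attempt. */
static int create_vm_retry(int kvmfd, unsigned max_attempts)
{
    for (unsigned attempt = 0; ; attempt++) {
        int vmfd = ioctl(kvmfd, KVM_CREATE_VM, 0UL);
        if (vmfd >= 0 || errno != EINTR || attempt + 1 >= max_attempts)
            return vmfd;

        struct timespec delay = {
            .tv_sec = 0,
            .tv_nsec = backoff_us(attempt) * 1000L,
        };
        nanosleep(&delay, NULL);            /* an early wakeup is harmless here */
    }
}
```

Calling create_vm_retry(kvmfd, 5) gives the five-attempt behavior: a persistent EINTR is reported only after the last attempt, while errors such as EBADF surface immediately.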

Mapping Guest Physical Memory

The guest needs physical memory before it can run. KVM's model for guest memory is deliberately minimal: the VMM allocates host memory however it chooses — mmap, malloc, a memfd — and then registers the result with KVM via KVM_SET_USER_MEMORY_REGION. KVM does not manage guest memory allocation; it manages the mapping between host virtual addresses and guest physical addresses.

Note: These operations require that /dev/kvm is open and a VM fd exists. The program running the following ioctls must have read/write access to /dev/kvm, which on most Linux distributions means membership in the kvm group or root privilege. The backing mmap must remain valid for the entire lifetime of the VM; premature deallocation is undefined behavior with consequences the kernel cannot catch.

KVM_SET_USER_MEMORY_REGION is _IOW(0xAE, 0x46, struct kvm_userspace_memory_region), issued on the VM fd. The struct, from include/uapi/linux/kvm.h:

struct kvm_userspace_memory_region {
    __u32 slot;              /* bits 0-15: slot index;
                                bits 16-31: address space id
                                (requires KVM_CAP_MULTI_ADDRESS_SPACE) */
    __u32 flags;
    __u64 guest_phys_addr;   /* base guest physical address */
    __u64 memory_size;       /* bytes; 0 deletes the slot */
    __u64 userspace_addr;    /* host virtual address of backing memory */
};

guest_phys_addr is the base of the range in the guest's physical address space. userspace_addr is the host virtual address where the backing memory lives. The terminology is worth pausing on: the guest believes it is accessing physical memory at guest_phys_addr, while the host sees an ordinary virtual address range. EPT provides the hardware translation between those two views, with the kernel building EPT entries that map guest-physical addresses to the host-physical pages underlying userspace_addr. An EPT violation is a guest access to a guest-physical address the EPT does not yet map. When the address falls inside a registered slot, the kernel's EPT violation handler faults the page in and resumes the guest without surfacing anything to the VMM process. An access outside all registered slots is treated as MMIO and produces KVM_EXIT_MMIO; recent kernels can also report a fault they cannot resolve as KVM_EXIT_MEMORY_FAULT.
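The VMM needs the same guest-physical to host-virtual arithmetic in software whenever a device model reads or writes guest memory. The sketch below shows it for a single region; gpa_to_hva is a hypothetical helper name, and a real VMM would search a table of regions rather than test one.

```c
#include <assert.h>
#include <linux/kvm.h>
#include <stddef.h>
#include <stdint.h>

/* Translate a guest physical address to the host pointer that backs it.
 * Returns NULL if the address lies outside the region. */
static void *gpa_to_hva(const struct kvm_userspace_memory_region *r,
                        uint64_t gpa)
{
    assert(r != NULL);
    if (gpa < r->guest_phys_addr ||
        gpa - r->guest_phys_addr >= r->memory_size)
        return NULL;

    uint64_t offset = gpa - r->guest_phys_addr;
    return (void *)(uintptr_t)(r->userspace_addr + offset);
}
```

For a region with guest_phys_addr = 0x1000, guest address 0x1010 resolves to userspace_addr + 0x10, and anything below 0x1000 or at or past the end of the region yields NULL.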

The flags field has three defined bits:

| Flag | Bit | Meaning |
| --- | --- | --- |
| KVM_MEM_LOG_DIRTY_PAGES | 1UL << 0 | Enable dirty page tracking for live migration |
| KVM_MEM_READONLY | 1UL << 1 | Read-only slot (guest writes produce a fault) |
| KVM_MEM_GUEST_MEMFD | 1UL << 2 | Backed by a guest memfd (TDX / AMD SNP only) |

Firecracker assigns slot IDs via an atomic counter (next_kvm_slot: AtomicU32), sets flags = KVM_MEM_LOG_DIRTY_PAGES when dirty-page tracking is active for a snapshot, and passes flags = 0 otherwise. Calling the ioctl with an existing slot number replaces the prior registration in-place. Passing memory_size = 0 deletes the slot. For huge-page-backed memory the lower 21 bits of guest_phys_addr and userspace_addr should match so that EPT huge-page entries can be built without splitting the host's huge pages.
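The three details in that paragraph (the packed slot field, the dirty-logging flag, and the huge-page congruence rule) can be sketched together. All helper names are hypothetical. The 4 KiB page-size check is an assumption about the host, and the mask covers bits 0-20, which is the alignment a 2 MiB page needs.

```c
#include <assert.h>
#include <linux/kvm.h>
#include <stdint.h>
#include <sys/ioctl.h>

#define HUGE_2M_MASK ((1ULL << 21) - 1)     /* low 21 bits: offset in a 2 MiB page */

/* True if gpa and hva sit at the same offset within a 2 MiB page. */
static int hugepage_congruent(uint64_t gpa, uint64_t hva)
{
    return ((gpa ^ hva) & HUGE_2M_MASK) == 0;
}

/* Pack an address-space id and a slot index into the `slot` field. */
static uint32_t slot_field(uint16_t as_id, uint16_t slot)
{
    return ((uint32_t)as_id << 16) | slot;
}

/* Register, replace, or (with size == 0) delete one memory slot. */
static int set_region(int vmfd, uint32_t slot, uint64_t gpa, uint64_t size,
                      void *hva, int track_dirty)
{
    assert(gpa % 4096 == 0 && size % 4096 == 0);    /* assumes 4 KiB pages */

    struct kvm_userspace_memory_region region = {
        .slot = slot,
        .flags = track_dirty ? KVM_MEM_LOG_DIRTY_PAGES : 0,
        .guest_phys_addr = gpa,
        .memory_size = size,
        .userspace_addr = (uintptr_t)hva,
    };
    return ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, &region);
}
```

A caller that backs guest RAM with huge pages would check hugepage_congruent before set_region and adjust its mmap placement if the check fails.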

A newer KVM_SET_USER_MEMORY_REGION2 (_IOW(0xAE, 0x49, struct kvm_userspace_memory_region2)) extends the struct with a guest_memfd field for confidential VMs running under Intel TDX or AMD SEV-SNP, where guest memory must be isolated from the host even at the hypervisor level. For standard Firecracker use, the original ioctl is sufficient.

Creating a vCPU

With memory mapped, the next step is a vCPU. KVM_CREATE_VCPU is _IO(0xAE, 0x41), issued on the VM fd. It takes a vcpu_id integer that acts as the APIC ID on x86:

int vcpufd = ioctl(vmfd, KVM_CREATE_VCPU, 0);

Valid IDs are in [0, max_vcpu_id) where max_vcpu_id comes from KVM_CHECK_EXTENSION(KVM_CAP_MAX_VCPU_ID) (capability 128). Firecracker uses a 0-based index for vcpu_id and assigns each vCPU its own OS thread. For a two-vCPU guest, the host sees threads named fc_vcpu 0 and fc_vcpu 1, each blocked in KVM_RUN whenever the guest is executing.

The kvm_run Shared Region

Communication between the running guest and userspace happens through a shared memory region mapped onto the vCPU fd. Its size is not a compile-time constant — the kernel appends additional buffers after the struct kvm_run header, including I/O port data and coalesced MMIO pages — so the VMM must query it first:

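int mmap_size = ioctl(kvmfd, KVM_GET_VCPU_MMAP_SIZE, NULL);
if (mmap_size < 0)
    err(1, "KVM_GET_VCPU_MMAP_SIZE");
struct kvm_run *run = mmap(NULL, (size_t)mmap_size,
                           PROT_READ | PROT_WRITE, MAP_SHARED,
                           vcpufd, 0);
if (run == MAP_FAILED)
    err(1, "mmap kvm_run");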

KVM_GET_VCPU_MMAP_SIZE is _IO(0xAE, 0x04), a system ioctl. Using sizeof(struct kvm_run) as the mmap size is wrong; it truncates the region the kernel needs beyond the struct.

The layout of struct kvm_run from include/uapi/linux/kvm.h:

struct kvm_run {
    /* IN: written by userspace before KVM_RUN */
    __u8  request_interrupt_window;
    __u8  immediate_exit;
    __u8  padding1[6];

    /* OUT: written by kernel on exit */
    __u32 exit_reason;
    __u8  ready_for_interrupt_injection;
    __u8  if_flag;
    __u16 flags;

    /* SHARED: valid across all exits */
    __u64 cr8;
    __u64 apic_base;

    /* exit-specific data (anonymous union) */
    union { ... };

    /* optional sync-regs optimization */
    __u64 kvm_valid_regs;
    __u64 kvm_dirty_regs;
    union {
        struct kvm_sync_regs regs;
        char padding[2048];   /* SYNC_REGS_SIZE_BYTES */
    } s;
};

The two IN fields are written by userspace to control the next KVM_RUN call. immediate_exit is polled once when KVM_RUN starts: if it is non-zero, KVM_RUN returns -1 with errno = EINTR without entering the guest, and exit_reason is stale and must not be read. Setting the flag does not by itself interrupt a vCPU that is already executing guest code, so it is normally paired with a signal that kicks the vCPU thread out of KVM_RUN. Firecracker uses this combination to pause a vCPU for device-model events: the vCPU thread is signalled, immediate_exit is set to 1, KVM_RUN returns EINTR, and the run loop clears the flag before the next run. KVM_EXIT_INTR (10) belongs to the same error path rather than to a normal exit: when a pending signal forces KVM_RUN to return -1 with EINTR, the kernel records KVM_EXIT_INTR in exit_reason, although a VMM should act on errno there. request_interrupt_window set to 1 asks KVM to exit with KVM_EXIT_IRQ_WINDOW_OPEN (7) as soon as the guest can accept an interrupt, meaning its interrupt flag is set and nothing else blocks injection. That is the mechanism for delivering queued interrupts when the guest had them masked.

exit_reason is written by the kernel on exit. Everything in the anonymous union that follows it is exit-specific and is only valid for the indicated exit code.

KVM_RUN

KVM_RUN is _IO(0xAE, 0x80), issued on the vCPU fd. No argument is passed; all communication goes through the mmap'd struct kvm_run.

int ret = ioctl(vcpufd, KVM_RUN, NULL);

The return value encodes the outcome:

| Return | errno | Meaning |
| --- | --- | --- |
| 0 | (none) | Normal exit; read exit_reason and loop |
| -1 | EINTR | An unmasked signal interrupted the vCPU |
| -1 | ENOEXEC | vCPU not yet initialized |
| -1 | EFAULT or EHWPOISON | KVM_EXIT_MEMORY_FAULT (an exit reason that arrives with -1 instead of 0) |

The ordinary path returns 0. exit_reason then holds the reason code and the union holds the associated data. A return of -1 with EINTR means a signal arrived; the VMM should handle it and call KVM_RUN again. KVM_EXIT_MEMORY_FAULT (value 39) is the exception to the zero-return convention: when KVM hits a guest memory fault it cannot resolve, KVM_RUN returns -1 with errno = EFAULT or EHWPOISON and still sets exit_reason = 39. For any other errno, exit_reason should be treated as stale.
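One way to keep these cases apart in a run loop is to classify the (return value, errno) pair before touching exit_reason. The enum and function below are a hypothetical sketch; the EHWPOISON guard is there only because that errno is Linux-specific.

```c
#include <assert.h>
#include <errno.h>

enum run_outcome {
    RUN_EXIT,           /* returned 0: dispatch on exit_reason */
    RUN_INTERRUPTED,    /* -1/EINTR: handle the signal, then run again */
    RUN_MEMORY_FAULT,   /* -1/EFAULT or EHWPOISON: the memory-fault path */
    RUN_FATAL           /* any other error */
};

/* Classify one KVM_RUN result; `err` is errno captured right after the call. */
static enum run_outcome classify_run(int ret, int err)
{
    assert(ret == 0 || ret == -1);
    if (ret == 0)
        return RUN_EXIT;

    switch (err) {
    case EINTR:
        return RUN_INTERRUPTED;
    case EFAULT:
#ifdef EHWPOISON
    case EHWPOISON:
#endif
        return RUN_MEMORY_FAULT;
    default:
        return RUN_FATAL;
    }
}
```

Capturing errno immediately after the ioctl matters, because any later library call may overwrite it.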

KVM handles many exits internally without returning to userspace at all: EPT violations for pages within registered slots, APIC accesses with an in-kernel irqchip, and MSR reads and writes for MSRs KVM manages itself. The exits that surface to userspace are the ones the VMM must handle: I/O port accesses, MMIO accesses to unregistered regions, HLT, shutdown, and fail-entry. The set of in-kernel exits versus userspace exits is determined by the capabilities and irqchip configuration established before the first KVM_RUN call.

The Run Loop

The canonical C run loop, following the structure of the LWN example:

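/* needs <err.h>, <errno.h>, <stdio.h>, <sys/ioctl.h>, <linux/kvm.h> */
while (1) {
    if (ioctl(vcpufd, KVM_RUN, NULL) < 0) {
        if (errno == EINTR)
            continue;   /* signal arrived; exit_reason is not valid here */
        err(1, "KVM_RUN");
    }
    switch (run->exit_reason) {
    case KVM_EXIT_HLT:
        /* guest executed HLT with nothing queued */
        return 0;
    case KVM_EXIT_IO:
        /* only single-byte writes to COM1 are handled */
        if (run->io.direction != KVM_EXIT_IO_OUT || run->io.size != 1 ||
            run->io.port != 0x3f8 || run->io.count != 1)
            errx(1, "unhandled KVM_EXIT_IO");
        /* data at (char *)run + run->io.data_offset */
        putchar(*(((char *)run) + run->io.data_offset));
        break;
    case KVM_EXIT_FAIL_ENTRY:
        errx(1, "hardware_entry_failure_reason = 0x%llx",
             (unsigned long long)run->fail_entry.hardware_entry_failure_reason);
    case KVM_EXIT_INTERNAL_ERROR:
        errx(1, "suberror = 0x%x", run->internal.suberror);
    default:
        errx(1, "unexpected exit reason: %u", run->exit_reason);
    }
}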

Firecracker's run loop, in src/vmm/src/vstate/vcpu.rs, follows the same logic but with Rust's enum dispatch. The vCPU's run_emulation() function checks kvm_run.immediate_exit == 1 before calling fd.run(), so that a preemption request from another thread is honored without an unnecessary KVM_RUN call. On EINTR it clears immediate_exit and returns VcpuEmulation::Interrupted. The exit dispatch in handle_kvm_exit() routes MmioRead and MmioWrite to the MMIO bus, IoIn and IoOut to the port I/O bus, treats FailEntry and InternalError as VcpuError::FaultyKvmExit, interprets SystemEvent with type KVM_SYSTEM_EVENT_SHUTDOWN (1) or KVM_SYSTEM_EVENT_RESET (2) as VcpuEmulation::Stopped, and returns VcpuError::UnhandledKvmExit for anything unrecognized. Device model dispatch for MMIO and port I/O happens synchronously in the vCPU thread; there is no separate I/O thread for those paths.

The sequence diagram below shows the two paths a VM exit can take:

sequenceDiagram
    participant G as "Guest (VMX non-root)"
    participant K as "KVM"
    participant V as "VMM (firecracker)"
    G->>K: VM exit (any reason)
    K->>K: classify exit_reason
    alt handled in-kernel
        K->>G: VMRESUME (no userspace involvement)
    else requires VMM
        K->>V: KVM_RUN returns 0
        V->>V: dispatch on exit_reason
        V->>K: ioctl(vcpufd, KVM_RUN, NULL)
        K->>G: VMRESUME
    end

Exit Reason Codes

The full enumeration in include/uapi/linux/kvm.h currently runs from KVM_EXIT_UNKNOWN (0) through KVM_EXIT_SNP_REQ_CERTS (43). For x86_64 microVM workloads, the exits that actually matter are:

| Value | Constant | What triggered it |
| --- | --- | --- |
| 2 | KVM_EXIT_IO | Guest IN or OUT instruction |
| 5 | KVM_EXIT_HLT | Guest HLT with no pending interrupt |
| 6 | KVM_EXIT_MMIO | Guest access to an MMIO address |
| 7 | KVM_EXIT_IRQ_WINDOW_OPEN | Interrupt window requested and now open |
| 8 | KVM_EXIT_SHUTDOWN | Guest triple-fault or CPU reset |
| 9 | KVM_EXIT_FAIL_ENTRY | Hardware refused VM entry |
| 10 | KVM_EXIT_INTR | Signal pending; recorded alongside the -1/EINTR return |
| 16 | KVM_EXIT_NMI | NMI delivered to guest (x86, not handled by Firecracker) |
| 17 | KVM_EXIT_INTERNAL_ERROR | KVM internal error |
| 24 | KVM_EXIT_SYSTEM_EVENT | Guest requested shutdown or reset |

Values 13–15 are S390/PowerPC-specific. Value 16 (KVM_EXIT_NMI) is x86 but Firecracker routes it to VcpuError::UnhandledKvmExit. Values 18–23, 25–27, and 28–43 cover S390, PowerPC, ARM, RISC-V, Xen, LoongArch, TDX, and AMD SNP exits outside the Firecracker scope.

The Exit-Specific Union

Each exit reason with userspace-visible data populates one arm of the anonymous union in struct kvm_run. Five arms are worth examining in detail.

KVM_EXIT_IO (2)

struct {
    __u8  direction;     /* KVM_EXIT_IO_IN = 0, KVM_EXIT_IO_OUT = 1 */
    __u8  size;          /* 1, 2, or 4 bytes per operation */
    __u16 port;          /* I/O port number */
    __u32 count;         /* number of operations (REP prefix) */
    __u64 data_offset;   /* offset from start of kvm_run to data buffer */
} io;

The data is not embedded in the union. It lives at (char *)run + run->io.data_offset, which is exactly why the mmap region is larger than sizeof(struct kvm_run) — the kernel places the I/O data in the overhang. Total bytes transferred are io.count * io.size. On a COM1 write at port 0x3f8, io.port = 0x3f8, io.direction = KVM_EXIT_IO_OUT, io.size = 1, io.count = 1, and the character lives at ((char *)run) + run->io.data_offset.
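The pointer and length arithmetic for the overhang can be sketched as follows. The helper names and the serial_out sink are hypothetical. The port number 0x3f8 and the one-byte size come from the COM1 example above, and the sketch generalizes the count so that a REP OUTSB is handled too.

```c
#include <assert.h>
#include <linux/kvm.h>
#include <stddef.h>
#include <stdint.h>

/* Pointer to the port-I/O payload in the overhang after struct kvm_run. */
static uint8_t *io_data(struct kvm_run *run)
{
    assert(run->exit_reason == KVM_EXIT_IO);
    return (uint8_t *)run + run->io.data_offset;
}

/* Total payload length: io.size bytes for each of io.count operations. */
static size_t io_len(const struct kvm_run *run)
{
    return (size_t)run->io.count * run->io.size;
}

/* Hypothetical serial sink: copy bytes the guest wrote to COM1 into out[].
 * Returns the number of bytes copied, or 0 for any other access. */
static size_t serial_out(struct kvm_run *run, char *out, size_t cap)
{
    if (run->io.direction != KVM_EXIT_IO_OUT ||
        run->io.port != 0x3f8 || run->io.size != 1)
        return 0;

    size_t n = io_len(run) < cap ? io_len(run) : cap;
    const uint8_t *src = io_data(run);
    for (size_t i = 0; i < n; i++)
        out[i] = (char)src[i];
    return n;
}
```

For an IN instruction the direction is reversed: the VMM writes io_len(run) bytes to io_data(run) before the next KVM_RUN, and KVM hands them to the guest.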

KVM_EXIT_MMIO (6)

struct {
    __u64 phys_addr;   /* guest physical address */
    __u8  data[8];     /* data bytes (embedded; max 8 bytes) */
    __u32 len;         /* 1, 2, 4, or 8 */
    __u8  is_write;    /* 0 = read, 1 = write */
} mmio;

Unlike KVM_EXIT_IO, the MMIO data is embedded directly in the struct at run->mmio.data[0]. There is no data_offset indirection. For a read, the VMM writes the response into data before calling KVM_RUN again; KVM will relay it to the guest.
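Both directions of an MMIO exit reduce to moving at most eight bytes in or out of mmio.data. The helpers below are a hypothetical sketch. They assume a little-endian guest, which holds for x86, so the low-order byte of the value goes first.

```c
#include <assert.h>
#include <linux/kvm.h>
#include <stdint.h>

/* Complete an MMIO read by storing a device register value in the exit data.
 * Returns 0 on success, or -1 if the exit is actually a write. */
static int mmio_complete_read(struct kvm_run *run, uint64_t value)
{
    assert(run->mmio.len <= sizeof run->mmio.data);
    if (run->mmio.is_write)
        return -1;

    for (uint32_t i = 0; i < run->mmio.len; i++)
        run->mmio.data[i] = (uint8_t)(value >> (8 * i));
    return 0;
}

/* Assemble the value the guest wrote from the first mmio.len bytes. */
static uint64_t mmio_written_value(const struct kvm_run *run)
{
    uint64_t value = 0;

    assert(run->mmio.len <= sizeof run->mmio.data);
    for (uint32_t i = 0; i < run->mmio.len; i++)
        value |= (uint64_t)run->mmio.data[i] << (8 * i);
    return value;
}
```

A device model would look up mmio.phys_addr on its MMIO bus, then call one helper or the other depending on mmio.is_write.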

KVM_EXIT_FAIL_ENTRY (9)

struct {
    __u64 hardware_entry_failure_reason;  /* VMX/SVM hardware error code */
    __u32 cpu;                            /* physical CPU on which the failure occurred */
} fail_entry;

KVM_EXIT_FAIL_ENTRY means the hardware declined to enter VMX non-root mode. The hardware_entry_failure_reason is a VMX VM-entry failure qualification or an SVM equivalent. This exit is almost always the result of incorrect segment descriptor setup, misaligned page tables, or an invalid VMCS state. It is not recoverable; the VMM should log the reason code and tear down the VM.

KVM_EXIT_INTERNAL_ERROR (17)

struct {
    __u32 suberror;   /* see below */
    __u32 ndata;      /* count of valid entries in data[] */
    __u64 data[16];
} internal;

suberror takes four defined values: KVM_INTERNAL_ERROR_EMULATION (1, the kernel failed to emulate an instruction), KVM_INTERNAL_ERROR_SIMUL_EX (2, simultaneous exceptions), KVM_INTERNAL_ERROR_DELIVERY_EV (3, event delivery error), and KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON (4). Like KVM_EXIT_FAIL_ENTRY, this exit is not recoverable in normal operation. Firecracker treats it as VcpuError::FaultyKvmExit and halts the VM.

KVM_EXIT_SYSTEM_EVENT (24)

struct {
    __u32 type;
    __u32 ndata;
    union { __u64 data[16]; };
} system_event;

type carries KVM_SYSTEM_EVENT_SHUTDOWN (1), KVM_SYSTEM_EVENT_RESET (2), KVM_SYSTEM_EVENT_CRASH (3), KVM_SYSTEM_EVENT_WAKEUP (4), KVM_SYSTEM_EVENT_SUSPEND (5), or KVM_SYSTEM_EVENT_SEV_TERM (6). This is the path a well-behaved guest kernel uses when it calls reboot(2) or shuts down cleanly; KVM_EXIT_SHUTDOWN (8) is the fault path when the guest triple-faults. Firecracker maps both to VcpuEmulation::Stopped and performs an orderly teardown.

Reading and Writing Guest Registers

Before the first KVM_RUN call, the VMM must initialize the vCPU register state. The default state after KVM_CREATE_VCPU leaves the vCPU pointed at the x86 reset vector — CS base at the top of the 4 GiB address space, RIP at 0xFFF0 — which is appropriate for BIOS-style boot but wrong for direct kernel boot. Firecracker skips the BIOS entirely and sets registers to the Linux 64-bit boot protocol entry state before the first run.

KVM_GET_REGS and KVM_SET_REGS

KVM_GET_REGS is _IOR(0xAE, 0x81, struct kvm_regs). KVM_SET_REGS is _IOW(0xAE, 0x82, struct kvm_regs). Both are vCPU ioctls. On x86, from arch/x86/include/uapi/asm/kvm.h:

struct kvm_regs {
    __u64 rax, rbx, rcx, rdx;
    __u64 rsi, rdi, rsp, rbp;
    __u64 r8,  r9,  r10, r11;
    __u64 r12, r13, r14, r15;
    __u64 rip, rflags;
};

rflags must be initialized to at least 0x2. Bit 1 is architecturally reserved-must-be-1; a VM will fail to enter VMX non-root mode without it, and the failure surfaces as KVM_EXIT_FAIL_ENTRY.

Firecracker's setup_regs() in src/vmm/src/arch/x86_64/regs.rs writes two configurations depending on boot protocol:

Linux 64-bit boot protocol
  rflags = 0x2
  rip    = kernel_entry_addr
  rsp    = BOOT_STACK_POINTER
  rbp    = BOOT_STACK_POINTER
  rsi    = ZERO_PAGE_START   // pointer to boot-params zero page
PVH boot entry point
  rflags = 0x2
  rip    = entry_addr
  rbx    = PVH_INFO_START    // address of PVH start_info structure

KVM_GET_SREGS and KVM_SET_SREGS

The general-purpose registers in kvm_regs cover computation but not the CPU's operating mode. Segment registers, descriptor-table registers, control registers, and EFER live in a separate ioctl pair. KVM_GET_SREGS is _IOR(0xAE, 0x83, struct kvm_sregs). KVM_SET_SREGS is _IOW(0xAE, 0x84, struct kvm_sregs).

struct kvm_sregs {
    struct kvm_segment cs, ds, es, fs, gs, ss;
    struct kvm_segment tr, ldt;
    struct kvm_dtable gdt, idt;
    __u64 cr0, cr2, cr3, cr4, cr8;
    __u64 efer;
    __u64 apic_base;
    __u64 interrupt_bitmap[(KVM_NR_INTERRUPTS + 63) / 64]; /* 4 x __u64 */
};

KVM_NR_INTERRUPTS is 256, so interrupt_bitmap is four 64-bit words — a bitmap of pending external interrupts, one bit per IRQ vector 0–255.

struct kvm_segment maps directly to x86 descriptor fields:

struct kvm_segment {
    __u64 base;
    __u32 limit;
    __u16 selector;
    __u8  type;       /* 4-bit x86 segment type */
    __u8  present, dpl, db, s, l, g, avl;
    __u8  unusable;
    __u8  padding;
};

dpl is the descriptor privilege level (0–3), db is the default operand size, s distinguishes system (0) from code/data (1) descriptors, l marks a 64-bit code segment, g is the granularity bit, and avl is available for OS use.

The CS fix-up that every minimal KVM example needs: a freshly created vCPU's CS still has its base pointing to the reset vector at the top of the 4 GiB address space. Before setting RIP to an arbitrary guest physical address, the VMM must redirect CS. The LWN example does exactly this:

sregs.cs.base = 0;
sregs.cs.selector = 0;

Without this, a vCPU with rip = 0x1000 and CS base still at 0xFFFF0000 would execute at effective address 0xFFFF1000, not 0x1000. The fix is two fields, not one.

For Firecracker's 64-bit long-mode setup, setup_sregs() works through a longer sequence: (1) call KVM_GET_SREGS to fetch the vCPU's current state, (2) write a 4-entry GDT at guest physical address BOOT_GDT_OFFSET = 0x500 and a stub IDT at BOOT_IDT_OFFSET = 0x520, (3) set cr0 with PE | PG (0x1 | 0x80000000) to enter protected mode with paging enabled, cr4 with PAE (0x20) to enable physical address extension, and efer with LME | LMA (0x100 | 0x400) to activate 64-bit long mode, (4) install page tables with PML4 at GPA 0x9000, PDPTE at 0xa000, and PDE at 0xb000, and (5) commit with KVM_SET_SREGS. The PVH path sets cr0 with PE | ET (0x1 | 0x10) instead, reflecting the different boot entry contract. The guest vCPU enters its first KVM_RUN already in 64-bit long mode, with valid page tables and a stack — there is no mode-switch sequence during boot at all.
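Step (4) is the least obvious part of that sequence, so here is a sketch of how three tables at those addresses can describe an address space. The table addresses come from the text above. The identity map of the first 1 GiB with 2 MiB pages and the entry flag values are assumptions of this sketch, and guest_mem stands for the host mapping of guest physical address 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PML4_START     0x9000u
#define PDPTE_START    0xa000u
#define PDE_START      0xb000u
#define ENTRY_TABLE    0x03u    /* present | writable */
#define ENTRY_2M_PAGE  0x83u    /* present | writable | page size */

/* Store a little-endian 64-bit entry at guest physical address gpa. */
static void write_entry(uint8_t *guest_mem, uint64_t gpa, uint64_t entry)
{
    for (int i = 0; i < 8; i++)
        guest_mem[gpa + i] = (uint8_t)(entry >> (8 * i));
}

/* Identity-map guest-virtual [0, 1 GiB) using 512 two-MiB pages. */
static void build_identity_map(uint8_t *guest_mem, size_t mem_size)
{
    assert(guest_mem != NULL && mem_size >= PDE_START + 0x1000);

    write_entry(guest_mem, PML4_START, PDPTE_START | ENTRY_TABLE);
    write_entry(guest_mem, PDPTE_START, PDE_START | ENTRY_TABLE);
    for (uint64_t i = 0; i < 512; i++)
        write_entry(guest_mem, PDE_START + 8 * i, (i << 21) | ENTRY_2M_PAGE);
}
```

With tables laid out this way, the VMM would point sregs.cr3 at PML4_START before calling KVM_SET_SREGS, so the first instruction fetch in long mode already has a translation.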

The Ioctl Number Reference

All ioctls discussed in this chapter:

| Ioctl | Macro | Scope |
| --- | --- | --- |
| KVM_GET_API_VERSION | _IO(0xAE, 0x00) | System |
| KVM_CREATE_VM | _IO(0xAE, 0x01) | System |
| KVM_CHECK_EXTENSION | _IO(0xAE, 0x03) | System / VM |
| KVM_GET_VCPU_MMAP_SIZE | _IO(0xAE, 0x04) | System |
| KVM_CREATE_VCPU | _IO(0xAE, 0x41) | VM |
| KVM_SET_USER_MEMORY_REGION | _IOW(0xAE, 0x46, struct kvm_userspace_memory_region) | VM |
| KVM_RUN | _IO(0xAE, 0x80) | vCPU |
| KVM_GET_REGS | _IOR(0xAE, 0x81, struct kvm_regs) | vCPU |
| KVM_SET_REGS | _IOW(0xAE, 0x82, struct kvm_regs) | vCPU |
| KVM_GET_SREGS | _IOR(0xAE, 0x83, struct kvm_sregs) | vCPU |
| KVM_SET_SREGS | _IOW(0xAE, 0x84, struct kvm_sregs) | vCPU |

For ioctls built with _IOW or _IOR, the numeric ioctl value includes the struct size encoded in bits 16–29. Compute it from the macro rather than hardcoding the integer.
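The kernel's own accessor macros make the embedded size visible, which is a quick way to see why a hardcoded integer breaks when a struct layout differs. The wrappers below are hypothetical; _IOC_SIZE and _IOC_NR are the standard accessors that come in with the ioctl headers, and KVMIO is the magic byte as defined in <linux/kvm.h>.

```c
#include <assert.h>
#include <linux/kvm.h>
#include <stddef.h>
#include <sys/ioctl.h>

/* Argument size encoded in a KVM ioctl number (0 for plain _IO requests). */
static size_t kvm_ioctl_arg_size(unsigned long request)
{
    assert(_IOC_TYPE(request) == KVMIO);    /* only meaningful for KVM requests */
    return (size_t)_IOC_SIZE(request);
}

/* Command byte of a KVM ioctl number. */
static unsigned kvm_ioctl_command(unsigned long request)
{
    assert(_IOC_TYPE(request) == KVMIO);
    return (unsigned)_IOC_NR(request);
}
```

For example, kvm_ioctl_arg_size(KVM_SET_USER_MEMORY_REGION) equals sizeof(struct kvm_userspace_memory_region), while kvm_ioctl_arg_size(KVM_RUN) is 0.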

What the API Surface Reveals

The three-scope ioctl model is not arbitrary. It enforces a lifecycle: a system fd cannot issue vCPU commands, and a VM fd cannot run code. Crossing a scope boundary is an immediate ENOTTY. This structure also means the kernel can perform per-scope privilege checks — a process that holds only a VM fd forwarded over a Unix socket cannot enumerate other VMs on the system.

The shared kvm_run region is the design's most interesting tradeoff. KVM could have passed exit information in ioctl arguments, but shared memory allows the immediate_exit and request_interrupt_window fields to be set with a plain memory write, from a signal handler or another thread, without an additional syscall. The cost is that the VMM must be careful about concurrent access to the region: some fields are inputs that userspace writes before KVM_RUN, while others are outputs, such as exit_reason, that the kernel writes and that are meaningful only once KVM_RUN has returned. In practice the per-vCPU thread model Firecracker uses makes the concurrency straightforward: the vCPU thread owns the region except for the two flags that may be set from outside its run loop.

The next chapter works through what happens when the guest accesses an address outside any registered memory slot, and how EPT and NPT mediate the full address translation from guest-virtual to host-physical through two independent page table walks.

Sources And Further Reading