Appendix C: The KVM ioctl Reference

This appendix covers every ioctl the book uses, grouped by the scope of the fd it targets, with its _IO/_IOW/_IOR/_IOWR encoding, its raw hex number, and a plain-English account of its effect. The capability table at the end maps KVM_CAP_* constants to the ioctls they gate. For the mental model of why the three-fd hierarchy exists and what a VM exit looks like from the VMM's side, see Chapter 1. For how Firecracker wires KVM_IRQFD and KVM_IOEVENTFD into its device model, see Chapter 8.


The 32-Bit ioctl Encoding

Every KVM ioctl is a 32-bit integer assembled by _IOC(dir, type, nr, size) from four packed fields defined in include/uapi/asm-generic/ioctl.h.

Field Bits Width Meaning
nr 7–0 8 bits Function number within the subsystem
type 15–8 8 bits Magic byte identifying the subsystem
size 29–16 14 bits sizeof the argument struct (advisory only)
dir 31–30 2 bits Data direction: 0 = none, 1 = write-to-kernel, 2 = read-from-kernel, 3 = both

Four derived macros cover the common cases:

#define _IO(type, nr)            _IOC(0, (type), (nr), 0)
#define _IOR(type, nr, argtype)  _IOC(2, (type), (nr), sizeof(argtype))
#define _IOW(type, nr, argtype)  _IOC(1, (type), (nr), sizeof(argtype))
#define _IOWR(type, nr, argtype) _IOC(3, (type), (nr), sizeof(argtype))

For _IO ioctls, where dir = 0 and size = 0, the encoding collapses to (type << 8) | nr. Because every KVM ioctl uses the magic byte KVMIO = 0xAE, an _IO ioctl with function number nr always encodes as 0x0000AE00 | nr, whichever fd scope it targets. The size field is advisory: the ioctl(2) man page notes that "the size bits are very unreliable: in lots of cases they are wrong," and the kernel does not rely on the encoded size for correctness.
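The packing described above can be sketched in a few lines of C. This is a minimal illustration that assumes the asm-generic field layout (x86 and arm64); the names ioc_encode, ioc_decode, and struct ioc_fields are hypothetical helpers, not kernel APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Shifts from include/uapi/asm-generic/ioctl.h (x86, arm64 layout). */
enum { IOC_NRSHIFT = 0, IOC_TYPESHIFT = 8, IOC_SIZESHIFT = 16, IOC_DIRSHIFT = 30 };
/* Direction values: none, write-to-kernel, read-from-kernel. */
enum { IOC_NONE = 0, IOC_WRITE = 1, IOC_READ = 2 };

#define KVMIO 0xAEu

/* Hypothetical helper: pack the four fields into one 32-bit number. */
uint32_t ioc_encode(uint32_t dir, uint32_t type, uint32_t nr, uint32_t size)
{
    assert(dir < 4 && type < 256 && nr < 256 && size < (1u << 14));
    return (dir << IOC_DIRSHIFT) | (size << IOC_SIZESHIFT) |
           (type << IOC_TYPESHIFT) | (nr << IOC_NRSHIFT);
}

/* Hypothetical helper: split a number back into its fields. */
struct ioc_fields { uint32_t dir, type, nr, size; };

struct ioc_fields ioc_decode(uint32_t cmd)
{
    struct ioc_fields f = {
        .dir  = (cmd >> IOC_DIRSHIFT)  & 0x3,
        .type = (cmd >> IOC_TYPESHIFT) & 0xFF,
        .nr   = (cmd >> IOC_NRSHIFT)   & 0xFF,
        .size = (cmd >> IOC_SIZESHIFT) & 0x3FFF,
    };
    return f;
}
```

With this layout, ioc_encode(IOC_NONE, KVMIO, 0x01, 0) gives 0x0000AE01, and an _IOW ioctl with function number 0x46 and a 32-byte argument struct gives 0x4020AE46. Decoding is handy when reading raw numbers in strace output.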

The Linux ioctl-number registry assigns 0xAE to two subsystems: KVM (numbers 0x00–0x1F and 0x40–0xFF in linux/kvm.h) and AWS Nitro Enclaves (numbers 0x20–0x3F in linux/nitro_enclaves.h). On x86 and arm64 the encoding above is exact; PowerPC overrides _IOC_SIZEBITS and _IOC_DIRBITS, which shifts bits in the size and dir fields for _IOW/_IOR/_IOWR ioctls, though nr and type are identical on all architectures.


Three Scopes, Three File Descriptors

KVM organises its API around a three-level fd hierarchy. Issuing an ioctl on the wrong fd produces ENOTTY; there is no fallback.

flowchart TD
    A["open('/dev/kvm') → system fd"] --> B["ioctl(sysfd, KVM_CREATE_VM, 0) → VM fd"]
    B --> C["ioctl(vmfd, KVM_CREATE_VCPU, n) → vCPU fd"]

System fd. Opened directly from /dev/kvm. Ioctls here query or configure KVM as a whole: check the API version, create a VM, probe capabilities, fetch the supported CPUID set.

VM fd. Returned by KVM_CREATE_VM. Ioctls here configure one virtual machine: map memory, create vCPUs, wire the irqchip, register eventfd bindings.

vCPU fd. Returned by KVM_CREATE_VCPU. Ioctls here configure and run one virtual CPU: set registers, program CPUID, and issue KVM_RUN.

A threading constraint applies: VM ioctls must originate from the same process (address space) that called KVM_CREATE_VM. vCPU ioctls should come from the thread that called KVM_CREATE_VCPU, except for ioctls explicitly documented as asynchronous. Writing the immediate_exit field of struct kvm_run, typically from a signal handler, is a plain memory store rather than an ioctl, so the constraint does not apply to it.

Note. /dev/kvm is a character device, on most distributions with mode 0660 and owned root:kvm. Opening it then requires either root or membership in the kvm group. Any command or program in this appendix that opens /dev/kvm needs that access.


System Ioctls

Issued on the fd returned by open("/dev/kvm", O_RDWR).

Ioctl Encoding Nr One-line effect
KVM_GET_API_VERSION _IO(KVMIO, 0x00) 0x00 Returns KVM_API_VERSION = 12; abort if not 12
KVM_CREATE_VM _IO(KVMIO, 0x01) 0x01 Creates a VM; returns VM fd
KVM_GET_MSR_INDEX_LIST _IOWR(KVMIO, 0x02, struct kvm_msr_list) 0x02 Returns MSR indices the kernel handles
KVM_CHECK_EXTENSION _IO(KVMIO, 0x03) 0x03 Tests a KVM_CAP_*; returns 0 (absent) or positive (present)
KVM_GET_VCPU_MMAP_SIZE _IO(KVMIO, 0x04) 0x04 Returns bytes to mmap on each vCPU fd to obtain struct kvm_run
KVM_GET_SUPPORTED_CPUID _IOWR(KVMIO, 0x05, struct kvm_cpuid2) 0x05 Fills CPUID leaves KVM can emulate on this host
KVM_GET_MSR_FEATURE_INDEX_LIST _IOWR(KVMIO, 0x0a, struct kvm_msr_list) 0x0a Returns the feature MSR indices that can be read with KVM_GET_MSRS on the system fd

KVM_GET_API_VERSION

_IO(KVMIO, 0x00)   encoded: 0x0000AE00

Returns the integer constant KVM_API_VERSION, which is hard-coded to 12 and has been 12 since at least Linux 2.6.22; the kernel docs note that 2.6.20 and 2.6.21 reported earlier unsupported values. Any VMM that receives a value other than 12 must refuse to continue. There is no migration path; the value is frozen.
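The version check can be sketched as a small C helper. This is an illustration under stated assumptions: open_kvm_checked is a hypothetical name, and the ioctl number is defined locally from the encoded value given above rather than taken from <linux/kvm.h>, which a real VMM would include.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Encoded value from this section; normally supplied by <linux/kvm.h>. */
#define KVM_GET_API_VERSION_NR   0x0000AE00UL
#define KVM_EXPECTED_API_VERSION 12

/* Hypothetical helper: open the device and refuse anything but API 12.
 * Returns the system fd on success, -1 on any failure. */
int open_kvm_checked(const char *path)
{
    assert(path != NULL);
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;                      /* missing device or no permission */

    int version = ioctl(fd, KVM_GET_API_VERSION_NR, 0);
    if (version != KVM_EXPECTED_API_VERSION) {
        close(fd);                      /* not the stable API: do not continue */
        return -1;
    }
    return fd;
}
```

A caller would pass "/dev/kvm" and treat -1 as fatal; there is nothing to negotiate when the version differs.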

KVM_CREATE_VM

_IO(KVMIO, 0x01)   encoded: 0x0000AE01

The argument is a machine-type integer. Pass 0 for the standard VM type on x86; the KVM documentation says "you probably want to use 0." Returns a new VM fd. The new VM has no vCPUs and no memory; both require subsequent ioctls before KVM_RUN is valid.

KVM_CREATE_VM can return EINTR because its kernel path calls mm_take_all_locks(), which is CPU-intensive and interruptible. Firecracker (src/vmm/src/vstate/vm.rs) handles this by retrying up to five times with exponential back-off on EINTR. A VMM that does not retry will fail spuriously under load.
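A retry loop of this kind might look like the sketch below. It is not Firecracker's code: retry_on_eintr, fallible_op, and the flaky_op stand-in are hypothetical names, and the delay schedule (doubling from a caller-supplied base) is an assumption chosen only to illustrate exponential back-off.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <time.h>

/* An operation that returns >= 0 on success or -1 with errno set,
 * for example a wrapper around ioctl(sysfd, KVM_CREATE_VM, 0). */
typedef int (*fallible_op)(void *ctx);

/* Hypothetical helper: retry op while it fails with EINTR, doubling the
 * sleep between attempts, and give up after max_attempts calls. */
int retry_on_eintr(fallible_op op, void *ctx, int max_attempts, long base_delay_us)
{
    assert(op != NULL && max_attempts > 0 && base_delay_us >= 0);
    long delay_us = base_delay_us;
    for (int attempt = 1; ; attempt++) {
        int ret = op(ctx);
        if (ret >= 0 || errno != EINTR || attempt == max_attempts)
            return ret;
        struct timespec ts = { .tv_sec  = delay_us / 1000000,
                               .tv_nsec = (delay_us % 1000000) * 1000 };
        nanosleep(&ts, NULL);
        delay_us *= 2;
    }
}

/* Stand-in used for illustration: fails with EINTR a fixed number of
 * times, then returns a fake fd value of 7. */
struct flaky { int failures_left; int calls; };

int flaky_op(void *ctx)
{
    struct flaky *f = ctx;
    f->calls++;
    if (f->failures_left > 0) {
        f->failures_left--;
        errno = EINTR;
        return -1;
    }
    return 7;
}
```

Errors other than EINTR are returned immediately, so a genuine failure such as ENOMEM is not masked by the retry.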

KVM_CHECK_EXTENSION

_IO(KVMIO, 0x03)   encoded: 0x0000AE03

Takes a KVM_CAP_* integer. Returns 0 if the capability is absent, or a positive integer if present. Some capabilities encode a meaningful count in the return value rather than simply returning 1: KVM_CAP_NR_MEMSLOTS, for example, returns the maximum number of memory slots the VM supports.

KVM_CHECK_EXTENSION can be issued on the system fd (global query) or on a VM fd (VM-specific query). The VM-level call is preferred because different VMs may present different capabilities. The VM-level form requires KVM_CAP_CHECK_EXTENSION_VM.

KVM_GET_VCPU_MMAP_SIZE

_IO(KVMIO, 0x04)   encoded: 0x0000AE04

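Returns the byte size of the region that must be mmap(2)-ed on each vCPU fd to obtain struct kvm_run. The returned size is often larger than sizeof(struct kvm_run) because extra pages follow the struct in the same mapping. On x86 these are the PIO data page that kvm_run.io.data_offset points into and, where KVM_CAP_COALESCED_MMIO is available, the coalesced MMIO ring page.

Both extra regions are included in the returned size. Always pass the full returned size as the length argument to mmap, not sizeof(struct kvm_run). Using the smaller value leaves those pages unmapped, so the VMM faults or reads unrelated memory when it follows data_offset or walks the ring.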


VM Ioctls

Issued on the fd returned by KVM_CREATE_VM.

Ioctl Encoding Nr One-line effect
KVM_CREATE_VCPU _IO(KVMIO, 0x41) 0x41 Creates a vCPU; returns vCPU fd
KVM_GET_DIRTY_LOG _IOW(KVMIO, 0x42, struct kvm_dirty_log) 0x42 Returns and clears the dirty-page bitmap for a memory slot
KVM_SET_USER_MEMORY_REGION _IOW(KVMIO, 0x46, struct kvm_userspace_memory_region) 0x46 Maps host virtual memory into guest physical address space
KVM_SET_TSS_ADDR _IO(KVMIO, 0x47) 0x47 Sets Intel VMX internal TSS guest physical address (Intel hosts only)
KVM_SET_IDENTITY_MAP_ADDR _IOW(KVMIO, 0x48, __u64) 0x48 Sets guest physical address of the identity-map page for real-mode entry (Intel)
KVM_SET_USER_MEMORY_REGION2 _IOW(KVMIO, 0x49, struct kvm_userspace_memory_region2) 0x49 Extended form; adds guest_memfd support for confidential VMs
KVM_CREATE_IRQCHIP _IO(KVMIO, 0x60) 0x60 Creates in-kernel PIC + IOAPIC + per-vCPU LAPIC on x86
KVM_IRQ_LINE _IOW(KVMIO, 0x61, struct kvm_irq_level) 0x61 Sets the level of an IRQ line on the in-kernel irqchip
KVM_SET_GSI_ROUTING _IOW(KVMIO, 0x6a, struct kvm_irq_routing) 0x6a Programs GSI-to-irqchip-pin or GSI-to-MSI routing table
KVM_IRQFD _IOW(KVMIO, 0x76, struct kvm_irqfd) 0x76 Binds an eventfd to a GSI so writes signal the interrupt
KVM_IOEVENTFD _IOW(KVMIO, 0x79, struct kvm_ioeventfd) 0x79 Fires an eventfd when the guest writes to an MMIO or PIO address
KVM_SET_CLOCK _IOW(KVMIO, 0x7b, struct kvm_clock_data) 0x7b Sets the VM's master clock
KVM_GET_CLOCK _IOR(KVMIO, 0x7c, struct kvm_clock_data) 0x7c Reads the VM's master clock
KVM_CREATE_DEVICE _IOWR(KVMIO, 0xe0, struct kvm_create_device) 0xe0 Creates an in-kernel device (e.g. GICv3 on arm64)

KVM_CREATE_VCPU

_IO(KVMIO, 0x41)   encoded: 0x0000AE41

The argument is the vCPU ID integer, which must be in [0, max_vcpu_id), where max_vcpu_id is the value KVM_CHECK_EXTENSION returns for KVM_CAP_MAX_VCPU_ID. At most as many vCPUs as KVM_CAP_MAX_VCPUS reports may be added to a VM. Returns a vCPU fd.

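On x86 the new vCPU starts in the architectural power-on reset state (16-bit real mode at the reset vector), not in a state from which a kernel can be entered directly. A VMM that boots a kernel without firmware must call at minimum KVM_SET_SREGS and KVM_SET_REGS before the first KVM_RUN; the kernel does not pre-initialize registers for any boot protocol.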

Sequencing constraint. If an in-kernel IRQ chip is desired, KVM_CREATE_IRQCHIP must be called before KVM_CREATE_VCPU. Each new vCPU receives a wired local APIC only if the irqchip already exists at creation time. The reverse order does not work: current x86 kernels reject KVM_CREATE_IRQCHIP with EINVAL once any vCPU exists, and a VMM that ignores that error ends up with vCPUs that lack in-kernel LAPICs and with broken interrupt delivery.

KVM_SET_USER_MEMORY_REGION

_IOW(KVMIO, 0x46, struct kvm_userspace_memory_region)

Requires capability KVM_CAP_USER_MEMORY. The struct is:

struct kvm_userspace_memory_region {
    __u32 slot;           /* bits 0–15: slot ID; bits 16–31: address space */
    __u32 flags;
    __u64 guest_phys_addr;
    __u64 memory_size;    /* bytes; 0 = delete the slot */
    __u64 userspace_addr; /* host virtual address; must span full memory_size */
};

The userspace_addr field is the crucial insight: the guest "physical" address space is backed by host virtual memory, typically obtained with mmap. KVM resolves that host virtual address to a host-physical page frame through the host page tables and installs the resulting guest-physical-to-host-physical translation in EPT (on Intel) or NPT (on AMD). The guest never sees or controls this translation layer.

Flags:

Flag Value Meaning
KVM_MEM_LOG_DIRTY_PAGES (1UL << 0) Enable dirty-page tracking for live migration
KVM_MEM_READONLY (1UL << 1) Guest writes produce KVM_EXIT_MMIO instead of writing through; requires KVM_CAP_READONLY_MEM
KVM_MEM_GUEST_MEMFD (1UL << 2) Back slot with a guest_memfd object for confidential VMs

Slots must not overlap within the same address space. An existing slot can be moved or have its flags changed but cannot be resized in place; set memory_size = 0 to delete the slot first, then re-add it. To enable 2 MiB large-page backing, the lower 21 bits of guest_phys_addr and userspace_addr should match so that KVM can back the range with a single 2 MiB mapping rather than 512 separate 4 KiB pages.
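The slot packing, the 21-bit congruence rule, and the overlap rule are easy to check before issuing the ioctl. The helpers below are a sketch with hypothetical names (memslot_field, hugepage_congruent, gpa_ranges_overlap); they assume ranges that do not wrap around the 64-bit address space.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Pack the 32-bit slot field: bits 0-15 slot ID, bits 16-31 address space. */
uint32_t memslot_field(uint16_t address_space, uint16_t slot_id)
{
    return ((uint32_t)address_space << 16) | slot_id;
}

/* True when the low 21 bits of both addresses agree, which is the
 * precondition for backing the range with 2 MiB pages. */
bool hugepage_congruent(uint64_t guest_phys_addr, uint64_t userspace_addr)
{
    const uint64_t mask = (UINT64_C(1) << 21) - 1;
    return (guest_phys_addr & mask) == (userspace_addr & mask);
}

/* True when [a, a + a_size) and [b, b + b_size) share at least one byte.
 * Assumes neither range wraps around UINT64_MAX. */
bool gpa_ranges_overlap(uint64_t a, uint64_t a_size, uint64_t b, uint64_t b_size)
{
    assert(a_size > 0 && b_size > 0);
    return a < b + b_size && b < a + a_size;
}
```

A VMM could run gpa_ranges_overlap against every registered slot in the same address space and reject a new region locally instead of waiting for the kernel to refuse it.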

Firecracker allocates slot IDs sequentially via next_kvm_slot() and enforces the limit reported by KVM_CHECK_EXTENSION(KVM_CAP_NR_MEMSLOTS) on the VM fd.

KVM_SET_TSS_ADDR

_IO(KVMIO, 0x47)   encoded: 0x0000AE47

Required on Intel VMX hosts only. KVM reserves a three-page region (3 × 4 KiB = 12 KiB) in guest physical space as an internal Task State Segment used by VMX bookkeeping. The argument is the guest physical address of the first page. The address must be within the first 4 GiB of guest physical space and must not overlap any memory slot or MMIO range. Guest access to this region produces undefined behavior; the VMM must ensure the firmware and guest OS never map or use it.

KVM_CREATE_IRQCHIP

_IO(KVMIO, 0x60)   encoded: 0x0000AE60

Requires KVM_CAP_IRQCHIP. On x86, creates three in-kernel interrupt controller components: a virtual IOAPIC, a PIC master (i8259A), and a PIC slave (i8259A). Each subsequently created vCPU also receives a local APIC. The default GSI routing installed by KVM_CREATE_IRQCHIP routes GSIs 0–15 to both the PIC and the IOAPIC, and GSIs 16–23 to the IOAPIC only.

On arm64, KVM_CREATE_DEVICE with type KVM_DEV_TYPE_ARM_VGIC_V3 is now the preferred path for the GIC rather than this ioctl.

Must be called before KVM_CREATE_VCPU; see the sequencing note under that entry.

KVM_IRQFD

_IOW(KVMIO, 0x76, struct kvm_irqfd)

Requires KVM_CAP_IRQFD.

struct kvm_irqfd {
    __u32 fd;
    __u32 gsi;         /* Global System Interrupt number */
    __u32 flags;
    __u32 resamplefd;  /* used with KVM_IRQFD_FLAG_RESAMPLE */
    __u8  pad[16];
};

Writing the bound eventfd signals the GSI. The kernel resolves the GSI to an irqchip pin or an MSI vector using the routing table set with KVM_SET_GSI_ROUTING. Firecracker uses routing entries of type KVM_IRQ_ROUTING_IRQCHIP (value 1, IOAPIC pin) and KVM_IRQ_ROUTING_MSI (value 2). When KVM_CAP_MSI_DEVID is present, MSI entries may carry the KVM_MSI_VALID_DEVID flag.

Flag Value Meaning
KVM_IRQFD_FLAG_DEASSIGN (1 << 0) Remove an existing binding
KVM_IRQFD_FLAG_RESAMPLE (1 << 1) Level-triggered: re-arm after delivery via resamplefd

The value of KVM_IRQFD is that interrupt injection bypasses the VMM entirely on the hot path: a virtio device backend writes an eventfd, and the kernel delivers the interrupt to the guest without a round-trip through userspace.

KVM_IOEVENTFD

_IOW(KVMIO, 0x79, struct kvm_ioeventfd)

Requires KVM_CAP_IOEVENTFD.

struct kvm_ioeventfd {
    __u64 datamatch;
    __u64 addr;    /* MMIO or PIO address */
    __u32 len;     /* 0, 1, 2, 4, or 8 */
    __s32 fd;
    __u32 flags;
    __u8  pad[36];
};

Flag Value Meaning
KVM_IOEVENTFD_FLAG_DATAMATCH (1 << 0) Fire only when written value equals datamatch
KVM_IOEVENTFD_FLAG_PIO (1 << 1) addr is an I/O port, not MMIO
KVM_IOEVENTFD_FLAG_DEASSIGN (1 << 2) Remove an existing registration
KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY (1 << 3) s390 virtio-ccw specific

KVM_IOEVENTFD is the complementary mechanism to KVM_IRQFD: where KVM_IRQFD lets the host signal the guest without a userspace round-trip, KVM_IOEVENTFD lets the guest notify the host without one. When the guest writes to the registered MMIO or PIO address, the kernel fires the eventfd directly, never returning to userspace. This is the primary mechanism for virtio queue kick notification. With KVM_CAP_IOEVENTFD_ANY_LENGTH, len = 0 is valid and the kernel ignores the write width.
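As an illustration of how a registration request is filled in, the sketch below mirrors the struct and flag values listed above. The struct is a local copy so the example stands alone; mmio_kick_request and deassign_request are hypothetical helper names, and using the queue index as datamatch is an assumption about a virtio-style notify register, not a statement about any particular VMM.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of struct kvm_ioeventfd as listed in this section. */
struct ioeventfd_req {
    uint64_t datamatch;
    uint64_t addr;
    uint32_t len;
    int32_t  fd;
    uint32_t flags;
    uint8_t  pad[36];
};

enum {
    IOEVENTFD_FLAG_DATAMATCH = 1 << 0,
    IOEVENTFD_FLAG_PIO       = 1 << 1,
    IOEVENTFD_FLAG_DEASSIGN  = 1 << 2
};

_Static_assert(sizeof(struct ioeventfd_req) == 64,
               "mirror must keep the 64-byte layout of the kernel struct");

/* Hypothetical helper: request for an MMIO notify address that fires
 * only when the guest writes the given queue index. */
struct ioeventfd_req mmio_kick_request(uint64_t notify_addr, uint32_t len,
                                       uint64_t queue_index, int eventfd)
{
    assert(len == 1 || len == 2 || len == 4 || len == 8);
    struct ioeventfd_req req;
    memset(&req, 0, sizeof req);        /* padding must be zero */
    req.addr = notify_addr;
    req.len = len;
    req.fd = eventfd;
    req.datamatch = queue_index;
    req.flags = IOEVENTFD_FLAG_DATAMATCH;
    return req;
}

/* Hypothetical helper: same registration with the DEASSIGN flag added. */
struct ioeventfd_req deassign_request(struct ioeventfd_req req)
{
    req.flags |= IOEVENTFD_FLAG_DEASSIGN;
    return req;
}
```

Removing a registration reuses the original fields with DEASSIGN set, which is why the sketch derives the second request from the first.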


vCPU Ioctls

Issued on the fd returned by KVM_CREATE_VCPU.

Ioctl Encoding Nr One-line effect
KVM_RUN _IO(KVMIO, 0x80) 0x80 Enter guest; returns when exit requires VMM attention
KVM_GET_REGS _IOR(KVMIO, 0x81, struct kvm_regs) 0x81 Reads x86 GPRs, RIP, and RFLAGS
KVM_SET_REGS _IOW(KVMIO, 0x82, struct kvm_regs) 0x82 Writes x86 GPRs, RIP, and RFLAGS
KVM_GET_SREGS _IOR(KVMIO, 0x83, struct kvm_sregs) 0x83 Reads segment registers, descriptor tables, CR*, EFER
KVM_SET_SREGS _IOW(KVMIO, 0x84, struct kvm_sregs) 0x84 Writes segment registers, descriptor tables, CR*, EFER
KVM_GET_MSRS _IOWR(KVMIO, 0x88, struct kvm_msrs) 0x88 Reads one or more MSRs
KVM_SET_MSRS _IOW(KVMIO, 0x89, struct kvm_msrs) 0x89 Writes one or more MSRs
KVM_SET_CPUID2 _IOW(KVMIO, 0x90, struct kvm_cpuid2) 0x90 Programs CPUID leaves returned to the guest
KVM_GET_CPUID2 _IOWR(KVMIO, 0x91, struct kvm_cpuid2) 0x91 Reads the CPUID table currently set for this vCPU
KVM_GET_FPU _IOR(KVMIO, 0x8c, struct kvm_fpu) 0x8c Reads x87 FPU and SSE state
KVM_SET_FPU _IOW(KVMIO, 0x8d, struct kvm_fpu) 0x8d Writes x87 FPU and SSE state
KVM_GET_LAPIC _IOR(KVMIO, 0x8e, struct kvm_lapic_state) 0x8e Reads local APIC page (requires in-kernel APIC)
KVM_SET_LAPIC _IOW(KVMIO, 0x8f, struct kvm_lapic_state) 0x8f Writes local APIC page
KVM_ENABLE_CAP _IOW(KVMIO, 0xa3, struct kvm_enable_cap) 0xa3 Enables a per-vCPU capability
KVM_GET_VCPU_EVENTS _IOR(KVMIO, 0x9f, struct kvm_vcpu_events) 0x9f Reads pending exceptions, interrupts, and NMI state
KVM_SET_VCPU_EVENTS _IOW(KVMIO, 0xa0, struct kvm_vcpu_events) 0xa0 Writes pending exceptions, interrupts, and NMI state
KVM_GET_XSAVE _IOR(KVMIO, 0xa4, struct kvm_xsave) 0xa4 Reads XSAVE area (requires KVM_CAP_XSAVE)
KVM_SET_XSAVE _IOW(KVMIO, 0xa5, struct kvm_xsave) 0xa5 Writes XSAVE area
KVM_GET_XCRS _IOR(KVMIO, 0xa6, struct kvm_xcrs) 0xa6 Reads extended control registers including XCR0
KVM_SET_XCRS _IOW(KVMIO, 0xa7, struct kvm_xcrs) 0xa7 Writes extended control registers
KVM_GET_ONE_REG _IOW(KVMIO, 0xab, struct kvm_one_reg) 0xab Reads a single named register (arm64 and other arches); _IOW not _IOR — pointer-based, see detail below
KVM_SET_ONE_REG _IOW(KVMIO, 0xac, struct kvm_one_reg) 0xac Writes a single named register
KVM_KVMCLOCK_CTRL _IO(KVMIO, 0xad) 0xad Flags the vCPU as paused by the host in its pvclock data so the guest watchdog tolerates the gap (needed after snapshot restore)
KVM_SET_SIGNAL_MASK _IOW(KVMIO, 0x8b, struct kvm_signal_mask) 0x8b Sets the signal mask active while this vCPU is running inside KVM_RUN

KVM_RUN

_IO(KVMIO, 0x80)   encoded: 0x0000AE80

No explicit argument. All communication between the VMM and the kernel about this vCPU's execution happens through the struct kvm_run region mapped at offset 0 of the vCPU fd. The region's size comes from KVM_GET_VCPU_MMAP_SIZE.

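Returns 0 on clean exit, -1 on error. Notable errno values: EINTR when an unmasked signal is pending (the VMM handles the signal and calls KVM_RUN again), ENOEXEC when the vCPU has not been initialized or, on arm64, when the guest tried to execute from device memory, and EPERM on arm64 when SVE is enabled but not finalized.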

Key fields in struct kvm_run used around the call:

Field Type Direction Meaning
request_interrupt_window __u8 in (VMM writes) Causes exit when the guest is ready to accept an external interrupt
immediate_exit __u8 in (VMM writes) Polled once when KVM_RUN starts; if non-zero, KVM_RUN returns EINTR immediately without entering the guest. Pair it with a signal to interrupt a vCPU that is already running
exit_reason __u32 out (kernel writes) KVM_EXIT_* constant indicating why the guest exited
ready_for_interrupt_injection __u8 out 1 if an interrupt can be injected now
if_flag __u8 out Current value of RFLAGS.IF

Firecracker writes immediate_exit from a signal handler and follows the write with fence(Ordering::Release), ensuring the store is visible before any subsequent Acquire load on the vCPU thread. The signal itself kicks a vCPU that is already inside KVM_RUN; the flag closes the race in which the signal arrives just before the vCPU thread issues the ioctl, because the kernel polls the field on entry to KVM_RUN and returns EINTR instead of entering the guest.
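A C rendering of the same store-then-fence pattern might look like the sketch below. It is an analogue of the Rust code described above, not a translation of Firecracker's source; request_immediate_exit and clear_immediate_exit are hypothetical names, and the pointer is assumed to address the immediate_exit byte inside the mmap-ed struct kvm_run.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Intended for a signal handler: store 1, then issue a release fence so
 * the store is ordered before anything the handler does afterwards. */
void request_immediate_exit(volatile uint8_t *immediate_exit)
{
    *immediate_exit = 1;
    atomic_thread_fence(memory_order_release);
}

/* Called by the vCPU thread after KVM_RUN has returned EINTR, so that
 * the next KVM_RUN is allowed to enter the guest again. */
void clear_immediate_exit(volatile uint8_t *immediate_exit)
{
    assert(immediate_exit != NULL);
    *immediate_exit = 0;
}
```

The flag has to be cleared by the vCPU thread before it calls KVM_RUN again; otherwise every subsequent call returns EINTR straight away.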

The exit reason determines which union subfield of kvm_run is valid. See the exit-reason table below.

KVM_GET_REGS / KVM_SET_REGS

_IOR(KVMIO, 0x81, struct kvm_regs)   (GET)
_IOW(KVMIO, 0x82, struct kvm_regs)   (SET)

The x86_64 struct, from arch/x86/include/uapi/asm/kvm.h:

struct kvm_regs {
    __u64 rax, rbx, rcx, rdx;
    __u64 rsi, rdi, rsp, rbp;
    __u64 r8,  r9,  r10, r11;
    __u64 r12, r13, r14, r15;
    __u64 rip, rflags;
};

Firecracker programs these registers differently depending on boot protocol. For 64-bit Linux boot (BootProtocol::LinuxBoot):

Register Value Meaning
rip kernel entry address First instruction of the loaded kernel image
rsp / rbp BOOT_STACK_POINTER Top of the boot stack
rsi ZERO_PAGE_START Pointer to boot params (Linux 64-bit boot ABI)
rflags 0x0000_0000_0000_0002 Bit 1 (Reserved=1) always set; interrupts disabled

For PVH boot (BootProtocol::PvhBoot):

Register Value Meaning
rip PVH entry address First instruction
rbx PVH_INFO_START Pointer to hvm_start_info struct
rflags 0x0000_0000_0000_0002 Same as above
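The two tables above translate into register setup along the following lines. The struct is a local mirror of struct kvm_regs so the sketch stands alone; linux_boot_regs and pvh_boot_regs are hypothetical helpers, and the addresses are passed in as parameters because their concrete values are not given here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the x86_64 struct kvm_regs shown above. */
struct boot_regs {
    uint64_t rax, rbx, rcx, rdx;
    uint64_t rsi, rdi, rsp, rbp;
    uint64_t r8,  r9,  r10, r11;
    uint64_t r12, r13, r14, r15;
    uint64_t rip, rflags;
};

_Static_assert(sizeof(struct boot_regs) == 18 * sizeof(uint64_t),
               "eighteen 64-bit registers, no padding");

#define RFLAGS_RESERVED_BIT1 UINT64_C(0x2)   /* always-one reserved bit */
#define RFLAGS_IF            UINT64_C(0x200) /* interrupt enable flag */

/* 64-bit Linux boot: entry in rip, stack in rsp/rbp, boot params in rsi. */
struct boot_regs linux_boot_regs(uint64_t entry, uint64_t stack_top,
                                 uint64_t zero_page)
{
    struct boot_regs regs;
    memset(&regs, 0, sizeof regs);
    regs.rip = entry;
    regs.rsp = stack_top;
    regs.rbp = stack_top;
    regs.rsi = zero_page;
    regs.rflags = RFLAGS_RESERVED_BIT1;
    assert((regs.rflags & RFLAGS_IF) == 0);  /* interrupts start disabled */
    return regs;
}

/* PVH boot: entry in rip, pointer to hvm_start_info in rbx. */
struct boot_regs pvh_boot_regs(uint64_t entry, uint64_t start_info)
{
    struct boot_regs regs;
    memset(&regs, 0, sizeof regs);
    regs.rip = entry;
    regs.rbx = start_info;
    regs.rflags = RFLAGS_RESERVED_BIT1;
    return regs;
}
```

In a real VMM the resulting struct would be handed to KVM_SET_REGS on the vCPU fd; every register not named in the tables is left at zero in this sketch.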

KVM_GET_SREGS / KVM_SET_SREGS

_IOR(KVMIO, 0x83, struct kvm_sregs)   (GET)
_IOW(KVMIO, 0x84, struct kvm_sregs)   (SET)

struct kvm_sregs {
    struct kvm_segment cs, ds, es, fs, gs, ss, tr, ldt;
    struct kvm_dtable  gdt, idt;
    __u64 cr0, cr2, cr3, cr4, cr8;
    __u64 efer;
    __u64 apic_base;
    __u64 interrupt_bitmap[(256 + 63) / 64];
};

interrupt_bitmap is four __u64 words (since KVM_NR_INTERRUPTS = 256, so (256 + 63) / 64 = 4). At most one bit may be set. It represents an external interrupt acknowledged by the APIC but not yet injected into the CPU core.
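Reading and writing the bitmap comes down to word and bit arithmetic on the vector number. The helpers below are a sketch with hypothetical names (pending_vector, set_pending_vector) that operate on a plain four-word array standing in for the interrupt_bitmap field.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BITMAP_WORDS 4   /* (256 + 63) / 64 */

/* Return the vector whose bit is set, or -1 when nothing is pending. */
int pending_vector(const uint64_t bitmap[BITMAP_WORDS])
{
    for (int word = 0; word < BITMAP_WORDS; word++) {
        if (bitmap[word] == 0)
            continue;
        for (int bit = 0; bit < 64; bit++) {
            if (bitmap[word] & (UINT64_C(1) << bit))
                return word * 64 + bit;
        }
    }
    return -1;
}

/* Clear the bitmap and mark exactly one vector as pending, which keeps
 * the "at most one bit" rule by construction. */
void set_pending_vector(uint64_t bitmap[BITMAP_WORDS], unsigned vector)
{
    assert(vector < 256);
    memset(bitmap, 0, BITMAP_WORDS * sizeof bitmap[0]);
    bitmap[vector / 64] = UINT64_C(1) << (vector % 64);
}
```

Vector 0x21, for example, lands in word 0 at bit 33, and vector 200 lands in word 3 at bit 8.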

Firecracker always calls KVM_GET_SREGS first, modifies the relevant fields, then calls KVM_SET_SREGS. Overwriting the struct wholesale would corrupt APIC state and any fields the kernel may have already populated.

For 64-bit Linux boot, Firecracker sets:

Field Value Meaning
cr0 |= X86_CR0_PE (0x1) Protected Mode Enable
efer |= EFER_LME | EFER_LMA (0x100 | 0x400) Long Mode Enable + Long Mode Active
cr4 |= X86_CR4_PAE (0x20) Physical Address Extension
cr3 0x9000 PML4 base; PDPTE at 0xa000; PDE at 0xb000

The GDT sits at guest physical 0x500 with four entries: NULL, CODE (0xa09b), DATA (0xc093), TSS (0x808b). The IDT sits at guest physical 0x520 with limit 7 (one 8-byte entry).
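The 16-bit flag values quoted above become 8-byte descriptors once they are packed together with a base and a limit. The sketch below shows the standard x86 descriptor packing; gdt_entry and table_limit are hypothetical helper names, and the base of 0 and limit of 0xfffff used in the example values are assumptions about a flat segment, not figures taken from this section.

```c
#include <assert.h>
#include <stdint.h>

/* Pack flags (access byte in bits 0-7, granularity nibble in bits 12-15),
 * a 32-bit base, and a 20-bit limit into one 8-byte segment descriptor. */
uint64_t gdt_entry(uint16_t flags, uint32_t base, uint32_t limit)
{
    assert(limit <= 0xFFFFF);
    return (((uint64_t)base  & 0xFF000000u) << (56 - 24)) |
           (((uint64_t)flags & 0x0000F0FFu) << 40) |
           (((uint64_t)limit & 0x000F0000u) << (48 - 16)) |
           (((uint64_t)base  & 0x00FFFFFFu) << 16) |
           ((uint64_t)limit  & 0x0000FFFFu);
}

/* Descriptor-table limit: size in bytes minus one, 8 bytes per entry. */
uint16_t table_limit(unsigned entries)
{
    assert(entries > 0 && entries <= 8192);
    return (uint16_t)(entries * 8 - 1);
}
```

Under those assumptions the CODE flags 0xa09b pack to 0x00AF9B000000FFFF and the DATA flags 0xc093 to 0x00CF93000000FFFF. A four-entry GDT has limit 31, which is also why a table at 0x500 ends exactly where the IDT at 0x520 begins, and a one-entry IDT has limit 7.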

For PVH boot, the setup is different: cr0 = X86_CR0_PE | X86_CR0_ET = 0x11 (32-bit protected mode, no paging), cr4 = 0.

KVM_SET_CPUID2

_IOW(KVMIO, 0x90, struct kvm_cpuid2)

struct kvm_cpuid2 {
    __u32 nent;
    __u32 padding;
    struct kvm_cpuid_entry2 entries[];
};

struct kvm_cpuid_entry2 {
    __u32 function;   /* CPUID leaf (EAX input) */
    __u32 index;      /* CPUID sub-leaf (ECX input) */
    __u32 flags;
    __u32 eax, ebx, ecx, edx;
    __u32 padding[3];
};

The KVM_CPUID_FLAG_SIGNIFCANT_INDEX flag (value (1 << 0); the misspelling is in the kernel header and must be reproduced) must be set on entries where the index field distinguishes sub-leaves, such as leaf 0xB (extended topology) and leaf 0xD (XSAVE state). Two other flag bits exist in the header (KVM_CPUID_FLAG_STATEFUL_FUNC = bit 1, KVM_CPUID_FLAG_STATE_READ_NEXT = bit 2) but are vestigial and not used in current VMM practice.

Two sequencing rules matter and both come directly from the kernel documentation. First: "Using KVM_SET_CPUID{,2} after KVM_RUN may cause guest instability." The CPUID table must be set before the first KVM_RUN on any vCPU. Second: all vCPUs in a VM should receive identical CPUID data unless the guest explicitly supports per-CPU CPUID differences; heterogeneous tables produce guest instability. KVM_SET_CPUID2 supersedes the older KVM_SET_CPUID (_IOW(KVMIO, 0x8a, struct kvm_cpuid)); use KVM_SET_CPUID2 on any modern kernel.

The typical workflow is to call KVM_GET_SUPPORTED_CPUID on the system fd to obtain the host-supported leaves, apply any per-guest masking or feature-hiding policy, and then call KVM_SET_CPUID2 on each vCPU.
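The masking step in that workflow is ordinary array manipulation on the entries. The sketch below uses a local mirror of struct kvm_cpuid_entry2 and two hypothetical helpers, clear_ecx_bit and mark_indexed_leaves. Hiding CPUID.1:ECX bit 5 (VMX) in the usage shown in the note is only an example policy, not a recommendation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of struct kvm_cpuid_entry2 as listed above. */
struct cpuid_entry {
    uint32_t function;
    uint32_t index;
    uint32_t flags;
    uint32_t eax, ebx, ecx, edx;
    uint32_t padding[3];
};

/* Same value as the kernel's KVM_CPUID_FLAG_SIGNIFCANT_INDEX. */
#define CPUID_FLAG_SIGNIFICANT_INDEX (1u << 0)

/* Clear one ECX feature bit on every entry of the given leaf.
 * Returns how many entries actually changed. */
size_t clear_ecx_bit(struct cpuid_entry *entries, size_t n,
                     uint32_t function, unsigned bit)
{
    assert(entries != NULL && bit < 32);
    size_t changed = 0;
    for (size_t i = 0; i < n; i++) {
        if (entries[i].function == function && (entries[i].ecx & (1u << bit))) {
            entries[i].ecx &= ~(1u << bit);
            changed++;
        }
    }
    return changed;
}

/* Set the significant-index flag on the sub-leaf families named above
 * (0xB and 0xD); other leaves are left untouched. */
void mark_indexed_leaves(struct cpuid_entry *entries, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (entries[i].function == 0xB || entries[i].function == 0xD)
            entries[i].flags |= CPUID_FLAG_SIGNIFICANT_INDEX;
    }
}
```

A VMM would call clear_ecx_bit(entries, nent, 1, 5) on the table obtained from KVM_GET_SUPPORTED_CPUID to hide VMX from the guest, then pass the same edited table to KVM_SET_CPUID2 on every vCPU so that all vCPUs see identical data.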

KVM_SET_FPU

_IOW(KVMIO, 0x8d, struct kvm_fpu)

Firecracker's initialization sets two fields in struct kvm_fpu before the first KVM_RUN: fcw = 0x37f (the x87 FPU control word with all exceptions masked) and mxcsr = 0x1f80 (the SSE control/status register with all SIMD exceptions masked). Without this initialization the FPU starts in an unpredictable state and any guest floating-point operation may produce a spurious #MF or #XF exception that panics the guest kernel before it has printed its first console line.

KVM_GET_ONE_REG / KVM_SET_ONE_REG

_IOW(KVMIO, 0xab, struct kvm_one_reg)   (GET)
_IOW(KVMIO, 0xac, struct kvm_one_reg)   (SET)

Both use _IOW (write-to-kernel direction). This is intentional: struct kvm_one_reg contains a pointer to the caller's output buffer rather than embedding the data inline, so in both the GET and the SET case the struct itself travels only from userspace to the kernel. The pointer dereference is what transfers data in the GET direction.
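The pointer-in-struct arrangement can be made concrete with a small simulation. The struct below mirrors struct kvm_one_reg (a 64-bit register id and a 64-bit user address); one_reg_request and fake_kernel_get are hypothetical names, and fake_kernel_get merely stands in for what the kernel does on the GET path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of struct kvm_one_reg: register id plus user pointer. */
struct one_reg {
    uint64_t id;
    uint64_t addr;
};

/* Build a request whose addr field points at the caller's buffer. */
struct one_reg one_reg_request(uint64_t id, uint64_t *value)
{
    assert(value != NULL);
    struct one_reg req = { .id = id, .addr = (uint64_t)(uintptr_t)value };
    return req;
}

/* Stand-in for the kernel's GET path: the request struct is only read
 * (hence const, matching _IOW), while the register value is written
 * through the pointer it carries. */
void fake_kernel_get(const struct one_reg *req, uint64_t register_value)
{
    uint64_t *dst = (uint64_t *)(uintptr_t)req->addr;
    *dst = register_value;
}
```

The const qualifier on the request is the point of the example: nothing in the struct changes during a GET, so there is nothing for the kernel to copy back.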

Used primarily on arm64 and other non-x86 architectures where there is no flat struct kvm_regs. On x86 the flat register ioctls are preferred.

KVM_KVMCLOCK_CTRL

_IO(KVMIO, 0xad)   encoded: 0x0000AEAD

Sets the PVCLOCK_GUEST_STOPPED flag in this vCPU's pvclock structure, telling the guest that the host paused it. A Firecracker fix merged in 2024 added a KVM_KVMCLOCK_CTRL call before resuming a vCPU from a snapshot to prevent the Linux guest watchdog from firing a soft-lockup warning on time-skewed restore. Without the flag, the guest kernel sees a large jump in elapsed time and the watchdog fires within seconds of snapshot resume.


KVM_EXIT_* Exit Reason Codes

When KVM_RUN returns 0, kvm_run.exit_reason holds one of the following KVM_EXIT_* constants, defined in include/uapi/linux/kvm.h. The union subfields listed below are valid only for their respective exit reasons.

Constant Value Meaning
KVM_EXIT_UNKNOWN 0 Unrecognised exit; inspect kvm_run.hw.hardware_exit_reason
KVM_EXIT_EXCEPTION 1 x86 hardware exception
KVM_EXIT_IO 2 PIO in/out; see kvm_run.io subfield
KVM_EXIT_HYPERCALL 3 Hypercall
KVM_EXIT_DEBUG 4 Debug event
KVM_EXIT_HLT 5 Guest executed HLT with no pending work
KVM_EXIT_MMIO 6 MMIO access; see kvm_run.mmio subfield
KVM_EXIT_IRQ_WINDOW_OPEN 7 Interrupt window open (response to request_interrupt_window)
KVM_EXIT_SHUTDOWN 8 Guest shutdown; on x86 typically a triple fault
KVM_EXIT_FAIL_ENTRY 9 VMX/SVM VM-entry failure; see kvm_run.fail_entry subfield
KVM_EXIT_INTR 10 Host signal interrupted KVM_RUN (errno = EINTR); must retry
KVM_EXIT_SET_TPR 11 CR8 write (Task Priority Register)
KVM_EXIT_TPR_ACCESS 12 TPR access reporting
KVM_EXIT_NMI 16 NMI window
KVM_EXIT_INTERNAL_ERROR 17 KVM internal error; see kvm_run.internal.suberror
KVM_EXIT_SYSTEM_EVENT 24 System event; see kvm_run.system_event subfield
KVM_EXIT_X86_RDMSR 29 Unhandled RDMSR (user-space MSR handling enabled)
KVM_EXIT_X86_WRMSR 30 Unhandled WRMSR
KVM_EXIT_DIRTY_RING_FULL 31 Dirty ring full; VMM must drain before KVM_RUN again
KVM_EXIT_MEMORY_FAULT 39 Guest accessed memory with no valid mapping

The full list in Linux 6.x runs to roughly 40 values; only those the book exercises are shown here. The complete set is in include/uapi/linux/kvm.h.

Union Subfields for Common Exits

KVM_EXIT_IO (exit_reason = 2):

struct {
    __u8  direction;   /* 0 = IN (guest reading from port; VMM provides value), 1 = OUT (guest writing to port; VMM consumes value) */
    __u8  size;        /* 1, 2, or 4 bytes */
    __u16 port;
    __u32 count;
    __u64 data_offset; /* byte offset from start of kvm_run to data buffer */
} io;
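Because data_offset is relative to the start of the mapped region, locating the PIO payload is a matter of pointer arithmetic on the mmap base. The sketch below uses a local mirror of the io subfield; io_data and io_data_len are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of the io subfield shown above. */
struct io_exit {
    uint8_t  direction;    /* 0 = IN, 1 = OUT */
    uint8_t  size;         /* bytes per access: 1, 2, or 4 */
    uint16_t port;
    uint32_t count;        /* number of accesses in this exit */
    uint64_t data_offset;  /* from the start of the mapped region */
};

/* Pointer to the payload inside the mapped region. */
uint8_t *io_data(void *run_base, const struct io_exit *io)
{
    assert(run_base != NULL && io != NULL);
    return (uint8_t *)run_base + io->data_offset;
}

/* Total payload length in bytes: one element per access. */
size_t io_data_len(const struct io_exit *io)
{
    assert(io->size == 1 || io->size == 2 || io->size == 4);
    return (size_t)io->size * io->count;
}
```

For an OUT exit the VMM reads io_data_len bytes from that pointer; for an IN exit it writes the same number of bytes there before the next KVM_RUN. Either way the offset normally lands outside struct kvm_run itself, which is why the full mmap size matters.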

KVM_EXIT_MMIO (exit_reason = 6):

struct {
    __u64 phys_addr;   /* guest physical address */
    __u8  data[8];     /* value read or to be written */
    __u32 len;         /* access size in bytes */
    __u8  is_write;
} mmio;

KVM_EXIT_FAIL_ENTRY (exit_reason = 9):

struct {
    __u64 hardware_entry_failure_reason;  /* VM-entry control failure code */
    __u32 cpu;
} fail_entry;

KVM_EXIT_INTERNAL_ERROR (exit_reason = 17):

struct {
    __u32 suberror;
    __u32 ndata;
    __u64 data[16];
} internal;

KVM_EXIT_SYSTEM_EVENT (exit_reason = 24):

struct {
    __u32 type;   /* KVM_SYSTEM_EVENT_SHUTDOWN=1, KVM_SYSTEM_EVENT_RESET=2,
                     KVM_SYSTEM_EVENT_CRASH=3 */
    __u32 ndata;
    union { __u64 flags; __u64 data[16]; };
} system_event;

Firecracker's VcpuExit match in src/vmm/src/vstate/vcpu.rs maps the exits as follows: MmioRead/MmioWrite dispatch to the device bus; IoIn/IoOut dispatch to the IO bus; X86Rdmsr/X86Wrmsr are handled for select MSRs (including the kvmclock MSR); SystemEvent maps KVM_SYSTEM_EVENT_SHUTDOWN and KVM_SYSTEM_EVENT_RESET to microVM lifecycle actions; FailEntry and InternalError are logged and returned as FaultyKvmExit.


The KVM_RUN Loop

The sequence diagram below traces the full VMM lifecycle from open to exit, placing the ioctls in their correct order and on their correct fd targets. Capability checks are shown before the operations they gate.

sequenceDiagram
    participant VMM
    participant KVM

    VMM->>KVM: open("/dev/kvm") → sysfd
    VMM->>KVM: ioctl(sysfd, KVM_GET_API_VERSION) → 12
    VMM->>KVM: ioctl(sysfd, KVM_CHECK_EXTENSION, KVM_CAP_USER_MEMORY)
    VMM->>KVM: ioctl(sysfd, KVM_CHECK_EXTENSION, KVM_CAP_IRQCHIP)
    VMM->>KVM: ioctl(sysfd, KVM_GET_VCPU_MMAP_SIZE) → mmap_size
    VMM->>KVM: ioctl(sysfd, KVM_CREATE_VM, 0) → vmfd
    VMM->>KVM: ioctl(vmfd, KVM_SET_TSS_ADDR, tss_gpa)
    VMM->>KVM: ioctl(vmfd, KVM_CREATE_IRQCHIP)
    VMM->>KVM: ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, slot)
    VMM->>KVM: ioctl(vmfd, KVM_CREATE_VCPU, 0) → vcpufd
    VMM->>VMM: mmap(vcpufd, mmap_size) → kvm_run*
    VMM->>KVM: ioctl(vcpufd, KVM_SET_SREGS, sregs)
    VMM->>KVM: ioctl(vcpufd, KVM_SET_REGS, regs)
    VMM->>KVM: ioctl(vcpufd, KVM_SET_CPUID2, cpuid)
    loop until shutdown
        VMM->>KVM: ioctl(vcpufd, KVM_RUN)
        KVM-->>VMM: returns, kvm_run.exit_reason set
        VMM->>VMM: dispatch on exit_reason
    end

The diagram omits KVM_IRQFD, KVM_IOEVENTFD, and KVM_SET_GSI_ROUTING for clarity. Those three are issued on the VM fd after KVM_CREATE_IRQCHIP and before the first KVM_RUN, as Chapter 8 details.
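The dispatch step at the bottom of the loop can be sketched as a pure function over the KVM_RUN return value, errno, and exit_reason. The exit-reason values come from the table earlier in this appendix; classify_run and enum run_action are hypothetical names, and the policy shown (for example, treating HLT as a reason to stop) is an assumption for illustration rather than what any particular VMM does.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Values from the exit-reason table; normally from <linux/kvm.h>. */
enum {
    KVM_EXIT_IO = 2, KVM_EXIT_HLT = 5, KVM_EXIT_MMIO = 6,
    KVM_EXIT_SHUTDOWN = 8, KVM_EXIT_FAIL_ENTRY = 9, KVM_EXIT_INTR = 10,
    KVM_EXIT_INTERNAL_ERROR = 17, KVM_EXIT_SYSTEM_EVENT = 24
};

enum run_action { RUN_AGAIN, RUN_HANDLE_PIO, RUN_HANDLE_MMIO, RUN_STOP, RUN_FATAL };

/* Decide what the loop does next. ret and err are the ioctl return value
 * and the errno captured right after it. */
enum run_action classify_run(int ret, int err, uint32_t exit_reason)
{
    assert(ret == 0 || ret == -1);
    if (ret < 0)
        return err == EINTR ? RUN_AGAIN : RUN_FATAL;

    switch (exit_reason) {
    case KVM_EXIT_IO:             return RUN_HANDLE_PIO;
    case KVM_EXIT_MMIO:           return RUN_HANDLE_MMIO;
    case KVM_EXIT_INTR:           return RUN_AGAIN;
    case KVM_EXIT_HLT:
    case KVM_EXIT_SHUTDOWN:
    case KVM_EXIT_SYSTEM_EVENT:   return RUN_STOP;
    case KVM_EXIT_FAIL_ENTRY:
    case KVM_EXIT_INTERNAL_ERROR: return RUN_FATAL;
    default:                      return RUN_FATAL;  /* unhandled exit */
    }
}
```

Keeping the decision separate from the ioctl call makes the loop body easy to test without a KVM-capable host: the caller performs the KVM_RUN, captures errno, and then acts on the returned action.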


Capability Check Reference

KVM_CHECK_EXTENSION takes a KVM_CAP_* integer. The table below covers the capabilities most relevant to x86 microVM construction. Capabilities marked as gating an ioctl must return a positive value before that ioctl is issued; issuing the ioctl when the capability is absent returns ENOTTY or EINVAL depending on the kernel version.

Capability Value Gates or meaning
KVM_CAP_IRQCHIP 0 KVM_CREATE_IRQCHIP, KVM_IRQ_LINE
KVM_CAP_HLT 1 HLT trap generates KVM_EXIT_HLT
KVM_CAP_USER_MEMORY 3 KVM_SET_USER_MEMORY_REGION
KVM_CAP_SET_TSS_ADDR 4 KVM_SET_TSS_ADDR
KVM_CAP_EXT_CPUID 7 KVM_GET_SUPPORTED_CPUID on system fd
KVM_CAP_NR_VCPUS 9 Recommended (soft) max vCPU count; assume 4 if absent
KVM_CAP_COALESCED_MMIO 15 Coalesced MMIO ring in vCPU mmap region
KVM_CAP_IRQFD 32 KVM_IRQFD
KVM_CAP_IOEVENTFD 36 KVM_IOEVENTFD
KVM_CAP_SET_IDENTITY_MAP_ADDR 37 KVM_SET_IDENTITY_MAP_ADDR
KVM_CAP_ADJUST_CLOCK 39 KVM_GET_CLOCK / KVM_SET_CLOCK
KVM_CAP_INTERNAL_ERROR_DATA 40 Extended data in kvm_run.internal on KVM_EXIT_INTERNAL_ERROR
KVM_CAP_XSAVE 55 KVM_GET_XSAVE / KVM_SET_XSAVE
KVM_CAP_XCRS 56 KVM_GET_XCRS / KVM_SET_XCRS
KVM_CAP_TSC_CONTROL 60 Per-vCPU TSC frequency scaling
KVM_CAP_GET_TSC_KHZ 61 KVM_GET_TSC_KHZ vCPU ioctl
KVM_CAP_MAX_VCPUS 66 Hard limit on vCPU count per VM
KVM_CAP_SPLIT_IRQCHIP 121 Split LAPIC / IOAPIC configuration
KVM_CAP_IMMEDIATE_EXIT 136 immediate_exit field in struct kvm_run

Three capabilities referenced earlier in this appendix (KVM_CAP_READONLY_MEM, KVM_CAP_NR_MEMSLOTS, KVM_CAP_MAX_VCPU_ID) are omitted from the table. Consult include/uapi/linux/kvm.h directly for their numeric assignments.


rust-vmm Wrappers

Firecracker does not call ioctl(2) directly. It uses two crates from the rust-vmm project:

kvm-ioctls wraps the three fd types as Kvm (system fd), VmFd (VM fd), and VcpuFd (vCPU fd). VcpuFd::run() returns a VcpuExit enum whose variants include IoIn, IoOut, MmioRead, MmioWrite, Hlt, X86Rdmsr, X86Wrmsr, Hypercall, FailEntry, InternalError, and SystemEvent. The crate does not retry KVM_RUN on EINTR; run() returns the error so that the caller's vCPU loop can decide what to do.

kvm-bindings provides #[repr(C)] Rust structs that mirror the C structs in include/uapi/linux/kvm.h and arch/x86/include/uapi/asm/kvm.h. When the kernel header changes struct layout, kvm-bindings is the single point that tracks the change; Firecracker's register-initialization code in src/vmm/src/arch/x86_64/regs.rs then compiles against the updated bindings.

A KVM_SET_SREGS call in Firecracker looks like vcpu.set_sregs(&sregs)? rather than ioctl(vcpufd, 0x4138AE84, &sregs). Errors from the underlying ioctl, including an EINTR return from KVM_RUN, reach Firecracker's vCPU thread loop as Err values from kvm-ioctls.


Sources And Further Reading