Chapter 8: VM Exits Up Close
Every device access a guest makes — reading a byte from a UART register, writing a descriptor to a virtio queue notifier, asking CPUID what the CPU is — requires the hardware to stop the guest cold and hand control to the VMM. This is the VM exit, and it is both the mechanism that makes the whole architecture possible and the thing that every VMM spends the most effort avoiding.
What The Hardware Does
On Intel, the exit transfers the CPU from VMX non-root operation to VMX root operation. At the hardware level, this means the processor writes the cause to a 32-bit field in the VMCS called VM_EXIT_REASON. Bits 15:0 carry the basic exit reason — a numeric code identifying what caused the exit. Bit 31 is set only on VM-entry failures, not on exits from a running guest. Bits 28 and 29 carry special meanings for the Monitor Trap Flag and exits from VMX-root mode; they are zero in the common case.
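The field layout just described can be unpacked with a few shifts and masks. The following is a minimal sketch, not KVM's code: the struct and function names are hypothetical, and only the bits named in the text are decoded.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decoded view of the 32-bit VM_EXIT_REASON field (illustrative names). */
struct exit_reason_fields {
    uint16_t basic;         /* bits 15:0, the basic exit reason */
    bool     pending_mtf;   /* bit 28, Monitor Trap Flag pending */
    bool     from_vmx_root; /* bit 29, exit from VMX root operation */
    bool     entry_failure; /* bit 31, VM-entry failure */
};

static struct exit_reason_fields decode_exit_reason(uint32_t raw)
{
    struct exit_reason_fields f;

    f.basic         = (uint16_t)(raw & 0xffffu);
    f.pending_mtf   = (raw >> 28) & 1u;
    f.from_vmx_root = (raw >> 29) & 1u;
    f.entry_failure = (raw >> 31) & 1u;
    return f;
}
```

A raw value of 10 decodes to basic reason 10 (CPUID) with every flag clear, which is the common case for an exit from a running guest.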
The VMCS VM_EXIT_INSTRUCTION_LEN field holds the byte length of the instruction that caused the exit — 2 for CPUID, for instance. The EXIT_QUALIFICATION field carries per-reason detail. For a PIO exit, EXIT_QUALIFICATION encodes the port number, access size, direction, and whether it was a string instruction (INS/OUTS). For an EPT violation, it encodes the access type. This detail is what KVM reads in its exit handler before deciding what to do.
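For the PIO case, the decode is mechanical. This sketch assumes the I/O-instruction exit qualification layout given in the Intel SDM and in the haxm header listed under Sources (bits 2:0 hold the access size minus one, bit 3 the direction, bit 4 the string flag, bit 5 the REP flag, bits 31:16 the port). The struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

struct pio_qual {
    uint8_t  size;      /* access width in bytes: 1, 2, or 4 */
    bool     is_in;     /* true = IN, false = OUT */
    bool     is_string; /* INS/OUTS */
    bool     is_rep;    /* REP prefix present */
    uint16_t port;
};

static struct pio_qual decode_pio_qualification(uint64_t q)
{
    struct pio_qual p;

    p.size      = (uint8_t)((q & 0x7u) + 1u); /* field stores size - 1 */
    p.is_in     = (q >> 3) & 1u;
    p.is_string = (q >> 4) & 1u;
    p.is_rep    = (q >> 5) & 1u;
    p.port      = (uint16_t)((q >> 16) & 0xffffu);
    return p;
}
```

A one-byte OUT to port 0x3f8 would arrive as 0x03f80000: size field 0, direction bit clear, port in the upper half.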
On AMD, the equivalent structure is the VMCB, and the exit writes a 64-bit exit code to the EXITCODE field, with additional context in EXITINFO1 and EXITINFO2. The numeric values differ (PIO exits are SVM_EXIT_IOIO = 0x07b, CPUID is SVM_EXIT_CPUID = 0x072, MSR access is SVM_EXIT_MSR = 0x07c, triple fault is SVM_EXIT_SHUTDOWN = 0x07f), but the conceptual roles are identical. For an IOIO exit, EXITINFO1 holds a bitfield: bit 0 is direction (0=OUT, 1=IN), bit 2 marks a string instruction, bit 3 marks a REP prefix, bits 6:4 are one-hot access-size flags (1, 2, or 4 bytes), and bits 31:16 hold the port number. KVM's io_interception() handler in arch/x86/kvm/svm/svm.c extracts these fields and, for non-string accesses, hands them to the same common PIO code (kvm_fast_pio()) that the VMX path uses, so both architectures surface to userspace through an identical kvm_run.io struct.
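The EXITINFO1 layout for an IOIO exit can be decoded the same way. This is a sketch with hypothetical names, covering only the bits listed above. Because the three size bits are one-hot, the shifted field reads directly as a byte count.

```c
#include <stdbool.h>
#include <stdint.h>

struct svm_ioio {
    bool     is_in;     /* bit 0: 0 = OUT, 1 = IN */
    bool     is_string; /* bit 2 */
    bool     is_rep;    /* bit 3 */
    uint8_t  size;      /* bits 6:4, one-hot, reads as 1, 2, or 4 */
    uint16_t port;      /* bits 31:16 */
};

static struct svm_ioio decode_svm_ioio(uint64_t exitinfo1)
{
    struct svm_ioio s;

    s.is_in     = exitinfo1 & 1u;
    s.is_string = (exitinfo1 >> 2) & 1u;
    s.is_rep    = (exitinfo1 >> 3) & 1u;
    s.size      = (uint8_t)((exitinfo1 >> 4) & 0x7u);
    s.port      = (uint16_t)((exitinfo1 >> 16) & 0xffffu);
    return s;
}
```

Note the contrast with the Intel encoding, where the size field stores the width minus one.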
A detail that surprises people who expect the hardware to be comprehensive: neither VMX nor SVM saves the full set of general-purpose registers on exit. The hypervisor's exit stub must save them before using them. VMX saves CR0, CR3, CR4, RSP, RIP, RFLAGS, segment descriptors, GDTR, IDTR, TR, and a configurable set of MSRs into the VMCS, and SVM additionally saves RAX into the VMCB, but the remaining registers through R15 are the hypervisor's responsibility. On VMX, the host's control and segment state is then loaded from the VMCS host-state area; the host's own general-purpose registers come back from wherever the entry stub stashed them.
How KVM Routes Exits
KVM's kvm_arch_vcpu_ioctl_run() in arch/x86/kvm/x86.c calls vcpu_enter_guest(), which issues either VMLAUNCH or VMRESUME. When the guest exits, KVM reads the exit reason and dispatches to a per-reason handler. The return convention of that handler determines what happens next: a return value of 1 means KVM handled it in-kernel and re-enters the guest immediately; a return value of 0 means KVM fills kvm_run.exit_reason and returns from the KVM_RUN ioctl to userspace. The VMM process never wakes up for a return-1 exit.
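The return convention can be modelled in a few lines. This is a toy replay, not KVM's dispatcher: the handler functions and the scripted exit list are hypothetical, and only the Intel basic exit reasons 10 (CPUID) and 30 (I/O instruction) are wired up.

```c
#include <stddef.h>
#include <stdint.h>

#define EXIT_REASON_CPUID          10 /* Intel basic exit reasons */
#define EXIT_REASON_IO_INSTRUCTION 30
#define KVM_EXIT_IO                 2 /* uAPI exit reason */

typedef int (*exit_handler)(uint32_t *uapi_exit_reason);

static int handle_cpuid(uint32_t *uapi_exit_reason)
{
    (void)uapi_exit_reason; /* answered in-kernel */
    return 1;               /* 1: re-enter the guest */
}

static int handle_io(uint32_t *uapi_exit_reason)
{
    *uapi_exit_reason = KVM_EXIT_IO; /* no in-kernel owner */
    return 0;                        /* 0: return from KVM_RUN */
}

static exit_handler lookup_handler(uint16_t basic_reason)
{
    switch (basic_reason) {
    case EXIT_REASON_CPUID:          return handle_cpuid;
    case EXIT_REASON_IO_INSTRUCTION: return handle_io;
    default:                         return NULL;
    }
}

/* Replays a scripted list of hardware exits and returns how many were
 * absorbed in-kernel before one had to be reported to userspace. */
static size_t run_until_userspace(const uint16_t *exits, size_t n,
                                  uint32_t *uapi_exit_reason)
{
    size_t absorbed = 0;

    for (size_t i = 0; i < n; i++) {
        exit_handler h = lookup_handler(exits[i]);

        if (h == NULL || h(uapi_exit_reason) == 0)
            break;
        absorbed++;
    }
    return absorbed;
}
```

Replaying the sequence CPUID, CPUID, I/O absorbs two exits and then stops with KVM_EXIT_IO, which is the only point at which the VMM process would wake.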
Several exit categories never reach userspace. EPT violations on RAM — a guest touching a mapped guest-physical address whose EPT entry does not yet exist — are resolved by kvm_mmu_page_fault(), which installs the mapping and resumes. The VMM never sees these; they are the hardware page-fault mechanism working normally.
CPUID causes an unconditional VMX exit (basic exit reason 10). KVM handles it entirely in-kernel by consulting the per-vCPU table set by KVM_SET_CPUID2. There is no KVM_EXIT_CPUID in the KVM uAPI.
MSR accesses are governed by the MSR bitmap, a 4 KB page referenced from the VMCS that covers MSR addresses 0x0–0x1fff and 0xc0000000–0xc0001fff, with separate read and write bits for each MSR. If an MSR's bit is clear in the bitmap, the guest reads or writes it without a VM exit at all. An access exits when its bit is set or when the MSR lies outside both ranges. For those exits, KVM's kvm_emulate_rdmsr() and kvm_emulate_wrmsr() handle the MSRs KVM owns in-kernel: IA32_TSC, IA32_APIC_BASE, the KVM paravirt MSR range 0x4b564d00–0x4b564d08, and many others.
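The bitmap lookup itself is simple enough to sketch. The quadrant offsets below follow the Intel SDM layout (read-low, read-high, write-low, write-high, 1 KB each); the function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define MSR_BITMAP_SIZE 4096

/* Byte offsets of the four 1 KB quadrants inside the bitmap page. */
enum {
    READ_LOW   = 0x000, /* reads of  0x00000000-0x00001fff */
    READ_HIGH  = 0x400, /* reads of  0xc0000000-0xc0001fff */
    WRITE_LOW  = 0x800, /* writes to 0x00000000-0x00001fff */
    WRITE_HIGH = 0xc00  /* writes to 0xc0000000-0xc0001fff */
};

/* Returns true if the access would cause a VM exit. */
static bool msr_access_exits(const uint8_t bitmap[MSR_BITMAP_SIZE],
                             uint32_t msr, bool is_write)
{
    uint32_t base, index;

    if (msr <= 0x1fffu) {
        base  = is_write ? WRITE_LOW : READ_LOW;
        index = msr;
    } else if (msr >= 0xc0000000u && msr <= 0xc0001fffu) {
        base  = is_write ? WRITE_HIGH : READ_HIGH;
        index = msr - 0xc0000000u;
    } else {
        return true; /* outside both ranges: always exits */
    }
    return (bitmap[base + index / 8] >> (index % 8)) & 1u;
}
```

The paravirt range at 0x4b564d00 falls through to the final branch, so accesses to it exit regardless of what the bitmap contains.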
HLT, when the "HLT exiting" bit (bit 7 of the processor-based VM-execution controls) is set, sends the vCPU thread to kvm_vcpu_block() in virt/kvm/kvm_main.c. With halt polling active, the thread spins for up to halt_poll_ns nanoseconds checking for a pending wakeup; an interrupt arriving in that window resumes the guest with no scheduler round-trip.
In-kernel irqchip accesses — when KVM_CREATE_IRQCHIP and KVM_CREATE_PIT2 are used — route APIC-register accesses and interrupt-controller MMIO through the KVM virtual device layer via kvm_io_bus, never surfacing to userspace.
The exits that do reach userspace are defined by constants in include/uapi/linux/kvm.h, part of the stable KVM ABI. The x86-relevant subset:
| Value | Constant | When it fires |
|---|---|---|
| 0 | KVM_EXIT_UNKNOWN | Hardware exit reason KVM did not recognize |
| 2 | KVM_EXIT_IO | PIO IN/OUT to a port with no in-kernel owner |
| 5 | KVM_EXIT_HLT | Guest HLT when not handled in-kernel |
| 6 | KVM_EXIT_MMIO | MMIO to a GPA not backed by RAM or an in-kernel device |
| 8 | KVM_EXIT_SHUTDOWN | Triple fault |
| 9 | KVM_EXIT_FAIL_ENTRY | Hardware refused VM entry |
| 17 | KVM_EXIT_INTERNAL_ERROR | KVM subsystem error |
| 29 | KVM_EXIT_X86_RDMSR | RDMSR delegated to userspace (requires KVM_CAP_X86_USER_SPACE_MSR) |
| 30 | KVM_EXIT_X86_WRMSR | WRMSR delegated to userspace (requires KVM_CAP_X86_USER_SPACE_MSR) |
The kvm_run Structure
The VMM maps the struct kvm_run page once, before the first KVM_RUN:
int mmap_size = ioctl(kvm_fd, KVM_GET_VCPU_MMAP_SIZE, 0);
struct kvm_run *run = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
                           MAP_SHARED, vcpu_fd, 0);
This page persists across KVM_RUN calls. The VMM reads it after each ioctl returns. The abbreviated layout, from include/uapi/linux/kvm.h:
struct kvm_run {
    /* VMM writes before KVM_RUN */
    __u8  request_interrupt_window;  /* exit when guest IF opens */
    __u8  immediate_exit;            /* nonzero: KVM_RUN returns at once with EINTR */
    __u8  padding1[6];

    /* KVM writes after exit */
    __u32 exit_reason;               /* KVM_EXIT_* */
    __u8  ready_for_interrupt_injection;
    __u8  if_flag;
    __u16 flags;

    /* in/out */
    __u64 cr8;
    __u64 apic_base;

    union {
        struct {                     /* KVM_EXIT_IO */
            __u8  direction;         /* 0 = IN, 1 = OUT */
            __u8  size;              /* 1, 2, or 4 bytes */
            __u16 port;
            __u32 count;             /* repetition count for REP INS/OUTS */
            __u64 data_offset;       /* offset from start of kvm_run to data */
        } io;
        struct {                     /* KVM_EXIT_MMIO */
            __u64 phys_addr;         /* guest physical address */
            __u8  data[8];           /* fill on read, read on write */
            __u32 len;
            __u8  is_write;
        } mmio;
        struct {                     /* KVM_EXIT_X86_RDMSR / KVM_EXIT_X86_WRMSR */
            __u8  error;             /* set by the VMM: 0 = ok, 1 = inject #GP */
            __u8  pad[7];
            __u32 reason;            /* KVM_MSR_EXIT_REASON_* bitmask */
            __u32 index;             /* MSR address (RCX) */
            __u64 data;              /* RDMSR fill / WRMSR value */
        } msr;
        struct {                     /* KVM_EXIT_FAIL_ENTRY */
            __u64 hardware_entry_failure_reason;
            __u32 cpu;
        } fail_entry;
        struct {                     /* KVM_EXIT_INTERNAL_ERROR */
            __u32 suberror;
            __u32 ndata;
            __u64 data[16];
        } internal;
        /* KVM_EXIT_HLT: exit_reason alone is the signal; no union member */
        /* KVM_EXIT_SHUTDOWN: no union member */
        char padding[256];
    };

    __u64 kvm_valid_regs;
    __u64 kvm_dirty_regs;
};
One detail that trips people: the KVM_EXIT_IO data is not stored inside the struct. The io.data_offset field is a byte offset from the start of the kvm_run page. The VMM accesses it as (char *)run + run->io.data_offset. The separation exists because REP string I/O (INS, OUTS) can move more data than fits in eight bytes, and inline storage would overflow other struct fields.
Servicing Each Exit
PIO: KVM_EXIT_IO
When a guest executes an IN or OUT instruction to a port that is not in-kernel owned, KVM fills run->io and returns. The VMM reads run->io.port, run->io.direction, run->io.size, and run->io.count. For an OUT (guest write, direction == 1), the data is already in the buffer at data_offset. For an IN (guest read, direction == 0), the VMM must fill the buffer before returning to KVM_RUN. The canonical pattern, from the LWN KVM tutorial:
case KVM_EXIT_IO:
    if (run->io.direction == KVM_EXIT_IO_OUT && run->io.port == 0x3f8)
        putchar(*((char *)run + run->io.data_offset));
    break;
Port 0x3f8 is the base address of COM1, the first 16550A UART, which is where a Linux guest booted with a serial console (for example console=ttyS0) sends its boot messages. That one line is the complete device model for a serial console in a toy VMM.
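A slightly fuller handler also has to respect size and count, and it has to answer IN accesses. The sketch below uses a trimmed stand-in for the io member of struct kvm_run so that it compiles without the kernel headers. The struct, the function name, and the sink buffer are all hypothetical, and unknown ports are answered with zeros purely as an illustrative policy.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PIO_IN    0     /* mirrors KVM_EXIT_IO_IN */
#define PIO_OUT   1     /* mirrors KVM_EXIT_IO_OUT */
#define COM1_PORT 0x3f8

/* Trimmed stand-in for kvm_run.io (same field order as the uAPI struct). */
struct pio_exit {
    uint8_t  direction;
    uint8_t  size;
    uint16_t port;
    uint32_t count;
    uint64_t data_offset;
};

/* Services one PIO exit. run_page points at the start of the shared page.
 * Bytes written to COM1 are appended to sink; every IN reads as zero.
 * Returns the number of bytes appended to sink. */
static size_t handle_pio(uint8_t *run_page, size_t page_len,
                         const struct pio_exit *io,
                         char *sink, size_t sink_cap)
{
    size_t total = (size_t)io->size * io->count;
    size_t written = 0;
    uint8_t *data;

    if (io->data_offset > page_len || total > page_len - io->data_offset)
        return 0; /* malformed exit: refuse to touch memory */
    data = run_page + io->data_offset;

    if (io->direction == PIO_IN) {
        memset(data, 0, total); /* no device model: guest reads zeros */
        return 0;
    }
    if (io->port == COM1_PORT && io->size == 1) {
        for (size_t i = 0; i < total && written < sink_cap; i++)
            sink[written++] = (char)data[i];
    }
    return written;
}
```

In a real VMM the first argument would be the mmap'd struct kvm_run pointer and the io fields would come from run->io.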
In Firecracker, the dispatch is more structured. In src/vmm/src/arch/x86_64/vcpu.rs, VcpuExit::IoIn(addr, data) and VcpuExit::IoOut(addr, data) are delivered to pio_bus.read() and pio_bus.write(). The pio_bus is a sorted map of address ranges to registered device handlers. If the port has no registered handler, the read data is zero-filled with a warn! log and the exit returns Handled — the guest sees zeros rather than a fault.
MMIO: KVM_EXIT_MMIO
MMIO exits arise when the guest accesses a guest-physical address that is not backed by a RAM slot and has no in-kernel device handler. On Intel, the EPT records the access as a violation (basic exit reason 48); on AMD, a nested page fault (SVM_EXIT_NPF = 0x400) on an address without an NPT entry serves the same role. KVM identifies both as MMIO from the absent mapping and surfaces them as KVM_EXIT_MMIO.
The VMM reads run->mmio.phys_addr, run->mmio.len, and run->mmio.is_write. For a guest write, run->mmio.data already contains the value. For a guest read, the VMM fills run->mmio.data before returning.
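The read and write halves mirror each other, as a toy single-register device shows. Everything here is hypothetical: the base address, the device, and the trimmed stand-in for kvm_run.mmio. The sketch copies bytes as-is, which matches guest byte order only on a little-endian host such as x86.

```c
#include <stdint.h>
#include <string.h>

/* Trimmed stand-in for kvm_run.mmio. */
struct mmio_exit {
    uint64_t phys_addr;
    uint8_t  data[8];
    uint32_t len;
    uint8_t  is_write;
};

#define SCRATCH_REG_ADDR 0xd0000000ull /* hypothetical device register */

struct toy_device {
    uint32_t scratch;
};

/* Returns 1 if the device claimed the access, 0 otherwise.
 * Unclaimed reads are answered with zeros. */
static int handle_mmio(struct toy_device *dev, struct mmio_exit *m)
{
    if (m->phys_addr != SCRATCH_REG_ADDR || m->len != 4) {
        if (!m->is_write)
            memset(m->data, 0, sizeof m->data);
        return 0;
    }
    if (m->is_write)
        memcpy(&dev->scratch, m->data, 4); /* guest write: consume data */
    else
        memcpy(m->data, &dev->scratch, 4); /* guest read: fill data */
    return 1;
}
```

The important asymmetry is timing: for a read, the data must be in place before the next KVM_RUN, because KVM completes the guest's load instruction on re-entry.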
Firecracker handles this in src/vmm/src/vstate/vcpu.rs:
VcpuExit::MmioRead(addr, data) => {
    data.fill(0);
    if let Some(mmio_bus) = &peripherals.mmio_bus {
        mmio_bus.read(addr, data)?;
        METRICS.vcpu.exit_mmio_read.inc();
    }
    Ok(VcpuEmulation::Handled)
}
VcpuExit::MmioWrite(addr, data) => {
    if let Some(mmio_bus) = &peripherals.mmio_bus {
        mmio_bus.write(addr, data)?;
        METRICS.vcpu.exit_mmio_write.inc();
    }
    Ok(VcpuEmulation::Handled)
}
Firecracker maintains two separate Bus structs: mmio_bus for virtio-mmio device registers and pio_bus for legacy PIO devices (the UART, the i8042). Each is a sorted map of address ranges to device handlers, and the dispatch is a binary search. Guest accesses to virtio device registers such as DeviceID and DeviceFeatures come through this path. QueueNotify writes are the exception once they are registered with KVM_IOEVENTFD, as described later in this chapter.
HLT: KVM_EXIT_HLT
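KVM_EXIT_HLT (value 5) carries no data fields; the exit reason is the complete signal. With an in-kernel local APIC (the normal configuration after KVM_CREATE_IRQCHIP), HLT is handled in-kernel by kvm_vcpu_block(), which parks the vCPU thread until a wakeup event arrives. The exit surfaces to userspace only when the in-kernel path cannot handle it, most notably when the VMM emulates the interrupt controller itself and KVM has no way to know when the vCPU should wake.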
Firecracker treats VcpuExit::Hlt as an unhandled exit — it appears as UnhandledKvmExit("Hlt") in the generic handle_kvm_exit() in src/vmm/src/vstate/vcpu.rs. This is deliberate: Firecracker expects guests to use the ACPI power-off path (KVM_EXIT_SYSTEM_EVENT) for orderly shutdown, not HLT. A bare HLT in a Firecracker guest that somehow escapes the in-kernel handler would be treated as an error condition.
CPUID: No Userspace Exit
CPUID exits to the hypervisor on both vendors: unconditionally on Intel (basic exit reason 10), and through an intercept that KVM always enables on AMD (SVM_EXIT_CPUID = 0x072). KVM handles it entirely in-kernel. There is no KVM_EXIT_CPUID in the uAPI. The VMM calls KVM_SET_CPUID2 once per vCPU before the first KVM_RUN to pre-load the CPUID table KVM consults on each exit.
The KVM paravirt signature, placed at synthetic leaf 0x40000000, is the 12-byte string "KVMKVMKVM\0\0\0" spread across EBX, ECX, and EDX in that order. Leaf 0x40000001 carries feature bits in EAX: bit 0 is KVM_FEATURE_CLOCKSOURCE, bit 3 is KVM_FEATURE_CLOCKSOURCE2, bit 5 is KVM_FEATURE_STEAL_TIME, bit 6 is KVM_FEATURE_PV_EOI, and bit 24 is KVM_FEATURE_CLOCKSOURCE_STABLE_BIT. A guest that finds this signature at leaf 0x40000000 knows it is running under KVM and can opt into the paravirt interfaces those bits advertise.
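From the guest's side, detection is a 12-byte comparison followed by bit tests. This sketch takes the register values as arguments rather than executing CPUID, so it runs anywhere. The function names are hypothetical, and the byte order assumes a little-endian host, which any x86 guest is.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Bit positions in EAX of leaf 0x40000001. */
#define KVM_FEATURE_CLOCKSOURCE  0
#define KVM_FEATURE_CLOCKSOURCE2 3
#define KVM_FEATURE_STEAL_TIME   5
#define KVM_FEATURE_PV_EOI       6

/* True if EBX:ECX:EDX from leaf 0x40000000 spell the KVM signature. */
static bool is_kvm_signature(uint32_t ebx, uint32_t ecx, uint32_t edx)
{
    char sig[12];

    memcpy(sig + 0, &ebx, 4);
    memcpy(sig + 4, &ecx, 4);
    memcpy(sig + 8, &edx, 4);
    return memcmp(sig, "KVMKVMKVM\0\0\0", 12) == 0;
}

/* Tests one feature bit in EAX of leaf 0x40000001. */
static bool kvm_has_feature(uint32_t eax_features, unsigned bit)
{
    return (eax_features >> bit) & 1u;
}
```

The matching register values are EBX = 0x4b4d564b, ECX = 0x564b4d56, and EDX = 0x0000004d.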
Firecracker does not simply pass through the host CPUID to the guest. It applies a CpuidLeafModifier template — a set of bitmask operations per leaf — and then runs a normalization pass in cpuid/normalize.rs that enforces correct topology in leaf 0x1 (the APIC ID and HTT bit), fills leaf 0xB extended topology, and clears or sets specific feature bits for reproducibility across hosts. The goal is a guest that sees the same CPU regardless of which physical host it lands on — essential when snapshots may be restored on different hardware.
MSR Exits: KVM_EXIT_X86_RDMSR and KVM_EXIT_X86_WRMSR
Most MSR accesses never reach userspace. The MSR bitmap suppresses exits entirely for MSRs that KVM passes through to hardware, and KVM emulates in-kernel the MSRs it owns: IA32_TSC, IA32_APIC_BASE, the KVM PV MSR range, and dozens of others. For an access that exits and that KVM does not recognize, KVM normally injects #GP into the guest.
Delegating such accesses to userspace requires enabling KVM_CAP_X86_USER_SPACE_MSR (capability 188), optionally combined with a filter installed through KVM_X86_SET_MSR_FILTER (advertised by KVM_CAP_X86_MSR_FILTER, capability 189). The capability is enabled with a mask of KVM_MSR_EXIT_REASON_* bits that selects which cases are forwarded; matching accesses then produce KVM_EXIT_X86_RDMSR (29) or KVM_EXIT_X86_WRMSR (30). The VMM reads run->msr.index (the MSR address from RCX), fills run->msr.data for a read or consumes it for a write, and sets run->msr.error to 0 for success or 1 to inject #GP into the guest.
The KVM_MSR_EXIT_REASON_* flags describe why the exit occurred:
KVM_MSR_EXIT_REASON_INVAL (1 << 0) /* invalid MSR or reserved bits */
KVM_MSR_EXIT_REASON_UNKNOWN (1 << 1) /* KVM has no handler for this MSR */
KVM_MSR_EXIT_REASON_FILTER (1 << 2) /* blocked by KVM_X86_SET_MSR_FILTER */
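The userspace side of an MSR exit can be sketched as follows. The MSR index, the shadow variable, and the trimmed stand-in for kvm_run.msr are hypothetical; a real VMM would take the exit reason from run->exit_reason and operate on run->msr directly.

```c
#include <stdint.h>

#define EXIT_RDMSR 29 /* mirrors KVM_EXIT_X86_RDMSR */
#define EXIT_WRMSR 30 /* mirrors KVM_EXIT_X86_WRMSR */

#define TOY_MSR_INDEX 0xc0011234u /* hypothetical MSR emulated in userspace */

/* Trimmed stand-in for kvm_run.msr (same field order as the uAPI struct). */
struct msr_exit {
    uint8_t  error;  /* VMM to KVM: 0 = ok, 1 = inject #GP */
    uint8_t  pad[7];
    uint32_t reason; /* KVM to VMM: KVM_MSR_EXIT_REASON_* */
    uint32_t index;  /* KVM to VMM: MSR address */
    uint64_t data;   /* value in either direction */
};

/* Emulates one MSR backed by *shadow; any other index gets #GP. */
static void handle_msr_exit(uint32_t exit_reason, struct msr_exit *m,
                            uint64_t *shadow)
{
    if (m->index != TOY_MSR_INDEX) {
        m->error = 1; /* unknown MSR: ask KVM to inject #GP */
        return;
    }
    if (exit_reason == EXIT_RDMSR)
        m->data = *shadow; /* fill the value the guest will read */
    else
        *shadow = m->data; /* consume the value the guest wrote */
    m->error = 0;
}
```

Setting error to 1 for anything unrecognized preserves the behavior the guest would have seen without the capability.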
Firecracker does not use KVM_CAP_X86_USER_SPACE_MSR and does not handle KVM_EXIT_X86_RDMSR or KVM_EXIT_X86_WRMSR in its run loop. MSRs are configured at boot via KVM_SET_MSRS using CPU template RegisterValueFilter entries, and KVM handles them from there.
The KVM paravirt MSR range 0x4b564d00–0x4b564d08 is handled in-kernel. The addresses and their purposes:
| MSR address | Purpose |
|---|---|
| 0x4b564d00 | MSR_KVM_WALL_CLOCK_NEW: guest wallclock GPA |
| 0x4b564d01 | MSR_KVM_SYSTEM_TIME_NEW: per-vCPU monotonic time GPA |
| 0x4b564d02 | MSR_KVM_ASYNC_PF_EN: async page fault control |
| 0x4b564d03 | MSR_KVM_STEAL_TIME: vCPU steal-time GPA |
| 0x4b564d04 | MSR_KVM_PV_EOI_EN: PV EOI control |
| 0x4b564d05 | MSR_KVM_POLL_CONTROL: host halt-polling enable/disable |
| 0x4b564d06 | MSR_KVM_ASYNC_PF_INT: APF "page ready" interrupt vector |
| 0x4b564d07 | MSR_KVM_ASYNC_PF_ACK: APF acknowledgement |
| 0x4b564d08 | MSR_KVM_MIGRATION_CONTROL: live migration permission |
The older addresses 0x11 (MSR_KVM_WALL_CLOCK) and 0x12 (MSR_KVM_SYSTEM_TIME) are deprecated; guests should use the 0x4b564d00 variants when bit 3 of leaf 0x40000001 is set.
Shutdown: KVM_EXIT_SHUTDOWN
A guest triple fault delivers KVM_EXIT_SHUTDOWN (value 8). No fields in the kvm_run union carry detail; the exit reason alone is the signal. A triple fault occurs when the CPU faults while trying to deliver a double fault (#DF), typically because the original fault had no valid IDT handler and the double-fault handler could not be reached either. The hardware has nowhere to go. On AMD, this is SVM_EXIT_SHUTDOWN = 0x07f.
The VMM must not simply call KVM_RUN again: after a shutdown exit the vCPU is not in a state the guest can continue from. The VMM either tears the VM down or resets it. Firecracker tears down the vCPU and reports the event. In a properly functioning guest, shutdown flows through ACPI: the guest writes a power-off request, the VMM receives KVM_EXIT_SYSTEM_EVENT, and the vCPU loop terminates cleanly. A triple fault is an aberration: a guest kernel crash or a bug that cascades past all fault handlers.
Internal Error: KVM_EXIT_INTERNAL_ERROR
KVM_EXIT_INTERNAL_ERROR (value 17) signals that KVM itself encountered a problem. The run->internal.suberror field carries a sub-code. With KVM_CAP_EXIT_ON_EMULATION_FAILURE (capability 204) enabled, instruction emulation failures produce suberror KVM_INTERNAL_ERROR_EMULATION; the detailed sub-struct then contains the VM exit reason, exit_info1 and exit_info2, and up to 15 bytes of the failing instruction. This is the right way to handle emulation failures: the VMM can log the instruction bytes and exit cleanly rather than guessing what went wrong.
Without that capability, KVM injects #UD (invalid opcode) silently. This is a hard bug to diagnose — the guest crashes on a valid instruction with no indication from KVM that emulation failed.
The Canonical Run Loop
The exit dispatch structure is the same across all VMMs. Here is the loop stripped to its essentials:
/* needs <err.h>, <errno.h>, <stdlib.h>, <sys/ioctl.h>, <linux/kvm.h> */
for (;;) {
    if (ioctl(vcpu_fd, KVM_RUN, 0) < 0) {   /* 0xAE80 = _IO(0xAE, 0x80) */
        if (errno == EINTR)
            continue;                       /* interrupted by a signal: retry */
        err(1, "KVM_RUN");
    }
    switch (run->exit_reason) {
    case KVM_EXIT_HLT:
        return 0;
    case KVM_EXIT_IO:
        /* direction, size, port, count: run->io.* */
        /* data: (char *)run + run->io.data_offset */
        break;
    case KVM_EXIT_MMIO:
        /* phys_addr, len, is_write: run->mmio.* */
        /* for reads: fill run->mmio.data before returning */
        break;
    case KVM_EXIT_SHUTDOWN:
        abort();
    case KVM_EXIT_INTERNAL_ERROR:
        /* run->internal.suberror for detail */
        abort();
    default:
        errx(1, "unhandled exit_reason %u", run->exit_reason);
    }
}
The KVM_RUN ioctl encodes as _IO(KVMIO, 0x80) where KVMIO = 0xAE, giving the raw value 0xAE80. It takes no parameter.
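The encoding can be reproduced by hand. This sketch assumes the generic Linux _IOC layout used on x86 (number in bits 7:0, type in bits 15:8, size in bits 29:16, direction in bits 31:30); a few architectures lay out the size and direction bits differently. The function names are hypothetical stand-ins for the kernel macros.

```c
#include <stdint.h>

#define KVMIO 0xAEu

/* Equivalent of _IOC(dir, type, nr, size) under the generic layout. */
static uint32_t ioc_encode(uint32_t dir, uint32_t type, uint32_t nr,
                           uint32_t size)
{
    return (dir << 30) | (size << 16) | (type << 8) | nr;
}

/* Equivalent of _IO(type, nr): no direction, no argument size. */
static uint32_t io_encode(uint32_t type, uint32_t nr)
{
    return ioc_encode(0, type, nr, 0);
}
```

io_encode(KVMIO, 0x80) yields 0xAE80, the request number that shows up for KVM_RUN in an strace of any VMM.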
What An Exit Costs
No Intel or AMD specification publishes a cycle count for VM exit and re-entry. All available measurements are empirical, and they vary substantially across microarchitectures and workloads. VMXbench (utshina) measured roughly 330 cycles for the exit and 294 cycles for the entry on Skylake-era hardware, using a VMCALL-triggered exit and immediate VMRESUME. Measurements on Sandy Bridge-E (Core i7-3960X) with a custom KVM patch found round-trip latencies of roughly 1,600–1,700 cycles at the 95th percentile with pinned vCPUs. The ACRN project documentation characterizes ICR MSR accesses as approximately 3–4 microseconds.
The variance is not surprising when you consider what the hardware must do. The exit writes guest CR0, CR3, CR4, RSP, RIP, RFLAGS, segment descriptors, and any MSRs in the VM-exit MSR-store area into the VMCS, then loads the host values from the VMCS host-state area. Without VPID, the transition also flushes TLB entries; with VPID enabled, entries are tagged and survive the round trip. The exit returns the CPU to VMX-root code, KVM's exit handler, which was likely evicted from L1 and L2 while the guest was running. That cold-cache penalty is real and workload-dependent.
The Spectre v2 mitigations added another cost: with retpolines enabled, indirect calls through KVM's vmx_exit_handlers[] function-pointer table each take the retpoline penalty. Andrea Arcangeli's "KVM Monolithic" restructuring (around 2019), merged by KVM maintainer Paolo Bonzini, reduced this by eliminating indirect dispatch across the kvm.ko/kvm-intel.ko module boundary, yielding double-digit percentage improvement on CPUID-heavy benchmarks with default mitigations enabled. The QEMU team documented this work in a 2019 blog post titled "Micro-Optimizing KVM VM-Exits."
At 300–1,700 cycles per round-trip, one exit per microsecond (10⁶ exits/s) burns 3×10⁸ to 1.7×10⁹ cycles every second. On a 3 GHz core (an illustrative clock rate), that is roughly 10–57% of the core in exit-and-entry overhead alone. The actual cost is higher because device emulation runs in that window. Even a VMM that handles 100 exits per guest millisecond (10⁵ exits/s) is spending roughly 1–6% of a core on overhead, before a single byte of device emulation executes.
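The overhead fraction is exits per second times cycles per exit, divided by the core clock. A small helper makes it easy to plug in other numbers. The 3 GHz clock used in the note below is an assumption for illustration, not a figure from the measurements cited above, and the function name is hypothetical.

```c
/* Fraction of one core spent purely on exit and re-entry. */
static double exit_overhead_fraction(double exits_per_sec,
                                     double cycles_per_round_trip,
                                     double core_hz)
{
    return exits_per_sec * cycles_per_round_trip / core_hz;
}
```

At 10⁶ exits/s on a 3 GHz core this gives 0.10 for a 300-cycle round-trip and about 0.57 for a 1,700-cycle round-trip. At 10⁵ exits/s the same bounds are 0.01 and about 0.057.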
Intel APICv (controlled by the "virtual-interrupt delivery" and "APIC-register virtualization" VM-execution controls, widely cited as introduced around the Ivy Bridge-EP / Haswell era) eliminates a large class of exits entirely. Historically, guest kernel LAPIC MMIO accesses and x2APIC WRMSR instructions were among the highest-frequency exit categories; APICv allows the processor to deliver virtual interrupts and update the virtual APIC page without exiting to VMX root mode at all. AMD AVIC provides the equivalent on AMD hardware.
Eliminating Exits: ioeventfd and irqfd
The two highest-frequency exit categories in a typical Firecracker microVM are virtio queue notifications (MMIO writes to the QueueNotify register at virtio MMIO offset 0x050) and device interrupt injection. Both can be made zero-exit.
KVM_IOEVENTFD (capability KVM_CAP_IOEVENTFD) registers an eventfd file descriptor against a guest MMIO or PIO address. When the guest writes to that address, KVM signals the eventfd and resumes the guest — the KVM_RUN ioctl never returns to userspace for that write. The registration struct:
struct kvm_ioeventfd {
    __u64 datamatch;   /* optional: match only this write value */
    __u64 addr;        /* MMIO or PIO address */
    __u32 len;         /* access width: 1, 2, 4, or 8 bytes */
    __s32 fd;          /* eventfd to signal */
    __u32 flags;
    /* KVM_IOEVENTFD_FLAG_PIO       -- PIO (default: MMIO) */
    /* KVM_IOEVENTFD_FLAG_DATAMATCH -- filter by datamatch value */
    /* KVM_IOEVENTFD_FLAG_DEASSIGN  -- remove this registration */
    __u8  pad[36];     /* pads the struct to 64 bytes */
};
Note:
KVM_IOEVENTFD and KVM_IRQFD are VM-level ioctls on the VM fd. They are typically called during device setup, before the guest has a chance to write to the target address; they do not require the vCPU threads to be stopped.
Firecracker registers each virtio device's queue notify address via KVM_IOEVENTFD during device initialization. A guest write to QueueNotify signals the eventfd in-kernel — no exit ever reaches Firecracker's vCPU thread for the virtio data-plane hot path. Unregistered MMIO addresses still produce KVM_EXIT_MMIO.
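Filling in the registration for one virtio queue might look like the sketch below. The struct is a local mirror of struct kvm_ioeventfd so the example compiles without kernel headers. The helper name, the 4-byte access width, and the use of datamatch to select a queue index are assumptions for illustration rather than a statement of what Firecracker does. A real VMM would pass the genuine struct to ioctl(vm_fd, KVM_IOEVENTFD, &req).

```c
#include <stdint.h>
#include <string.h>

#define IOEVENTFD_FLAG_DATAMATCH (1u << 0) /* mirrors KVM_IOEVENTFD_FLAG_DATAMATCH */
#define IOEVENTFD_FLAG_PIO       (1u << 1) /* mirrors KVM_IOEVENTFD_FLAG_PIO */
#define IOEVENTFD_FLAG_DEASSIGN  (1u << 2) /* mirrors KVM_IOEVENTFD_FLAG_DEASSIGN */

#define VIRTIO_MMIO_QUEUE_NOTIFY 0x050u

/* Local mirror of struct kvm_ioeventfd (64 bytes). */
struct ioeventfd_req {
    uint64_t datamatch;
    uint64_t addr;
    uint32_t len;
    int32_t  fd;
    uint32_t flags;
    uint8_t  pad[36];
};

/* Builds a request that fires event_fd when the guest writes queue_index
 * to the QueueNotify register of the device at mmio_base. */
static struct ioeventfd_req queue_notify_req(uint64_t mmio_base,
                                             uint32_t queue_index,
                                             int32_t event_fd)
{
    struct ioeventfd_req r;

    memset(&r, 0, sizeof r);       /* pad must be zero */
    r.addr      = mmio_base + VIRTIO_MMIO_QUEUE_NOTIFY;
    r.len       = 4;               /* 32-bit register write */
    r.fd        = event_fd;
    r.datamatch = queue_index;     /* one eventfd per queue */
    r.flags     = IOEVENTFD_FLAG_DATAMATCH; /* MMIO is the default */
    return r;
}
```

With datamatch in use, a device with several queues can register one eventfd per queue on the same address, and KVM signals only the one whose value matches the write.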
KVM_IRQFD (capability KVM_CAP_IRQFD) registers an eventfd paired with a guest GSI routing entry. When the host signals the eventfd — a device backend completing a DMA transfer, a network packet arriving — KVM injects the corresponding virtual interrupt into the guest without any userspace round-trip.
Together, ioeventfd and irqfd take the vCPU thread out of the virtio data plane. The guest writes the queue notifier (delivered in-kernel via ioeventfd), the device backend wakes on the eventfd and processes buffers, and it signals the guest via irqfd injection. The backend is a separate thread in the VMM process, or a vhost thread in the kernel for VMMs that use vhost; either way, no vCPU run-loop code runs in the steady state. This is why Firecracker can sustain high-throughput network and disk I/O at low vCPU overhead: the hot path does not touch the vCPU run loop.
Halt Polling
When a guest HLTs, kvm_vcpu_block() in virt/kvm/kvm_main.c does not immediately yield the vCPU thread to the Linux scheduler. It spins first, for up to halt_poll_ns nanoseconds, checking for a pending wakeup event. If an interrupt arrives during that window — the common case in a guest waiting for a virtio response — the guest resumes without the cost of a scheduler round-trip, which itself runs several microseconds.
The polling interval adapts dynamically, per vCPU. When the wakeup arrives inside the polling window, the window is left unchanged. When the poll fails but the vCPU's total block time was still shorter than halt_poll_ns, the window grows by a factor of halt_poll_ns_grow (default: 2), starting from halt_poll_ns_grow_start (default: 10,000 ns). When the block time exceeded halt_poll_ns, polling could never have succeeded, so the window shrinks: it is divided by halt_poll_ns_shrink, or reset to zero when that parameter is 0 (the default). The ceiling is halt_poll_ns. Check /sys/module/kvm/parameters/halt_poll_ns on the target host to see the current ceiling, since the default varies across kernel versions and architectures.
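The adaptation rule can be simulated in isolation. This is a simplified model based on the kernel's halt-polling documentation, not a copy of the kernel code: the struct and function names are hypothetical, and details such as invalid wakeups are left out.

```c
#include <stdint.h>

struct halt_poll_params {
    uint64_t max_ns;     /* halt_poll_ns, the ceiling */
    uint64_t grow;       /* halt_poll_ns_grow */
    uint64_t grow_start; /* halt_poll_ns_grow_start */
    uint64_t shrink;     /* halt_poll_ns_shrink, 0 = reset to zero */
};

/* Given the current window and how long the vCPU was actually blocked,
 * returns the window to use for the next halt. */
static uint64_t next_poll_window(const struct halt_poll_params *p,
                                 uint64_t window_ns, uint64_t block_ns)
{
    uint64_t v;

    if (p->max_ns == 0)
        return 0;                   /* polling disabled */
    if (block_ns <= window_ns)
        return window_ns;           /* woke during the poll: keep */
    if (window_ns != 0 && block_ns > p->max_ns) {
        /* long block: polling could not have helped, so shrink */
        v = p->shrink ? window_ns / p->shrink : 0;
        return v < p->grow_start ? 0 : v;
    }
    if (window_ns < p->max_ns && block_ns < p->max_ns) {
        /* short block that outlasted the window: grow */
        if (p->grow == 0)
            return window_ns;
        v = window_ns * p->grow;
        if (v < p->grow_start)
            v = p->grow_start;
        return v > p->max_ns ? p->max_ns : v;
    }
    return window_ns;
}
```

Starting from a window of zero with the parameters {200000, 2, 10000, 0}, a 5,000 ns block raises the window to 10,000 ns, a later 15,000 ns block doubles it to 20,000 ns, and a 500,000 ns block resets it to zero.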
The trade-off is CPU consumption on the host. A guest that halts frequently but wakes quickly costs the host only the brief spin before the interrupt arrives, and in exchange skips the scheduler round-trip; with the window at zero there is no spin at all, but every wakeup pays the scheduler cost. A guest that halts and stays halted for hundreds of microseconds wastes the entire polling window on the host before finally yielding. halt_poll_ns is a knob, not a fixed cost.
Instruction Emulation
Most VM exits are clean: a discrete instruction ran to completion, the exit reason names it, and the VMM can service the exit and resume. But some exits occur mid-instruction, or after an instruction the hardware could not complete. For these, KVM uses the software emulator in arch/x86/kvm/emulate.c.
The emulator:
- Decodes the instruction at GUEST_RIP from guest memory.
- Runs the emulated operation against KVM's model of guest register state.
- Updates guest GPRs, flags, and RIP.
This is not a toy decoder. arch/x86/kvm/emulate.c handles the full x86 instruction encoding — REX prefixes, VEX extensions, operand sizes, memory addressing modes — because any instruction that triggered an exit might need emulation. The emulator is the code path KVM takes when EPT reports a write to a GPA that turns out to be MMIO, and the hardware's partial instruction needs completing in software.
Emulation failures surface as KVM_EXIT_INTERNAL_ERROR with suberror KVM_INTERNAL_ERROR_EMULATION when KVM_CAP_EXIT_ON_EMULATION_FAILURE (capability 204) is enabled. The sub-struct carries the exit reason, exit_info1 and exit_info2, and up to 15 bytes of the failing instruction — enough to identify the problem. Without that capability, KVM injects #UD silently, and the guest crashes on a valid instruction with no diagnostic output from the VMM.
A concrete example of what goes wrong without careful emulation: a bug fixed in Linux ~5.15 (by Hou Wenlong) found that x86_emulate_instruction() did not check vcpu->arch.complete_userspace_io after RDMSR or WRMSR emulation through the software emulator path. The instruction was advanced past (RIP incremented) without the userspace exit being delivered. The guest's MSR read would silently return stale register values. The fix adds a check before advancing RIP.
Observability
The kvm_stat tool at tools/kvm/kvm_stat/kvm_stat in the kernel tree reads per-VM and per-vCPU exit counters from debugfs at /sys/kernel/debug/kvm/ and prints a rolling summary by exit type. Running it against a live Firecracker instance shows which exit types dominate: on a boot-heavy workload, io_instruction and mmio exits; on a network-heavy workload, external_interrupt entries when IRQ injection is in use.
For per-exit profiling, the kvm_exit and kvm_mmio kernel tracepoints feed perf kvm stat, which aggregates exit counts and average latencies by exit type across all VMs on the host. A trace that shows thousands of MMIO exits per second to addresses that should be ioeventfd-registered suggests a device setup problem worth investigating.
Firecracker exposes its own counters. METRICS.vcpu.exit_mmio_read, exit_mmio_write, exit_io_in, and exit_io_out increment in the exit dispatch paths and are surfaced via Firecracker's metrics endpoint. These are the first numbers to check when a Firecracker guest shows unexpectedly high vCPU CPU consumption.
Chapter 9 builds from here — a minimal VMM whose run loop has exactly these case arms, each one earned.
Sources And Further Reading
- Linux KVM uAPI header (kvm.h): exit reason constants (KVM_EXIT_*), struct kvm_run, struct kvm_ioeventfd, capability numbers. https://github.com/torvalds/linux/blob/master/include/uapi/linux/kvm.h
- Linux SVM uAPI header (svm.h): AMD exit codes (SVM_EXIT_IOIO = 0x07b, SVM_EXIT_MSR = 0x07c, SVM_EXIT_SHUTDOWN = 0x07f, SVM_EXIT_NPF = 0x400). https://github.com/torvalds/linux/blob/master/arch/x86/include/uapi/asm/svm.h
- Linux SVM kernel header (asm/svm.h): IOIO bitmask definitions (SVM_IOIO_TYPE_MASK, SVM_IOIO_STR_MASK, SVM_IOIO_SIZE_MASK). https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/svm.h
- KVM API documentation: definitive reference for all ioctls, capabilities, and kvm_run fields. https://docs.kernel.org/virt/kvm/api.html
- KVM halt polling documentation: halt_poll_ns, growth and shrink parameters, the polling algorithm. https://docs.kernel.org/virt/kvm/halt-polling.html
- KVM CPUID documentation: synthetic leaf 0x40000000 signature, 0x40000001 feature bits, KVM_FEATURE_CLOCKSOURCE. https://docs.kernel.org/virt/kvm/x86/cpuid.html
- KVM MSR documentation: paravirt MSR range 0x4b564d00–0x4b564d08, deprecated MSRs 0x11/0x12. https://docs.kernel.org/virt/kvm/x86/msr.html
- "Using the KVM API," LWN.net: the canonical C walkthrough of the KVM run loop, including the serial port putchar pattern. https://lwn.net/Articles/658511/
- LWN article on KVM_CAP_X86_USER_SPACE_MSR and the MSR userspace exit mechanism. https://lwn.net/Articles/1032708/
- Firecracker vstate/vcpu.rs: the exit dispatch loop, MMIO handler, HLT handling. https://github.com/firecracker-microvm/firecracker/blob/main/src/vmm/src/vstate/vcpu.rs
- Firecracker arch/x86_64/vcpu.rs: PIO dispatch via pio_bus. https://github.com/firecracker-microvm/firecracker/blob/main/src/vmm/src/arch/x86_64/vcpu.rs
- Firecracker CPU templates documentation: CpuidLeafModifier, RegisterValueFilter, template application. https://github.com/firecracker-microvm/firecracker/blob/main/docs/cpu_templates/cpu-templates.md
- Firecracker CPUID normalization documentation: leaf 0x1 topology, leaf 0xB extended topology, normalization pass in cpuid/normalize.rs. https://github.com/firecracker-microvm/firecracker/blob/main/docs/cpu_templates/cpuid-normalization.md
- intel/haxm vmx.h: PIO exit qualification bit layout (bits 2:0 access size, bit 3 direction, bits 31:16 port number). https://github.com/intel/haxm/blob/master/core/include/vmx.h
- rust-vmm kvm-ioctls vCPU implementation: KVM_RUN encoding and vCPU exit mapping. https://github.com/rust-vmm/kvm/blob/main/kvm-ioctls/src/ioctls/vcpu.rs
- QEMU blog, "Micro-Optimizing KVM VM-Exits" (2019): Spectre v2 retpoline overhead on the exit path, KVM Monolithic restructuring, double-digit performance recovery. https://www.qemu.org/2019/11/15/micro-optimizing-kvm-vmexits/
- Intel SDM Vol. 3C, Table C-1: basic exit reason codes, VM_EXIT_REASON field layout. https://cdrdv2-public.intel.com/812396/326019-sdm-vol-3c.pdf
- kvm_stat tool: per-VM and per-vCPU exit counters from /sys/kernel/debug/kvm/. https://github.com/torvalds/linux/blob/master/tools/kvm/kvm_stat/kvm_stat