Chapter 2: What A Virtual Machine Actually Is
Run a Linux container and a Linux VM on the same host and both feel like isolated environments: each has its own filesystem, its own process table, its own IP address. But one of them is issuing system calls directly into the host kernel's dispatch table, and the other is not. That difference is not cosmetic. It determines the threat model, the boot time, the overhead budget, and what the workload can do to the host. The boundary is drawn in hardware, at the privilege ring level the CPU enforces.
The Popek-Goldberg Theorem
In July 1974, Gerald Popek and Robert Goldberg published "Formal Requirements for Virtualizable Third Generation Architectures" in Communications of the ACM 17(7), pp. 412–421. The paper is short by modern standards, but it gave the field its vocabulary, and the problem it addressed was not abstract.
The question Popek and Goldberg asked was: what does a computer architecture need to guarantee so that a Virtual Machine Monitor (VMM) can be built for it? They wanted necessary and sufficient conditions, not a design sketch. A VMM must satisfy three properties:
- Equivalence (fidelity): a program running under the VMM behaves essentially identically to running on bare hardware, modulo timing and resource availability.
- Resource control (safety): the VMM maintains complete control over all virtualized resources. Guest code cannot directly access or modify resources the VMM has not granted.
- Efficiency (performance): a statistically dominant fraction of instructions executes without VMM intervention.
Efficiency is the property that constrains architecture design. If the VMM must intercept every instruction, a guest runs orders of magnitude slower than native. The only way to satisfy efficiency while preserving equivalence and control is trap-and-emulate: run the guest natively for most instructions and intercept only the instructions that touch privileged state.
For trap-and-emulate to work, the architecture must cooperate. Popek and Goldberg defined the relevant instruction classes. Privileged instructions trap when the processor is in user mode and do not trap when it is in supervisor mode. Sensitive instructions are those that either alter resource configuration (control-sensitive) or whose behavior depends on resource configuration (behavior-sensitive). Innocuous instructions — neither control-sensitive nor behavior-sensitive — require no VMM intervention and run at full hardware speed. The theorem follows directly:
"For any conventional third-generation computer, an effective VMM may be constructed if the set of sensitive instructions for that computer is a subset of the set of privileged instructions."
If every sensitive instruction is also privileged, then a VMM can deprivilege the guest OS — run it in a non-privileged ring — and know that any time the guest tries to touch privileged state, the CPU will trap to the VMM. The VMM inspects the instruction, emulates the intended effect within the guest's view of the machine, and returns control. Guest user-mode code never needs interception at all because it cannot access privileged state in the first place.
The 1974 paper analyzed the IBM 360, Honeywell 6000, and PDP-10. It predates x86 by several years, which turned out to matter enormously.
Why x86 Broke the Theorem
The IA-32 instruction set contains sensitive but non-privileged instructions — they execute at ring 3 or ring 1 without trapping, which means a VMM running the guest OS in a deprivileged ring cannot intercept them. The first-person account of the problem, and the definitive primary source for the count and consequences, is Bugnion, Devine, Govil, and Rosenblum, "Bringing Virtualization to the x86 Architecture with the Original VMware Workstation," ACM TOCS 2012, DOI 10.1145/2382553.2382554.
The most instructive of these is POPF. At ring 0, POPF pops a value from the stack into EFLAGS, including the Interrupt Flag (IF, bit 9) and the I/O Privilege Level field. At ring 1 with IOPL=0, the situation changes silently: CPL > IOPL, so POPF completes but does not modify the IF bit and does not trap. A VMM that deprivileges the guest OS kernel to ring 1 faces this consequence: a guest POPF that should disable interrupts — the instruction a kernel issues before a critical section — is silently swallowed. The guest OS believes interrupts are disabled. They are not. The CPU does not trap. The VMM never knows. The guest's interrupt-masking logic is now broken in a way that produces subtle, unpredictable misbehavior rather than an immediate fault.
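The silent failure is easier to see in a small model. The sketch below is a simplified simulation of protected-mode POPF, assuming only the IF and IOPL rules described above; it ignores reserved bits, virtual-8086 mode, and every other flag rule, and the function name `popf` is purely illustrative.

```python
IF_MASK = 1 << 9                  # Interrupt Flag, EFLAGS bit 9
IOPL_SHIFT = 12
IOPL_MASK = 0b11 << IOPL_SHIFT    # I/O Privilege Level, EFLAGS bits 13:12
EFLAGS_MASK = 0xFFFFFFFF


def popf(eflags: int, popped: int, cpl: int) -> int:
    """Return the new EFLAGS after POPF in protected mode (simplified model).

    Bits the current privilege level may not change keep their old value.
    No fault is raised, which is exactly the virtualization hazard.
    """
    iopl = (eflags & IOPL_MASK) >> IOPL_SHIFT
    preserved = 0
    if cpl > 0:
        preserved |= IOPL_MASK    # only ring 0 may change IOPL
    if cpl > iopl:
        preserved |= IF_MASK      # IF is writable only when CPL <= IOPL
    return ((eflags & preserved) | (popped & ~preserved)) & EFLAGS_MASK


# A guest kernel tries to disable interrupts by popping a value with IF clear.
before = IF_MASK                  # IF set, IOPL = 0
at_ring0 = popf(before, popped=0, cpl=0)
at_ring1 = popf(before, popped=0, cpl=1)

print("ring 0: IF", "set" if at_ring0 & IF_MASK else "clear")
print("ring 1: IF", "set" if at_ring1 & IF_MASK else "clear")
```

At ring 0 the flag is cleared as requested; at ring 1 the same instruction leaves it set and returns normally, so nothing tells the VMM that the guest's intent was dropped.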
PUSHF creates the mirror problem: it pushes EFLAGS to the stack and thereby reveals the real IF state to the guest. SGDT and SIDT are more aggressive still: they store the actual host GDTR and IDTR values, including the base linear addresses of the descriptor tables, to any guest-supplied memory address at any privilege level. A VMM running with real descriptor tables loaded cannot prevent a guest from reading those addresses. This violates the resource-control property directly: the VMM is not in complete control of all virtualized resources. SLDT stores the real LDTR, STR stores the real task register, and SMSW stores the low 16 bits of CR0. Together with instructions whose behavior varies by privilege level (LAR, LSL, VERR, and VERW; segment register manipulation via POP, PUSH, and MOV; and the far control transfers CALL FAR, JMP FAR, INT n, and RETF), the full tally from the Bugnion et al. retrospective is seventeen or eighteen sensitive-but-not-privileged instructions on IA-32, depending on whether MOV to and from segment registers is counted as one or two distinct problematic forms.
Intel introduced UMIP (User-Mode Instruction Prevention, CR4.UMIP, bit 11) as a late partial mitigation: when enabled, SGDT, SIDT, SLDT, SMSW, and STR at CPL > 0 raise #GP(0). UMIP appeared with Cannon Lake and Goldmont Plus (Intel) and Zen 2 (AMD). It addresses the descriptor-table leak at user mode but does not retroactively make IA-32 virtualizable under Theorem 1 — it arrived decades after the problem was first identified and does not cover the full set.
The consequence is that IA-32 is, in Popek and Goldberg's formal sense, not virtualizable. Building a VMM for x86 required working around the architecture.
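The theorem's condition is a plain subset test, which can be sketched directly. The instruction sets below are a small illustrative excerpt taken from the instructions named in this chapter, not a complete classification of IA-32, and the function names are hypothetical.

```python
# Illustrative excerpt only: a handful of IA-32 instructions, classified the
# way the text describes them. This is not the full instruction set.
PRIVILEGED = {"LGDT", "LIDT", "LLDT", "LTR", "MOV CR3"}
SENSITIVE = PRIVILEGED | {"POPF", "PUSHF", "SGDT", "SIDT", "SLDT", "SMSW", "STR"}


def is_classically_virtualizable(sensitive: set, privileged: set) -> bool:
    """Popek-Goldberg condition: every sensitive instruction must trap."""
    return sensitive <= privileged


def untrappable(sensitive: set, privileged: set) -> list:
    """Sensitive instructions that run in a deprivileged ring without trapping."""
    return sorted(sensitive - privileged)


print(is_classically_virtualizable(SENSITIVE, PRIVILEGED))
print(untrappable(SENSITIVE, PRIVILEGED))
```

The set difference is the list of instructions a trap-and-emulate VMM never gets to see, which is the work the three approaches in the next section each handle in a different way.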
Three Approaches to an Unvirtualizable Architecture
Binary Translation
VMware's solution, first shipped as VMware Workstation in 1999, was dynamic binary translation. The VMM occupies ring 0. The guest OS kernel runs at ring 1 — a technique called ring compression because the full ring gap between user mode and supervisor mode is compressed into a single step. Guest applications continue at ring 3.
Before any basic block of guest kernel code executes, the VMM scans it for sensitive-but-not-privileged instructions and rewrites them with safe equivalents — typically calls into the VMM itself. Translated blocks are stored in a code cache so that frequently executed kernel paths are rewritten only once; the amortized overhead on a warm cache is small. Guest user-mode code runs natively without scanning, because user-mode code cannot issue privileged instructions even on bare hardware.
This achieves what is usually called full virtualization: the guest OS runs unmodified. The same kernel binary that boots on bare hardware boots inside the VM. The cost is translation overhead on guest kernel paths, particularly for kernel code that exercises the problematic instructions frequently.
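The scan, rewrite, and cache cycle can be illustrated with a toy translator. This is a sketch under heavy assumptions, not VMware's implementation: it works on mnemonic strings rather than machine code, it never invalidates the cache when guest code changes, and the `vmm_emulate_*` targets are hypothetical names.

```python
# Subset of the sensitive-but-not-privileged instructions named in the text.
SENSITIVE_UNPRIVILEGED = {"POPF", "PUSHF", "SGDT", "SIDT", "SLDT", "SMSW", "STR"}


class Translator:
    """Toy dynamic binary translator with a code cache keyed by block address."""

    def __init__(self):
        self.code_cache = {}
        self.translations = 0     # how many blocks were actually rewritten

    def translate(self, block_addr: int, block: list) -> list:
        cached = self.code_cache.get(block_addr)
        if cached is not None:
            return cached         # warm cache: no scanning, no rewriting
        out = []
        for insn in block:
            mnemonic = insn.split()[0]
            if mnemonic in SENSITIVE_UNPRIVILEGED:
                # Replace the instruction with a call into the VMM.
                out.append(f"CALL vmm_emulate_{mnemonic.lower()}")
            else:
                out.append(insn)  # innocuous: copied through unchanged
        self.code_cache[block_addr] = out
        self.translations += 1
        return out


translator = Translator()
guest_block = ["PUSHF", "MOV EAX, EBX", "POPF", "RET"]
first = translator.translate(0x1000, guest_block)
second = translator.translate(0x1000, guest_block)   # served from the cache
print(first, translator.translations)
```

The second call returns the cached block without rescanning, which is why the amortized cost on hot kernel paths stays small.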
Paravirtualization
The Xen hypervisor, presented at SOSP 2003 by Barham, Dragovic, Fraser, Hand, Harris, Ho, Neugebauer, Pratt, and Warfield, took a different position: instead of hiding the fact of virtualization, tell the guest OS it is running in a VM and let it cooperate. The hypervisor occupies ring 0. The guest OS kernel runs at ring 1. Guest applications run at ring 3.
The 17 problematic instructions are replaced in the guest OS kernel with explicit hypercalls — direct calls into the hypervisor's published interface. A guest that wants to update its page tables calls __HYPERVISOR_mmu_update. A guest that wants to set a new GDT calls __HYPERVISOR_set_gdt. The hypervisor validates the request and performs the operation. There is no code cache, no scanning, no rewriting. The path is direct.
The trade-off is that the guest OS must be ported. Xen shipped with modified Linux and NetBSD kernels; unmodified guests such as Windows had to wait for Xen 3.0, which ran them with hardware-assisted virtualization (HVM) rather than paravirtualization. In Popek-Goldberg terms, paravirtualization sidesteps the sensitive-but-not-privileged problem by making the guest replace the problematic instructions itself. It gives up strict equivalence to do so, because the guest no longer sees an interface identical to bare hardware, which places the approach outside the theorem rather than in violation of it.
Paravirtualization did not disappear when hardware-assisted virtualization arrived. The KVM paravirt MSR interface gives guests running under KVM access to information that hardware cannot provide natively: MSR_KVM_STEAL_TIME at address 0x4b564d03 reports the time a vCPU spent not scheduled on a physical core; MSR_KVM_SYSTEM_TIME_NEW at 0x4b564d01 provides a high-precision monotonic clock that accounts for TSC migration between cores. A guest detects KVM through the signature in CPUID leaf 0x40000000, reads the feature bits in EAX of leaf 0x40000001 (bit 3 gates MSR_KVM_SYSTEM_TIME_NEW, bit 5 gates MSR_KVM_STEAL_TIME), and opts into these interfaces by writing to the MSRs. CPU and memory virtualization are hardware-assisted; the time-keeping interface is still paravirtual.
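A guest's detection logic reduces to a signature comparison and a bit test. Python cannot execute CPUID itself, so the sketch below assumes the register values have already been read; the sample values are made up for illustration. It uses the KVM signature leaf 0x40000000 and the feature leaf 0x40000001, with bit positions as defined in the kernel's KVM paravirt headers.

```python
import struct

KVM_CPUID_SIGNATURE = 0x40000000   # EBX, ECX, EDX spell "KVMKVMKVM\0\0\0"
KVM_CPUID_FEATURES = 0x40000001    # EAX holds the feature bits
KVM_FEATURE_CLOCKSOURCE2 = 3       # gates MSR_KVM_SYSTEM_TIME_NEW
KVM_FEATURE_STEAL_TIME = 5         # gates MSR_KVM_STEAL_TIME


def is_kvm(ebx: int, ecx: int, edx: int) -> bool:
    """Check the 12-byte hypervisor signature returned by the signature leaf."""
    return struct.pack("<III", ebx, ecx, edx) == b"KVMKVMKVM\0\0\0"


def has_feature(eax: int, bit: int) -> bool:
    """Test one feature bit from EAX of the feature leaf."""
    return bool(eax >> bit & 1)


# Hypothetical register contents, standing in for real CPUID results.
ebx, ecx, edx = struct.unpack("<III", b"KVMKVMKVM\0\0\0")
features_eax = (1 << KVM_FEATURE_CLOCKSOURCE2) | (1 << KVM_FEATURE_STEAL_TIME)

print(is_kvm(ebx, ecx, edx))
print(has_feature(features_eax, KVM_FEATURE_CLOCKSOURCE2))
```

A guest that sees the bit set would then write the guest-physical address of its time structure to the corresponding MSR to enable the interface.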
Hardware-Assisted Virtualization
The cleanest solution was to fix the architecture. Intel shipped the first VT-x processors on November 14, 2005 — the Pentium 4 Prescott 2M, models 662 and 672. AMD followed with AMD-V on May 23, 2006 in Athlon 64 "Orleans" and Athlon 64 X2 "Windsor" processors. Both extensions solve the Popek-Goldberg problem by adding new CPU operating modes that make sensitive instructions either trap or become irrelevant.
VMX: The Mechanics of Hardware-Assisted Virtualization
Intel's implementation is called VMX (Virtual Machine Extensions). The CPU gains two orthogonal sub-modes, each with a full ring-0-through-ring-3 hierarchy:
- VMX root operation: where the hypervisor runs. Full access to all CPU state.
- VMX non-root operation: where guest code runs. Physically on the same silicon, but in a mode where designated instructions and events cause an automatic transfer of control to the hypervisor.
```mermaid
flowchart TB
    subgraph root["VMX Root Operation"]
        vmm["VMM / KVM (ring 0)"]
        hostuser["Host userspace (ring 3)"]
    end
    subgraph nonroot["VMX Non-Root Operation"]
        guestkernel["Guest kernel (ring 0)"]
        guestuser["Guest userspace (ring 3)"]
    end
    vmm -- "VMLAUNCH / VMRESUME" --> guestkernel
    guestkernel -- "VM exit" --> vmm
    guestuser -- "syscall" --> guestkernel
```
The transition from VMX non-root to VMX root — a VM exit — happens automatically when the guest executes an instruction or triggers an event configured to cause an exit. The CPU saves the guest's register state, loads the host's register state, and jumps to the VMM's exit handler. The transition in the opposite direction — a VM entry — happens when the VMM executes VMLAUNCH (on the first entry for a given vCPU) or VMRESUME (on every subsequent entry). The guest OS sees none of this machinery. It runs at non-root ring 0 and believes it is the most privileged software on the machine; there is no software-visible register the guest can read to discover that it is in VMX non-root mode.
The data structure that controls all of this is the VMCS (Virtual Machine Control Structure): a 4 KiB-aligned in-memory region, one per vCPU, whose physical address is tracked by the CPU after the hypervisor executes VMPTRLD on it. The VMCS has six logical groups:
| Group | Purpose |
|---|---|
| Guest-state area | RSP, RIP, RFLAGS, CR0/CR3/CR4, segment registers and descriptors — saved on VM exit, loaded on VM entry |
| Host-state area | Host RIP, RSP, CR0/CR3/CR4, segment selectors — loaded on VM exit so the VMM resumes at a fixed handler |
| VM-execution control fields | Which events cause conditional exits (HLT, I/O ports, MSR reads/writes) and which features, such as EPT, are enabled |
| VM-exit control fields | Exit-path behavior: 64-bit host, which MSRs to save/load |
| VM-entry control fields | Entry-path behavior: event injection, which MSRs to load |
| VM-exit information fields | Read-only after exit: exit reason, exit qualification, guest linear address |
The VMM reads and writes VMCS fields using the VMREAD and VMWRITE instructions, which take a 32-bit field encoding as an operand; there is no supported way to access the VMCS bytes directly in memory. The physical layout is implementation-defined by the CPU vendor and not documented. The Linux kernel's VMCS field encodings are in arch/x86/include/asm/vmx.h; for example, GUEST_RIP is 0x0000681e, GUEST_CR0 is 0x00006800, CPU_BASED_VM_EXEC_CONTROL is 0x00004002, and HOST_RIP is 0x00006c16. The CPU also maintains a per-VMCS launch state ("clear" after VMCLEAR, "launched" after a successful VMLAUNCH) that is not readable via VMREAD. The hypervisor must use VMLAUNCH on the first VM entry and VMRESUME on all subsequent ones; the hardware enforces the sequence by failing a VMLAUNCH on a launched VMCS and a VMRESUME on a clear one.
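The field encodings are not arbitrary numbers: the Intel SDM assigns meaning to individual bits (access type in bit 0, index in bits 9:1, field type in bits 11:10, width in bits 14:13). The following sketch decodes the four example encodings quoted above under that layout; the function and table names are illustrative.

```python
WIDTHS = ("16-bit", "64-bit", "32-bit", "natural-width")               # bits 14:13
TYPES = ("control", "VM-exit information", "guest state", "host state")  # bits 11:10


def decode_vmcs_field(encoding: int) -> dict:
    """Split a VMCS field encoding into its documented bit fields."""
    return {
        "access": "high" if encoding & 1 else "full",   # bit 0
        "index": (encoding >> 1) & 0x1FF,               # bits 9:1
        "type": TYPES[(encoding >> 10) & 0x3],
        "width": WIDTHS[(encoding >> 13) & 0x3],
    }


EXAMPLES = {
    "GUEST_RIP": 0x0000681E,
    "GUEST_CR0": 0x00006800,
    "CPU_BASED_VM_EXEC_CONTROL": 0x00004002,
    "HOST_RIP": 0x00006C16,
}

for name, encoding in EXAMPLES.items():
    print(name, decode_vmcs_field(encoding))
```

Decoding GUEST_RIP and HOST_RIP shows that they differ only in the type field, which is how the same register can live in both the guest-state and host-state areas.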
Some instructions cause unconditional VM exits regardless of what the VMCS execution controls say: CPUID, GETSEC, INVD, XSETBV, and all VMX instructions including VMCALL and VMLAUNCH itself. Other exits are conditional, controlled by CPU_BASED_VM_EXEC_CONTROL and the secondary controls at SECONDARY_VM_EXEC_CONTROL (0x0000401e). The exit reasons are defined in arch/x86/include/uapi/asm/vmx.h, which lists 85 distinct codes. A selection:
| Code | Cause |
|---|---|
| EXIT_REASON_EXCEPTION_NMI (0) | Guest exception or NMI |
| EXIT_REASON_CPUID (10) | Guest executed CPUID |
| EXIT_REASON_HLT (12) | Guest executed HLT |
| EXIT_REASON_VMCALL (18) | Guest issued VMCALL |
| EXIT_REASON_IO_INSTRUCTION (30) | Guest executed IN or OUT |
| EXIT_REASON_MSR_READ (31) | Guest executed RDMSR |
| EXIT_REASON_MSR_WRITE (32) | Guest executed WRMSR |
| EXIT_REASON_EPT_VIOLATION (48) | Guest accessed an unmapped guest-physical address |
| EXIT_REASON_EPT_MISCONFIG (49) | EPT paging structure misconfigured |
Intel added 13 new instructions for the VMX interface:
| Instruction | Opcode | Purpose |
|---|---|---|
| VMXON m64 | F3 0F C7 /6 | Enter VMX operation |
| VMXOFF | NP 0F 01 C4 | Exit VMX operation |
| VMPTRLD m64 | NP 0F C7 /6 | Make a VMCS current |
| VMPTRST m64 | NP 0F C7 /7 | Store current VMCS pointer |
| VMCLEAR m64 | 66 0F C7 /6 | Initialize or clear a VMCS |
| VMLAUNCH | NP 0F 01 C2 | VM entry (first launch) |
| VMRESUME | NP 0F 01 C3 | VM entry (resume after exit) |
| VMREAD reg,r/m | NP 0F 78 /r | Read a VMCS field |
| VMWRITE reg,r/m | NP 0F 79 /r | Write a VMCS field |
| VMCALL | NP 0F 01 C1 | Hypercall from guest to VMM |
| INVEPT reg,m128 | 66 0F 38 80 /r | Invalidate EPT TLB entries (Nehalem+) |
| INVVPID reg,m128 | 66 0F 38 81 /r | Invalidate VPID TLB entries |
| VMFUNC | NP 0F 01 D4 | Invoke a VM function (Haswell / Silvermont+) |
Enabling VMX requires setting CR4.VMXE (bit 13) and then executing VMXON with the physical address of a 4 KiB-aligned VMXON region. Before VMXON, software reads the IA32_VMX_BASIC MSR at address 0x480 to obtain the VMCS revision identifier (bits 30:0) and writes it into bytes 0–3 of the VMXON region; a mismatch causes VMXON to fail. The MSR also encodes the size of both regions in bits 44:32 and the memory type for VMCS access in bits 53:50 (6 = write-back).
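The bit layout above translates into a few shifts and masks. The sketch below decodes a made-up IA32_VMX_BASIC value and prepares an in-memory image of a VMXON region; reading the real MSR requires ring 0, and a real hypervisor would use a page-aligned physical page rather than a Python bytearray. The sample value and function names are assumptions for illustration.

```python
import struct


def decode_vmx_basic(msr: int) -> dict:
    """Extract the fields of IA32_VMX_BASIC that VMXON setup depends on."""
    return {
        "revision_id": msr & 0x7FFFFFFF,       # bits 30:0
        "region_size": (msr >> 32) & 0x1FFF,   # bits 44:32, in bytes
        "memory_type": (msr >> 50) & 0xF,      # bits 53:50, 6 = write-back
    }


def make_vmxon_region(msr: int) -> bytearray:
    """Build a zeroed region image with the revision identifier in bytes 0-3."""
    fields = decode_vmx_basic(msr)
    region = bytearray(fields["region_size"])
    struct.pack_into("<I", region, 0, fields["revision_id"])
    return region


# Hypothetical MSR value: write-back memory type, 4096-byte regions, revision 0x12.
SAMPLE_VMX_BASIC = (6 << 50) | (4096 << 32) | 0x12

fields = decode_vmx_basic(SAMPLE_VMX_BASIC)
region = make_vmxon_region(SAMPLE_VMX_BASIC)
print(fields, region[:4].hex())
```

The same revision identifier must also be written at the start of every VMCS region before VMPTRLD, which is why hypervisors read this MSR once and keep the decoded fields around.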
Memory Virtualization: The Second Translation
The VMCS and VM exits handle CPU state. Memory requires a parallel mechanism. A guest OS manages its own page tables, mapping guest-virtual addresses to what it believes are physical addresses. The host cannot use those "physical" addresses directly — the guest does not own physical DRAM in the traditional sense; it owns regions of host virtual memory that the VMM has allocated.
Intel's solution, introduced with Nehalem in 2008, is EPT (Extended Page Tables). EPT adds a second level of hardware page-table translation: after the guest's page tables map a guest-virtual address to a guest-physical address, the hardware's MMU automatically walks a second set of tables, maintained by the VMM but enforced by the hardware, to translate the guest-physical address to the host-physical address. Both translations happen entirely in hardware for most accesses. INVEPT invalidates EPT-derived TLB entries. INVVPID invalidates entries tagged with a given VPID (virtual-processor identifier), the tag that lets host and guest TLB entries coexist without a full flush on every VM entry and exit.
The KVM_SET_USER_MEMORY_REGION ioctl, covered in the next section, is what tells KVM where to build those EPT mappings. The guest's "physical" address 0x0 maps to whatever host-virtual address the VMM passed as userspace_addr. An EPT violation — exit reason EXIT_REASON_EPT_VIOLATION (48) — occurs when the guest accesses a guest-physical address the EPT does not yet map.
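The two translation stages can be modeled with two lookup tables. This is a deliberately flattened sketch: real guest page tables and EPT are multi-level radix trees, and the hardware also translates the guest's own page-table pointers through EPT during the walk. The dictionaries, exception classes, and frame numbers below are all hypothetical.

```python
PAGE_SIZE = 4096


class GuestPageFault(Exception):
    """Stage 1 miss: delivered to the guest kernel, no VM exit needed."""


class EptViolation(Exception):
    """Stage 2 miss: causes a VM exit so the hypervisor can map the page."""


def translate(gva: int, guest_page_table: dict, ept: dict) -> int:
    """Translate guest-virtual -> guest-physical -> host-physical."""
    gvpn, offset = divmod(gva, PAGE_SIZE)
    if gvpn not in guest_page_table:
        raise GuestPageFault(hex(gva))
    gpfn = guest_page_table[gvpn]              # stage 1: guest page tables
    if gpfn not in ept:
        raise EptViolation(hex(gpfn * PAGE_SIZE + offset))
    return ept[gpfn] * PAGE_SIZE + offset      # stage 2: EPT


guest_page_table = {0x400: 0x10, 0x401: 0x11}  # guest-virtual page -> guest-physical frame
ept = {0x10: 0x9A}                             # guest-physical frame -> host-physical frame

hpa = translate(0x400123, guest_page_table, ept)
print(hex(hpa))
```

Translating an address in page 0x401 raises `EptViolation`, because the guest has mapped the page but the hypervisor has not yet backed the guest-physical frame.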
AMD's equivalent is NPT (Nested Page Tables, also called RVI — Rapid Virtualization Indexing), introduced with the third-generation Opteron "Barcelona" (Family 0x10). The mechanism is structurally identical to EPT; the instruction for TLB invalidation by ASID is INVLPGA.
KVM: The Linux Kernel's VMM
KVM (Kernel-based Virtual Machine) was developed by Avi Kivity at Qumranet and announced on October 19, 2006. It was merged into the Linux kernel in version 2.6.20, released February 5, 2007. Qumranet was acquired by Red Hat in 2008. KVM ships as three loadable kernel modules: kvm.ko (the arch-independent core), plus either kvm-intel.ko or kvm-amd.ko depending on the CPU vendor. The arch-specific backends live in arch/x86/kvm/vmx/vmx.c (VMX) and arch/x86/kvm/svm/svm.c (SVM). KVM exposes its interface through a character device at /dev/kvm.
Note: /dev/kvm requires read/write access; on most Linux distributions that means membership in the kvm group, or root. KVM_CREATE_VM on a machine without hardware virtualization support will fail with EINVAL or ENODEV depending on the kernel version.
The KVM API is organized as a three-level file descriptor hierarchy:
```text
open("/dev/kvm")        → system fd
  KVM_CREATE_VM         → VM fd
    KVM_CREATE_VCPU     → vCPU fd
```
KVM_GET_API_VERSION on the system fd returns the constant 12 (KVM_API_VERSION = 12). This value has not changed since the API was frozen, and KVM documents it as a stable handshake; a caller that receives anything other than 12 has a kernel incompatibility. Every KVM ioctl uses KVMIO = 0xAE as the ioctl magic byte: KVM_GET_API_VERSION = _IO(KVMIO, 0x00), KVM_CREATE_VM = _IO(KVMIO, 0x01).
To run a guest, a VMM executes this sequence:
1. open("/dev/kvm") → system fd
2. ioctl(kvm_fd, KVM_CREATE_VM, 0) → VM fd (machine_type = 0 for default x86)
3. ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region): maps host mmap-backed pages into guest-physical space
4. ioctl(vm_fd, KVM_CREATE_VCPU, 0) → vCPU fd
5. mmap(NULL, mmap_size, PROT_READ|PROT_WRITE, MAP_SHARED, vcpu_fd, 0) → struct kvm_run *
6. ioctl(vcpu_fd, KVM_RUN, 0): run until a VM exit KVM cannot handle internally
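The struct passed to KVM_SET_USER_MEMORY_REGION, struct kvm_userspace_memory_region, encodes exactly how host memory becomes guest memory: a slot number, a flags word, the guest-physical base address (guest_phys_addr), the size in bytes (memory_size), and the host-virtual base address of the backing memory (userspace_addr).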
The guest's "physical" address space is a range of the VMM's virtual address space. When the guest accesses guest-physical address 0x0, EPT translates that to the host-physical page backing userspace_addr + 0x0. The terminology is genuinely disorienting: the guest believes it has physical memory; the host sees a virtual address range that its own page tables back with physical DRAM. EPT is precisely the mechanism that makes both views consistent without any VMM intervention on most accesses.
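The mapping from guest-physical to host-virtual addresses is simple offset arithmetic over that struct. The sketch below mirrors the layout of struct kvm_userspace_memory_region with ctypes and backs it with an anonymous mapping; the 2 MiB size is an arbitrary choice, the helper `guest_phys_to_host_virt` is hypothetical, and no ioctl is issued because that would need a real VM fd.

```python
import ctypes
import mmap


class KvmUserspaceMemoryRegion(ctypes.Structure):
    """Mirror of struct kvm_userspace_memory_region from <linux/kvm.h>."""
    _fields_ = [
        ("slot", ctypes.c_uint32),
        ("flags", ctypes.c_uint32),
        ("guest_phys_addr", ctypes.c_uint64),
        ("memory_size", ctypes.c_uint64),      # bytes
        ("userspace_addr", ctypes.c_uint64),   # host-virtual start of the backing memory
    ]


GUEST_MEM_SIZE = 2 * 1024 * 1024               # arbitrary size for illustration
guest_mem = mmap.mmap(-1, GUEST_MEM_SIZE)      # anonymous mapping in the VMM process
host_va = ctypes.addressof(ctypes.c_char.from_buffer(guest_mem))

region = KvmUserspaceMemoryRegion(
    slot=0, flags=0, guest_phys_addr=0,
    memory_size=GUEST_MEM_SIZE, userspace_addr=host_va,
)


def guest_phys_to_host_virt(r: KvmUserspaceMemoryRegion, gpa: int) -> int:
    """Return the VMM virtual address that backs a guest-physical address."""
    if not r.guest_phys_addr <= gpa < r.guest_phys_addr + r.memory_size:
        raise ValueError(f"guest-physical address {gpa:#x} is outside the slot")
    return r.userspace_addr + (gpa - r.guest_phys_addr)


print(ctypes.sizeof(region), hex(guest_phys_to_host_virt(region, 0x1000)))
```

Anything the VMM writes into `guest_mem` before the first KVM_RUN, such as a kernel image, is what the guest later finds at the corresponding guest-physical address.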
After KVM_RUN returns, the VMM reads kvm_run.exit_reason from the shared page and dispatches on its value.
Selected exit reasons from include/uapi/linux/kvm.h (44 constants defined through KVM_EXIT_SNP_REQ_CERTS = 43 in current mainline):
| Value | Name | Cause |
|---|---|---|
| 0 | KVM_EXIT_UNKNOWN | Hardware exit reason not recognized by KVM |
| 2 | KVM_EXIT_IO | Guest executed IN or OUT |
| 3 | KVM_EXIT_HYPERCALL | Guest executed VMCALL or VMMCALL |
| 5 | KVM_EXIT_HLT | Guest executed HLT |
| 6 | KVM_EXIT_MMIO | Guest accessed an MMIO region |
| 8 | KVM_EXIT_SHUTDOWN | Guest triple-faulted or issued a CPU reset |
| 9 | KVM_EXIT_FAIL_ENTRY | Hardware refused VM entry |
| 17 | KVM_EXIT_INTERNAL_ERROR | KVM internal error |
| 24 | KVM_EXIT_SYSTEM_EVENT | Guest requested shutdown or reset |
| 29 | KVM_EXIT_X86_RDMSR | Guest read an MSR with no in-kernel handler |
| 30 | KVM_EXIT_X86_WRMSR | Guest wrote an MSR with no in-kernel handler |
| 37 | KVM_EXIT_NOTIFY | VM notification event |
| 39 | KVM_EXIT_MEMORY_FAULT | Memory access fault |
When exit_reason is KVM_EXIT_IO (2), the io union arm carries the port number, direction, size, and data_offset. That last field is relative to the start of the kvm_run struct — the I/O data lives inline in the shared page, not in a separate buffer.
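The role of data_offset is easiest to see by decoding a shared page by hand. The sketch below builds a fake kvm_run image and parses the io arm from it. The byte offsets of exit_reason (8) and the union (32) are assumptions about the x86-64 layout of struct kvm_run and should be checked against the kernel headers in use; the fake page size and helper names are also hypothetical.

```python
import struct

KVM_EXIT_IO = 2
KVM_EXIT_IO_IN, KVM_EXIT_IO_OUT = 0, 1
EXIT_REASON_OFFSET = 8    # assumption: x86-64 offset of kvm_run.exit_reason
IO_OFFSET = 32            # assumption: x86-64 offset of the exit union
IO_FORMAT = "<BBHIQ"      # direction, size, port, count, data_offset


def decode_io_exit(run):
    """Return the decoded io arm, or None if the exit is not KVM_EXIT_IO."""
    (exit_reason,) = struct.unpack_from("<I", run, EXIT_REASON_OFFSET)
    if exit_reason != KVM_EXIT_IO:
        return None
    direction, size, port, count, data_offset = struct.unpack_from(IO_FORMAT, run, IO_OFFSET)
    # data_offset counts from the start of the kvm_run mapping itself.
    data = bytes(run[data_offset:data_offset + size * count])
    return {
        "direction": "out" if direction == KVM_EXIT_IO_OUT else "in",
        "size": size, "port": port, "count": count, "data": data,
    }


def make_fake_io_exit(port: int, payload: bytes, data_offset: int = 0x1000) -> bytearray:
    """Build a stand-in for the shared page after a one-byte-wide OUT."""
    run = bytearray(0x3000)
    struct.pack_into("<I", run, EXIT_REASON_OFFSET, KVM_EXIT_IO)
    struct.pack_into(IO_FORMAT, run, IO_OFFSET, KVM_EXIT_IO_OUT, 1, port, len(payload), data_offset)
    run[data_offset:data_offset + len(payload)] = payload
    return run


decoded = decode_io_exit(make_fake_io_exit(0x3F8, b"A"))
print(decoded)
```

For an IN exit the flow reverses: the VMM writes the device's response at the same data_offset before calling KVM_RUN again, and KVM copies it into the guest's register.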
KVM_RUN returns to userspace only when KVM cannot handle an exit internally. KVM processes many exits entirely in the kernel: EPT violations that can be satisfied by existing mappings, in-kernel APIC accesses, in-kernel PIC and PIT emulation, and MSR reads/writes it handles itself. Only exits that genuinely require device emulation or policy decisions surface to the VMM userspace process. The performance cost of each round-trip through the VM exit path (the VMCS save/restore, the ring transition, the jump to the KVM exit handler, and the return) is non-trivial. When Spectre v2 mitigations introduced retpolines on the exit path, KVM developers restructured the exit handler dispatch to eliminate the indirect calls and recovered double-digit percentage performance improvements on exit-heavy workloads. Firecracker's answer is the smallest possible device model: only the devices, and therefore only the exits, that serverless workloads require.
One Kernel or Two
The mechanical difference between a container and a VM reduces to one question: how many kernel instances are running?
A container is a process or process tree whose system calls go directly into the host kernel's dispatch table. Linux namespaces partition the kernel's exported interfaces into isolated views — one set for mount points, another for the PID space, another for the network stack — and cgroups constrain the resources each group can consume. But no namespace or cgroup puts silicon between a container process and the kernel's system-call table. Every container on a host shares one running kernel, differentiated only by namespace membership. There is no second OS image. A vulnerability reachable via any system call available inside the container is a host-wide concern.
A VM carries a complete second kernel binary. That binary (on x86-64, a compressed bzImage following the x86 Linux boot protocol documented in Documentation/arch/x86/boot.rst) is loaded by the VMM into guest-physical memory, its entry point is written into the VMCS guest-state area at GUEST_RIP, and VMLAUNCH transfers control. The guest kernel boots, builds its own process table, installs its own interrupt handlers, manages its own memory map, and runs its own init. When a process inside the VM issues a system call, the syscall enters the guest kernel at non-root ring 0. The host kernel never sees it as a system call. The host sees only VM exits (I/O port accesses, MMIO accesses, MSR reads), which KVM or the userspace VMM handles before re-entering the guest. The guest's syscall dispatch table, its kernel configuration, its module set, and its kernel CVE surface are all independent of the host.
```mermaid
flowchart TB
    subgraph container["Container (one kernel)"]
        cp["Container process"]
        hk["Host kernel syscall table"]
        cp -- "syscall" --> hk
    end
    subgraph vm["VM (two kernels)"]
        gp["Guest process"]
        gk["Guest kernel (bzImage)"]
        vmm["VMM + KVM"]
        hk2["Host kernel"]
        gp -- "syscall" --> gk
        gk -- "VM exit (I/O, MMIO, MSR)" --> vmm
        vmm -- "ioctl(vcpu_fd, KVM_RUN)" --> hk2
    end
```
A guest kernel crash surfaces as KVM_EXIT_SHUTDOWN (value 8) in the host VMM's kvm_run.exit_reason; the VMM tears down the VM, and the host kernel and all other VMs are unaffected. Escaping the VM requires exploiting the VMM or a hardware flaw in the virtualization extensions, not a guest kernel CVE. On the other side of the ledger: boot time is non-trivial even for a stripped guest kernel — tens of milliseconds to reach the first userspace process before counting VMM startup — and every VM carries its own kernel memory overhead, on the order of tens of MiB for a minimal configuration.
A container's startup time is proportional to clone(2) plus execve(2), measured in milliseconds, with no kernel boot. The isolation boundary is the host kernel's software enforcement of namespaces and cgroups, not a hardware mode transition.
Getting VM-grade isolation at something closer to container cost means understanding, and then ruthlessly minimizing, what a VMM is actually required to do.
Native Execution and Emulation Inside a VM
Within a running VM, two execution regimes alternate. Most of the time the guest runs in native execution: the CPU is in VMX non-root mode, guest ring 0 and ring 3 code executes directly on physical silicon at full hardware speed, and neither KVM nor the VMM interprets any instructions. This is what the efficiency property in Popek and Goldberg's theorem demands, and hardware-assisted virtualization delivers it: most guest instructions never involve the VMM at all.
Emulation happens when a VM exit occurs and the VMM must act before re-entering the guest. When the guest's kernel driver issues an OUT instruction to a port that maps to a virtual UART, the hardware exits to VMX root mode and KVM delivers KVM_EXIT_IO (value 2) to the VMM userspace process. The VMM reads the port, direction, and size from kvm_run.io, performs whatever logic the device model requires (in Firecracker's case, forwarding bytes to the 16550A serial emulation), and calls ioctl(vcpu_fd, KVM_RUN, 0) again. The guest never knows the UART is a data structure in the VMM's heap.
```mermaid
sequenceDiagram
    participant G as "Guest (VMX non-root)"
    participant K as "KVM (kernel)"
    participant V as "VMM userspace (firecracker)"
    G->>K: OUT to serial port (VM exit)
    K->>K: Check exit reason
    alt KVM handles in-kernel
        K->>G: VMRESUME
    else requires userspace
        K->>V: KVM_RUN returns, exit_reason = KVM_EXIT_IO
        V->>V: 16550A device model
        V->>K: ioctl(vcpu_fd, KVM_RUN, 0)
        K->>G: VMRESUME
    end
```
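The userspace half of that sequence is a dispatch loop. The sketch below models it with a scripted list of exits standing in for successive KVM_RUN returns, so it runs without a hypervisor. The serial stub, the `run_vcpu` function, and the use of 0x3F8 (the conventional COM1 base) as the UART port are illustrative assumptions, not Firecracker's code.

```python
KVM_EXIT_IO = 2
KVM_EXIT_HLT = 5
COM1_BASE = 0x3F8     # assumption: conventional COM1 I/O port base
UART_REGISTERS = 8    # a 16550A exposes eight byte-wide registers


class SerialStub:
    """Minimal stand-in for a 16550A model: only captures transmitted bytes."""

    def __init__(self):
        self.output = bytearray()

    def write(self, register: int, data: bytes) -> None:
        if register == 0:             # transmit holding register
            self.output += data


def run_vcpu(scripted_exits, serial: SerialStub) -> int:
    """Handle exits until the guest halts; return how many exits were handled.

    Each scripted exit is (exit_reason, port, data) and represents one return
    from KVM_RUN to userspace.
    """
    handled = 0
    for exit_reason, port, data in scripted_exits:
        handled += 1
        if exit_reason == KVM_EXIT_IO and COM1_BASE <= port < COM1_BASE + UART_REGISTERS:
            serial.write(port - COM1_BASE, data)   # device model work
        elif exit_reason == KVM_EXIT_HLT:
            break                                  # guest is done
        else:
            raise RuntimeError(f"unhandled exit {exit_reason} at port {port:#x}")
    return handled


serial = SerialStub()
exits = [
    (KVM_EXIT_IO, COM1_BASE, b"h"),
    (KVM_EXIT_IO, COM1_BASE, b"i"),
    (KVM_EXIT_HLT, 0, b""),
]
handled = run_vcpu(exits, serial)
print(handled, bytes(serial.output))
```

Every entry in the scripted list corresponds to one full round-trip out of the guest, which is why the number of devices a VMM emulates bears so directly on its overhead.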
EPT violations that KVM can resolve — a first access to a mapped but not yet faulted-in page — are handled in the page-fault path without the VMM process being involved. Each exit that does surface to userspace carries the full cost of the VMM round-trip: VMCS save/restore, ring transition, exit handler dispatch, and re-entry. KVM_RUN returns to userspace only for exits the VMM must handle itself.
The exits that do reach userspace are the ones that constitute the device model. A traditional VMM like qemu-system-x86_64 emulates dozens of virtual devices, each with its own I/O port ranges and MMIO regions; every device access by the guest is a VM exit, a round-trip through the exit path, and a return to the guest. Firecracker reduces this to the smallest device set that serverless workloads require — virtio block, virtio net, virtio vsock, virtio rng, a 16550A serial console, an i8042 keyboard controller stub, and a balloon device — and devices that are not present cannot generate exits. The device count is a directly tunable multiplier on VMM overhead. Chapter 3 returns to what that overhead actually costs.
Sources And Further Reading
- Popek, G. J., and Goldberg, R. P., "Formal Requirements for Virtualizable Third Generation Architectures," Communications of the ACM 17(7), pp. 412–421, July 1974. The paper establishing the three VMM properties and the trap-and-emulate theorem. https://dl.acm.org/doi/10.1145/361011.361073
- Wikipedia treatment of the Popek-Goldberg requirements with original citation and theorem text: https://en.wikipedia.org/wiki/Popek_and_Goldberg_virtualization_requirements
- "The Morning Paper" summary of the Popek-Goldberg paper: https://blog.acolyer.org/2016/02/19/formal-requirements-for-virtualizable-third-generation-architectures/
- Bugnion, E., Devine, S., Govil, K., and Rosenblum, M., "Bringing Virtualization to the x86 Architecture with the Original VMware Workstation," ACM Transactions on Computer Systems, 2012. The definitive primary source on binary translation and the 17 sensitive-but-not-privileged IA-32 instructions. https://dl.acm.org/doi/pdf/10.1145/2382553.2382554
- Barham, P., et al., "Xen and the Art of Virtualization," ACM SOSP 2003. Introduces paravirtualization and the hypercall model. https://www.cl.cam.ac.uk/research/srg/netos/papers/2003-xensosp.pdf
- Xen hypercall table reference (__HYPERVISOR_* constants from xen.h): https://xenbits.xen.org/docs/unstable/hypercall/x86_64/include,public,xen.h.html
- POPF/PUSHF behavior by privilege level: https://www.felixcloutier.com/x86/popf:popfd:popfq
- LWN article on UMIP (CR4.UMIP): https://lwn.net/Articles/721957/
- SGDT/SIDT information-leak discussion: http://hypervsir.blogspot.com/2014/10/kernel-information-leak-with.html
- x86 virtualization history (VT-x ship date November 14, 2005; AMD-V ship date May 23, 2006): https://en.wikipedia.org/wiki/X86_virtualization
- Intel SDM Volume 3C (VMX): https://cdrdv2-public.intel.com/671506/326019-sdm-vol-3c.pdf
- VMLAUNCH/VMRESUME instruction reference: https://www.felixcloutier.com/x86/vmlaunch:vmresume
- IA32_VMX_BASIC MSR field layout (revision identifier, region size, memory type): https://scrapbox.io/vmm/IA32_VMX_BASIC
- VMX instruction opcode table: https://en.wikipedia.org/wiki/List_of_x86_virtualization_instructions
- Linux kernel VMCS field encodings (arch/x86/include/asm/vmx.h): https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/vmx.h
- Linux kernel VMX exit reason codes (arch/x86/include/uapi/asm/vmx.h): https://github.com/torvalds/linux/blob/master/arch/x86/include/uapi/asm/vmx.h
- AMD APM Volume 2 (doc 24593), the authoritative SVM / VMCB reference: https://www.amd.com/system/files/TechDocs/24593.pdf
- Second Level Address Translation (EPT / NPT): https://en.wikipedia.org/wiki/Second_Level_Address_Translation
- Avi Kivity, "KVM: the Linux Virtual Machine Monitor," OLS 2007: https://www.kernel.org/doc/ols/2007/ols2007v1-pages-225-230.pdf
- KVM merged into Linux 2.6.20 — LWN history: https://lwn.net/Articles/705160/
- KVM API documentation (canonical): https://docs.kernel.org/virt/kvm/api.html
- "Using the KVM API," LWN.net — walkthrough of the three-level fd hierarchy with annotated C code: https://lwn.net/Articles/658511/
- Linux UAPI kvm.h (exit reason values, kvm_run struct): https://github.com/torvalds/linux/blob/master/include/uapi/linux/kvm.h
- KVM paravirt MSR documentation (MSR_KVM_STEAL_TIME, MSR_KVM_SYSTEM_TIME_NEW, CPUID leaf 0x40000001): https://www.kernel.org/doc/html/latest/virt/kvm/x86/msr.html
- QEMU blog, micro-optimizing KVM VM exits (Spectre retpoline overhead): https://www.qemu.org/2019/11/15/micro-optimizing-kvm-vmexits/
- Linux x86 boot protocol (Documentation/arch/x86/boot.rst): https://www.kernel.org/doc/html/latest/arch/x86/boot.html
- Firecracker design document: https://github.com/firecracker-microvm/firecracker/blob/main/docs/design.md
- OSDev VMX reference: https://wiki.osdev.org/VMX