Chapter 2: What A Virtual Machine Actually Is

Run a Linux container and a Linux VM on the same host and both feel like isolated environments: each has its own filesystem, its own process table, its own IP address. But one of them is issuing system calls directly into the host kernel's dispatch table, and the other is not. That difference is not cosmetic. It determines the threat model, the boot time, the overhead budget, and what the workload can do to the host. The boundary is drawn in hardware, at the privilege ring level the CPU enforces.

The Popek-Goldberg Theorem

In July 1974, Gerald Popek and Robert Goldberg published "Formal Requirements for Virtualizable Third Generation Architectures" in Communications of the ACM 17(7), pp. 412–421. The paper is short by modern standards, but it gave the field its vocabulary, and the problem it addressed was not abstract.

The question Popek and Goldberg asked was: what does a computer architecture need to guarantee so that a Virtual Machine Monitor (VMM) can be built for it? They wanted necessary and sufficient conditions, not a design sketch. A VMM must satisfy three properties:

Equivalence (fidelity): a program running under the VMM behaves essentially identically to running on bare hardware, modulo timing and resource availability.
Resource control (safety): the VMM maintains complete control over all virtualized resources. Guest code cannot directly access or modify resources the VMM has not granted.
Efficiency (performance): a statistically dominant fraction of instructions executes without VMM intervention.

Efficiency is the property that constrains architecture design. If the VMM must intercept every instruction, a guest runs orders of magnitude slower than native. The only way to satisfy efficiency while preserving equivalence and control is trap-and-emulate: run the guest natively for most instructions and intercept only the instructions that touch privileged state.

For trap-and-emulate to work, the architecture must cooperate. Popek and Goldberg defined the relevant instruction classes. Privileged instructions trap when the processor is in user mode and do not trap when it is in supervisor mode. Sensitive instructions are those that either alter resource configuration (control-sensitive) or whose behavior depends on resource configuration (behavior-sensitive). Innocuous instructions — neither control-sensitive nor behavior-sensitive — require no VMM intervention and run at full hardware speed. The theorem follows directly:

"For any conventional third-generation computer, an effective VMM may be constructed if the set of sensitive instructions for that computer is a subset of the set of privileged instructions."

If every sensitive instruction is also privileged, then a VMM can deprivilege the guest OS — run it in a non-privileged ring — and know that any time the guest tries to touch privileged state, the CPU will trap to the VMM. The VMM inspects the instruction, emulates the intended effect within the guest's view of the machine, and returns control. Guest user-mode code never needs interception at all because it cannot access privileged state in the first place.

The 1974 paper analyzed the IBM 360, Honeywell 6000, and PDP-10. It predates x86 by several years, which turned out to matter enormously.

Why x86 Broke the Theorem

The IA-32 instruction set contains sensitive but non-privileged instructions — they execute at ring 3 or ring 1 without trapping, which means a VMM running the guest OS in a deprivileged ring cannot intercept them. The first-person account of the problem, and the definitive primary source for the count and consequences, is Bugnion, Devine, Govil, and Rosenblum, "Bringing Virtualization to the x86 Architecture with the Original VMware Workstation," ACM TOCS 2012, DOI 10.1145/2382553.2382554.

The most instructive of these is POPF. At ring 0, POPF pops a value from the stack into EFLAGS, including the Interrupt Flag (IF, bit 9) and the I/O Privilege Level field. At ring 1 with IOPL=0, the situation changes silently: CPL > IOPL, so POPF completes but does not modify the IF bit and does not trap. A VMM that deprivileges the guest OS kernel to ring 1 faces this consequence: a guest POPF that should disable interrupts — the instruction a kernel issues before a critical section — is silently swallowed. The guest OS believes interrupts are disabled. They are not. The CPU does not trap. The VMM never knows. The guest's interrupt-masking logic is now broken in a way that produces subtle, unpredictable misbehavior rather than an immediate fault.
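The privilege-dependent behavior is easy to state as a function. The sketch below is a deliberately simplified model of protected-mode POPF, not a complete description of the instruction: it ignores reserved bits, the VM and RF flags, and the virtual-interrupt extensions, and the name popf_model is made up for illustration.

```c
#include <stdint.h>

#define EFLAGS_IF        (1u << 9)    /* Interrupt Flag */
#define EFLAGS_IOPL_MASK (3u << 12)   /* I/O Privilege Level, bits 13:12 */

/* Simplified model of protected-mode POPF: returns the EFLAGS value that
 * results from popping `popped` while running at privilege level `cpl`.
 * Assumption: every bit other than IF and IOPL is treated as freely
 * writable, which is not true of real hardware. */
static uint32_t popf_model(uint32_t eflags, uint32_t popped, unsigned cpl)
{
    unsigned iopl = (eflags & EFLAGS_IOPL_MASK) >> 12;
    uint32_t writable = ~0u;

    if (cpl > 0)
        writable &= ~EFLAGS_IOPL_MASK;  /* only ring 0 may change IOPL */
    if (cpl > iopl)
        writable &= ~EFLAGS_IF;         /* IF silently preserved, no trap */

    return (eflags & ~writable) | (popped & writable);
}
```

Calling popf_model(EFLAGS_IF, 0, 0) clears IF, as a kernel at ring 0 expects. The same call with cpl set to 1 returns with IF still set, and nothing in the return path signals that the write was dropped. That missing signal is exactly what a trap-and-emulate VMM would need.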

PUSHF creates the mirror problem: it pushes EFLAGS to the stack and thereby reveals the real IF state to the guest. SGDT and SIDT are more aggressive still: they store the actual host GDTR and IDTR values, including the base linear addresses of the descriptor tables, to any guest-supplied memory address at any privilege level. A VMM running with real descriptor tables loaded cannot prevent a guest from reading those addresses. This violates the resource-control property directly: the VMM is not in complete control of all virtualized resources. SLDT stores the real LDTR selector, STR stores the real task register selector, and SMSW stores the low 16 bits of CR0. The remaining offenders are instructions whose behavior varies by privilege level: LAR, LSL, VERR, and VERW; segment register manipulation via POP, PUSH, and MOV; and the far control transfers CALL FAR, JMP FAR, INT n, and RETF. The full tally from the Bugnion et al. retrospective is seventeen or eighteen sensitive-but-not-privileged instructions on IA-32, depending on whether MOV to and from segment registers is counted as one or two distinct problematic forms.

Intel introduced UMIP (User-Mode Instruction Prevention, CR4.UMIP, bit 11) as a late partial mitigation: when enabled, SGDT, SIDT, SLDT, SMSW, and STR at CPL > 0 raise #GP(0). UMIP appeared with Cannon Lake and Goldmont Plus (Intel) and Zen 2 (AMD). It addresses the descriptor-table leak at user mode but does not retroactively make IA-32 virtualizable under Theorem 1 — it arrived decades after the problem was first identified and does not cover the full set.

The consequence is that IA-32 is, in Popek and Goldberg's formal sense, not virtualizable. Building a VMM for x86 required working around the architecture.
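The theorem's condition is a subset test, which can be sketched in a few lines. The table below is a hand-picked sample of four instructions, not the full IA-32 classification, and the struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

struct insn_class {
    const char *name;
    bool sensitive;   /* control-sensitive or behavior-sensitive */
    bool privileged;  /* traps when executed outside supervisor mode */
};

/* Returns NULL if every sensitive instruction is also privileged (the
 * theorem's condition holds); otherwise returns the first offender. */
static const struct insn_class *
first_violation(const struct insn_class *isa, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (isa[i].sensitive && !isa[i].privileged)
            return &isa[i];
    return NULL;
}

/* Illustrative sample only: two well-behaved entries, two offenders. */
static const struct insn_class ia32_sample[] = {
    { "ADD",  false, false },   /* innocuous */
    { "LGDT", true,  true  },   /* sensitive and privileged: traps */
    { "POPF", true,  false },   /* sensitive, does not trap */
    { "SGDT", true,  false },   /* sensitive, does not trap */
};
```

Running first_violation over the four entries returns the POPF entry. Over only the first two it returns NULL, which is the situation Popek and Goldberg require of the whole instruction set.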

Three Approaches to an Unvirtualizable Architecture

Binary Translation

VMware's solution, first shipped as VMware Workstation in 1999, was dynamic binary translation. The VMM occupies ring 0. The guest OS kernel runs at ring 1 — a technique called ring compression because the full ring gap between user mode and supervisor mode is compressed into a single step. Guest applications continue at ring 3.

Before any basic block of guest kernel code executes, the VMM scans it for sensitive-but-not-privileged instructions and rewrites them with safe equivalents — typically calls into the VMM itself. Translated blocks are stored in a code cache so that frequently executed kernel paths are rewritten only once; the amortized overhead on a warm cache is small. Guest user-mode code runs natively without scanning, because user-mode code cannot issue privileged instructions even on bare hardware.
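The scan-rewrite-cache cycle can be sketched with a toy translator. This is not VMware's design. It assumes a made-up instruction set in which every instruction is one byte long, it uses a direct-mapped cache with no eviction policy, and OP_VMM_CALL is a hypothetical marker standing in for whatever call sequence a real translator would emit. Only the POPF opcode value (0x9D) is taken from real x86.

```c
#include <stddef.h>
#include <stdint.h>

#define OP_POPF     0x9D  /* real one-byte x86 opcode for POPF */
#define OP_VMM_CALL 0xF1  /* hypothetical marker: "call into the VMM here" */
#define CACHE_SLOTS 64
#define BLOCK_MAX   32

struct tblock {
    uint32_t guest_pc;        /* address of the original guest block */
    size_t   len;
    uint8_t  code[BLOCK_MAX]; /* rewritten copy that actually executes */
    int      valid;
};

static struct tblock cache[CACHE_SLOTS];
static unsigned translations;  /* how many blocks were actually rewritten */

/* Toy assumption: every instruction is exactly one byte long, so scanning
 * is a byte loop. Real x86 needs a full variable-length decoder. */
static const struct tblock *
bt_translate(uint32_t pc, const uint8_t *src, size_t len)
{
    struct tblock *tb = &cache[pc % CACHE_SLOTS];

    if (tb->valid && tb->guest_pc == pc)
        return tb;                        /* warm cache: no rescan */

    if (len > BLOCK_MAX)
        len = BLOCK_MAX;
    for (size_t i = 0; i < len; i++)
        tb->code[i] = (src[i] == OP_POPF) ? OP_VMM_CALL : src[i];
    tb->guest_pc = pc;
    tb->len = len;
    tb->valid = 1;
    translations++;
    return tb;
}
```

Translating the same block address twice increments translations only once, which is the amortization argument in miniature: the scan cost is paid on first execution and not again while the block stays cached.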

This achieves what is now called full virtualization: the guest OS runs unmodified. The same kernel binary that boots on bare hardware boots inside the VM. The cost is translation overhead on guest kernel paths, particularly for kernel code that exercises the problematic instructions frequently.

Paravirtualization

The Xen hypervisor, presented at SOSP 2003 by Barham, Dragovic, Fraser, Hand, Harris, Ho, Neugebauer, Pratt, and Warfield, took a different position: instead of hiding the fact of virtualization, tell the guest OS it is running in a VM and let it cooperate. The hypervisor occupies ring 0. The guest OS kernel runs at ring 1. Guest applications run at ring 3.

The 17 problematic instructions are replaced in the guest OS kernel with explicit hypercalls — direct calls into the hypervisor's published interface. A guest that wants to update its page tables calls __HYPERVISOR_mmu_update. A guest that wants to set a new GDT calls __HYPERVISOR_set_gdt. The hypervisor validates the request and performs the operation. There is no code cache, no scanning, no rewriting. The path is direct.

Hypercall	Number
`__HYPERVISOR_set_trap_table`	0
`__HYPERVISOR_mmu_update`	1
`__HYPERVISOR_set_gdt`	2
`__HYPERVISOR_stack_switch`	3
`__HYPERVISOR_set_callbacks`	4
`__HYPERVISOR_memory_op`	12
`__HYPERVISOR_update_va_mapping`	14
`__HYPERVISOR_grant_table_op`	20
`__HYPERVISOR_event_channel_op`	32
`__HYPERVISOR_hvm_op`	34

The trade-off is that the guest OS must be ported. Xen shipped with modified Linux and NetBSD kernels; running an unmodified Windows guest had to wait for the hardware-assisted virtualization described below, which Xen later exposed as HVM guests. In Popek-Goldberg terms, paravirtualization sidesteps the sensitive-but-not-privileged problem by having the guest replace the problematic instructions itself. It does so by relaxing the equivalence property: the guest is no longer the unmodified program that would run on bare hardware, so the result sits outside the theorem rather than satisfying it.

Paravirtualization did not disappear when hardware-assisted virtualization arrived. The KVM paravirt MSR interface gives guests running under KVM access to information that hardware cannot provide natively: MSR_KVM_STEAL_TIME at address 0x4b564d03 reports the time a vCPU spent not scheduled on a physical core; MSR_KVM_SYSTEM_TIME_NEW at 0x4b564d01 provides a high-precision monotonic clock that accounts for vCPU migration between cores. A guest detects KVM by the "KVMKVMKVM" signature returned in CPUID leaf 0x40000000, reads the feature bits from EAX of leaf 0x40000001 (bit 3 gates MSR_KVM_SYSTEM_TIME_NEW, bit 5 gates MSR_KVM_STEAL_TIME), and opts into these interfaces by writing to the MSRs. CPU and memory virtualization are hardware-assisted; the time-keeping interface is still paravirtual.
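A guest-side check might look like the sketch below. It follows the KVM CPUID documentation, which places the signature in leaf 0x40000000 and the feature bits in EAX of leaf 0x40000001. The helper names are hypothetical, the constants are redefined locally rather than pulled from kernel headers, and the byte-wise comparison assumes a little-endian host, which x86 is.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define KVM_CPUID_SIGNATURE      0x40000000u
#define KVM_CPUID_FEATURES       0x40000001u
#define KVM_FEATURE_CLOCKSOURCE2 3   /* MSR_KVM_SYSTEM_TIME_NEW is usable */
#define KVM_FEATURE_STEAL_TIME   5   /* MSR_KVM_STEAL_TIME is usable */

/* True if EBX:ECX:EDX from leaf 0x40000000 spell "KVMKVMKVM\0\0\0". */
static bool is_kvm_signature(uint32_t ebx, uint32_t ecx, uint32_t edx)
{
    char sig[12];

    memcpy(sig + 0, &ebx, 4);
    memcpy(sig + 4, &ecx, 4);
    memcpy(sig + 8, &edx, 4);
    return memcmp(sig, "KVMKVMKVM\0\0\0", 12) == 0;
}

/* Tests one feature bit in EAX of leaf 0x40000001. */
static bool kvm_has_feature(uint32_t features_eax, unsigned bit)
{
    return ((features_eax >> bit) & 1u) != 0;
}
```

On x86 with GCC or Clang, the register values can be obtained with the __cpuid macro from <cpuid.h>. The helpers take plain integers so they can be exercised without running inside a guest.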

Hardware-Assisted Virtualization

The cleanest solution was to fix the architecture. Intel shipped the first VT-x processors on November 14, 2005 — the Pentium 4 Prescott 2M, models 662 and 672. AMD followed with AMD-V on May 23, 2006 in Athlon 64 "Orleans" and Athlon 64 X2 "Windsor" processors. Both extensions solve the Popek-Goldberg problem by adding new CPU operating modes that make sensitive instructions either trap or become irrelevant.

VMX: The Mechanics of Hardware-Assisted Virtualization

Intel's implementation is called VMX (Virtual Machine Extensions). The CPU gains two orthogonal sub-modes, each with a full ring-0-through-ring-3 hierarchy:

VMX root operation: where the hypervisor runs. Full access to all CPU state.
VMX non-root operation: where guest code runs. Physically on the same silicon, but in a mode where designated instructions and events cause an automatic transfer of control to the hypervisor.

flowchart TB
  subgraph root["VMX Root Operation"]
    vmm["VMM / KVM (ring 0)"]
    hostuser["Host userspace (ring 3)"]
  end
  subgraph nonroot["VMX Non-Root Operation"]
    guestkernel["Guest kernel (ring 0)"]
    guestuser["Guest userspace (ring 3)"]
  end
  vmm -- "VMLAUNCH / VMRESUME" --> guestkernel
  guestkernel -- "VM exit" --> vmm
  guestuser -- "syscall" --> guestkernel

The transition from VMX non-root to VMX root — a VM exit — happens automatically when the guest executes an instruction or triggers an event configured to cause an exit. The CPU saves the guest's register state, loads the host's register state, and jumps to the VMM's exit handler. The transition in the opposite direction — a VM entry — happens when the VMM executes VMLAUNCH (on the first entry for a given vCPU) or VMRESUME (on every subsequent entry). The guest OS sees none of this machinery. It runs at non-root ring 0 and believes it is the most privileged software on the machine; there is no software-visible register the guest can read to discover that it is in VMX non-root mode.

The data structure that controls all of this is the VMCS (Virtual Machine Control Structure): a 4 KiB-aligned in-memory region, one per vCPU, whose physical address is tracked by the CPU after the hypervisor executes VMPTRLD on it. The VMCS has six logical groups:

Group	Purpose
Guest-state area	RSP, RIP, RFLAGS, CR0/CR3/CR4, segment registers and descriptors — saved on VM exit, loaded on VM entry
Host-state area	Host RIP, RSP, CR0/CR3/CR4, segment selectors — loaded on VM exit so the VMM resumes at a fixed handler
VM-execution control fields	Which events cause exits: `HLT`, I/O ports, MSR reads/writes, EPT violations
VM-exit control fields	Exit-path behavior: 64-bit host, which MSRs to save/load
VM-entry control fields	Entry-path behavior: event injection, which MSRs to load
VM-exit information fields	Read-only after exit: exit reason, exit qualification, guest linear address

The VMM reads and writes VMCS fields using the VMREAD and VMWRITE instructions with 32-bit field encodings; software is not supposed to touch the VMCS bytes through ordinary memory accesses. The physical layout is implementation-defined by the CPU vendor and not documented. The Linux kernel's VMCS field encodings are in arch/x86/include/asm/vmx.h; for example, GUEST_RIP is 0x0000681e, GUEST_CR0 is 0x00006800, CPU_BASED_VM_EXEC_CONTROL is 0x00004002, and HOST_RIP is 0x00006c16. The CPU also maintains a per-VMCS launch state ("clear" after VMCLEAR, "launched" after a successful VMLAUNCH) that is not readable via VMREAD. The hypervisor must use VMLAUNCH on the first VM entry and VMRESUME on all subsequent ones; the sequence is enforced by hardware.
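The field encodings are structured rather than arbitrary. The sketch below decodes one according to the bit layout given in the Intel SDM's appendix on VMCS field encodings (access type in bit 0, index in bits 9:1, type in bits 11:10, width in bits 14:13). The type and function names are hypothetical.

```c
#include <stdint.h>

enum vmcs_width { VMCS_W16 = 0, VMCS_W64 = 1, VMCS_W32 = 2, VMCS_WNATURAL = 3 };
enum vmcs_type  { VMCS_CONTROL = 0, VMCS_EXIT_INFO = 1, VMCS_GUEST = 2, VMCS_HOST = 3 };

struct vmcs_field {
    unsigned        high;   /* bit 0: 1 = high 32 bits of a 64-bit field */
    unsigned        index;  /* bits 9:1 */
    enum vmcs_type  type;   /* bits 11:10 */
    enum vmcs_width width;  /* bits 14:13 */
};

/* Splits a 32-bit VMCS field encoding into its documented components. */
static struct vmcs_field vmcs_decode(uint32_t enc)
{
    struct vmcs_field f;

    f.high  = enc & 1u;
    f.index = (enc >> 1) & 0x1ffu;
    f.type  = (enum vmcs_type)((enc >> 10) & 3u);
    f.width = (enum vmcs_width)((enc >> 13) & 3u);
    return f;
}
```

Decoding 0x0000681e (GUEST_RIP) yields a natural-width guest-state field, 0x00006c16 (HOST_RIP) a natural-width host-state field, and 0x00004002 (CPU_BASED_VM_EXEC_CONTROL) a 32-bit control field. Each result matches the group the field belongs to in the table above.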

Some instructions cause unconditional VM exits regardless of what the VMCS execution controls say: CPUID, GETSEC, INVD, XSETBV, and most of the VMX instructions, including VMCALL and VMLAUNCH itself (VMFUNC and, when VMCS shadowing is enabled, VMREAD and VMWRITE are the exceptions). Other exits are conditional, controlled by CPU_BASED_VM_EXEC_CONTROL and the secondary controls at SECONDARY_VM_EXEC_CONTROL (0x0000401e). The exit reasons are defined in arch/x86/include/uapi/asm/vmx.h, which lists 85 distinct codes. A selection:

Code	Cause
`EXIT_REASON_EXCEPTION_NMI` (0)	Guest exception or NMI
`EXIT_REASON_CPUID` (10)	Guest executed `CPUID`
`EXIT_REASON_HLT` (12)	Guest executed `HLT`
`EXIT_REASON_VMCALL` (18)	Guest issued `VMCALL`
`EXIT_REASON_IO_INSTRUCTION` (30)	Guest executed `IN` or `OUT`
`EXIT_REASON_MSR_READ` (31)	Guest executed `RDMSR`
`EXIT_REASON_MSR_WRITE` (32)	Guest executed `WRMSR`
`EXIT_REASON_EPT_VIOLATION` (48)	Guest accessed an unmapped guest-physical address
`EXIT_REASON_EPT_MISCONFIG` (49)	EPT paging structure misconfigured

Intel added 13 new instructions for the VMX interface:

Instruction	Opcode	Purpose
`VMXON m64`	`F3 0F C7 /6`	Enter VMX operation
`VMXOFF`	`NP 0F 01 C4`	Exit VMX operation
`VMPTRLD m64`	`NP 0F C7 /6`	Make a VMCS current
`VMPTRST m64`	`NP 0F C7 /7`	Store current VMCS pointer
`VMCLEAR m64`	`66 0F C7 /6`	Initialize or clear a VMCS
`VMLAUNCH`	`NP 0F 01 C2`	VM entry (first launch)
`VMRESUME`	`NP 0F 01 C3`	VM entry (resume after exit)
`VMREAD r/m,reg`	`NP 0F 78 /r`	Read a VMCS field
`VMWRITE reg,r/m`	`NP 0F 79 /r`	Write a VMCS field
`VMCALL`	`NP 0F 01 C1`	Hypercall from guest to VMM
`INVEPT reg,m128`	`66 0F 38 80 /r`	Invalidate EPT TLB entries (Nehalem+)
`INVVPID reg,m128`	`66 0F 38 81 /r`	Invalidate VPID TLB entries
`VMFUNC`	`NP 0F 01 D4`	Invoke a VM function (Haswell / Silvermont+)

Enabling VMX requires setting CR4.VMXE (bit 13) and then executing VMXON with the physical address of a 4 KiB-aligned VMXON region. Before VMXON, software reads the IA32_VMX_BASIC MSR at address 0x480 to obtain the VMCS revision identifier (bits 30:0) and writes it into bytes 0–3 of the VMXON region; a mismatch causes VMXON to fail. The MSR also encodes the size of both regions in bits 44:32 and the memory type for VMCS access in bits 53:50 (6 = write-back).
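The field extraction described above can be sketched as follows. Reading the MSR itself requires ring 0 and is not shown; the functions take the raw 64-bit value as an argument. The names are hypothetical, and the sketch covers only the three fields mentioned in the text.

```c
#include <stdint.h>
#include <string.h>

struct vmx_basic {
    uint32_t revision_id;  /* bits 30:0 */
    uint32_t region_size;  /* bits 44:32, size in bytes of VMXON/VMCS regions */
    uint32_t memory_type;  /* bits 53:50, 6 = write-back */
};

/* Extracts the fields named in the text from a raw IA32_VMX_BASIC value. */
static struct vmx_basic vmx_basic_decode(uint64_t msr)
{
    struct vmx_basic b;

    b.revision_id = (uint32_t)(msr & 0x7fffffffu);
    b.region_size = (uint32_t)((msr >> 32) & 0x1fffu);
    b.memory_type = (uint32_t)((msr >> 50) & 0xfu);
    return b;
}

/* Writes the revision identifier into bytes 0-3 of a VMXON region.
 * The caller is assumed to have allocated a zeroed, 4 KiB-aligned region
 * of at least region_size bytes. */
static void vmx_region_stamp(void *region, const struct vmx_basic *b)
{
    memcpy(region, &b->revision_id, sizeof b->revision_id);
}
```

Because revision_id is masked to 31 bits, the stamped value always has bit 31 clear.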

Memory Virtualization: The Second Translation

The VMCS and VM exits handle CPU state. Memory requires a parallel mechanism. A guest OS manages its own page tables, mapping guest-virtual addresses to what it believes are physical addresses. The host cannot use those "physical" addresses directly — the guest does not own physical DRAM in the traditional sense; it owns regions of host virtual memory that the VMM has allocated.

Intel's solution, introduced with Nehalem in 2008, is EPT (Extended Page Tables). EPT adds a second level of hardware page-table translation: after the guest's page tables map a guest-virtual address to a guest-physical address, the hardware's MMU automatically walks a second set of tables, maintained by the VMM but enforced by the hardware, to translate the guest-physical address to the host-physical address. Both translations happen entirely in hardware for most accesses. INVEPT invalidates cached translations derived from EPT. INVVPID invalidates TLB entries tagged with a given VPID (Virtual Processor Identifier), the per-vCPU tag that lets host and guest TLB entries coexist instead of being flushed on every transition.
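The two-stage lookup can be modeled with a pair of flat arrays. This is a toy: real hardware walks multi-level radix trees for both stages, and the guest's own page-table pointers are themselves guest-physical addresses that pass through EPT. The structure and names below are assumptions chosen to show only how the two stages compose.

```c
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_MASK  ((1ull << PAGE_SHIFT) - 1)
#define NPAGES     16
#define NOT_MAPPED UINT64_MAX

/* Toy flat tables indexed by page number. */
struct toy_mmu {
    uint64_t guest_pt[NPAGES];  /* guest-virtual page  -> guest-physical page */
    uint64_t ept[NPAGES];       /* guest-physical page -> host-physical page  */
};

/* One translation stage: swap the page number, keep the page offset. */
static uint64_t toy_lookup(const uint64_t *table, uint64_t addr)
{
    uint64_t page = addr >> PAGE_SHIFT;

    if (page >= NPAGES || table[page] == NOT_MAPPED)
        return NOT_MAPPED;
    return (table[page] << PAGE_SHIFT) | (addr & PAGE_MASK);
}

/* Guest-virtual -> host-physical. A miss in stage 1 corresponds to a guest
 * page fault; a miss in stage 2 corresponds to an EPT violation. */
static uint64_t translate_gva(const struct toy_mmu *m, uint64_t gva)
{
    uint64_t gpa = toy_lookup(m->guest_pt, gva);

    if (gpa == NOT_MAPPED)
        return NOT_MAPPED;
    return toy_lookup(m->ept, gpa);
}
```

The guest owns guest_pt and never sees ept. The VMM owns ept and never needs to parse guest_pt. That separation is what removes the VMM from the common path.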

The KVM_SET_USER_MEMORY_REGION ioctl, covered in the next section, is what tells KVM where to build those EPT mappings. The guest's "physical" address 0x0 maps to whatever host-virtual address the VMM passed as userspace_addr. An EPT violation — exit reason EXIT_REASON_EPT_VIOLATION (48) — occurs when the guest accesses a guest-physical address the EPT does not yet map.

AMD's equivalent is NPT (Nested Page Tables, also called RVI — Rapid Virtualization Indexing), introduced with the third-generation Opteron "Barcelona" (Family 0x10). The mechanism is structurally identical to EPT; the instruction for TLB invalidation by ASID is INVLPGA.

KVM: The Linux Kernel's VMM

KVM (Kernel-based Virtual Machine) was developed by Avi Kivity at Qumranet and announced on October 19, 2006. It was merged into the Linux kernel in version 2.6.20, released February 5, 2007. Qumranet was acquired by Red Hat in 2008. On x86, KVM is built as three loadable kernel modules, of which two are loaded on any given host: kvm.ko (the vendor-independent core), plus either kvm-intel.ko or kvm-amd.ko depending on the CPU vendor. The vendor-specific backends live in arch/x86/kvm/vmx/vmx.c (VMX) and arch/x86/kvm/svm/svm.c (SVM). KVM exposes its interface through a character device at /dev/kvm.

Note: /dev/kvm requires read/write access — on most Linux distributions that means membership in the kvm group, or root. KVM_CREATE_VM on a machine without hardware virtualization support will fail with EINVAL or ENODEV depending on the kernel version.

The KVM API is organized as a three-level file descriptor hierarchy:

open("/dev/kvm")          → system fd
  KVM_CREATE_VM           → VM fd
    KVM_CREATE_VCPU       → vCPU fd

KVM_GET_API_VERSION on the system fd returns the constant 12 (KVM_API_VERSION = 12). This value has not changed since the API was frozen, and KVM documents it as a stable handshake; a caller that receives anything other than 12 has a kernel incompatibility. Every KVM ioctl uses KVMIO = 0xAE as the ioctl magic byte: KVM_GET_API_VERSION = _IO(KVMIO, 0x00), KVM_CREATE_VM = _IO(KVMIO, 0x01).

To run a guest, a VMM executes this sequence:

open("/dev/kvm") → system fd
ioctl(kvm_fd, KVM_CREATE_VM, 0) → VM fd (machine_type = 0 for default x86)
ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region) — maps host mmap-backed pages into guest-physical space
ioctl(vm_fd, KVM_CREATE_VCPU, 0) → vCPU fd
mmap(NULL, mmap_size, PROT_READ|PROT_WRITE, MAP_SHARED, vcpu_fd, 0) → struct kvm_run *
ioctl(vcpu_fd, KVM_RUN, 0) — run until a VM exit KVM cannot handle internally

The struct passed to KVM_SET_USER_MEMORY_REGION encodes exactly how host memory becomes guest memory:

struct kvm_userspace_memory_region {
    __u32 slot;            /* bits 0–15: slot id; bits 16–31: address space id (requires KVM_CAP_MULTI_ADDRESS_SPACE) */
    __u32 flags;           /* KVM_MEM_LOG_DIRTY_PAGES = 1; KVM_MEM_READONLY = 2 */
    __u64 guest_phys_addr;
    __u64 memory_size;     /* bytes */
    __u64 userspace_addr;  /* host virtual address of the backing mmap */
};

The guest's "physical" address space is a range of the VMM's virtual address space. When the guest accesses guest-physical address 0x0, EPT translates that to the host-physical page backing userspace_addr + 0x0. The terminology is genuinely disorienting: the guest believes it has physical memory; the host sees a virtual address range that its own page tables back with physical DRAM. EPT is precisely the mechanism that makes both views consistent without any VMM intervention on most accesses.
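From the VMM's side, the mapping is plain pointer arithmetic over one slot. The helper below is a hypothetical sketch, not a KVM API: it shows how a VMM can find the host virtual address behind a guest-physical address, for example to read a descriptor the guest has placed in its own memory.

```c
#include <linux/kvm.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the host virtual address backing `gpa` within one memory slot,
 * or NULL if `gpa` falls outside the slot's guest-physical range. */
static void *gpa_to_hva(const struct kvm_userspace_memory_region *r,
                        uint64_t gpa)
{
    if (gpa < r->guest_phys_addr ||
        gpa - r->guest_phys_addr >= r->memory_size)
        return NULL;
    return (void *)(uintptr_t)(r->userspace_addr +
                               (gpa - r->guest_phys_addr));
}
```

A VMM with several slots would search them for the one that contains the address. The bounds check matters because guest-supplied addresses are untrusted input.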

After KVM_RUN returns, the VMM reads kvm_run.exit_reason from the shared page:

struct kvm_run {
    /* in */
    __u8 request_interrupt_window;
    __u8 immediate_exit;
    __u8 padding1[6];
    /* out */
    __u32 exit_reason;
    __u8 ready_for_interrupt_injection;
    __u8 if_flag;
    __u16 flags;
    /* in/out */
    __u64 cr8;
    __u64 apic_base;
    union {
        struct { __u8 direction; __u8 size; __u16 port; __u32 count; __u64 data_offset; } io;  /* KVM_EXIT_IO */
        struct { __u64 phys_addr; __u8 data[8]; __u32 len; __u8 is_write; } mmio;              /* KVM_EXIT_MMIO */
        struct { __u64 hardware_exit_reason; } hw;                                             /* KVM_EXIT_UNKNOWN */
        struct { __u32 suberror; __u32 ndata; __u64 data[16]; } internal;                      /* KVM_EXIT_INTERNAL_ERROR */
        /* ... additional union arms ... */
    };
};

Selected exit reasons from include/uapi/linux/kvm.h (44 constants defined through KVM_EXIT_SNP_REQ_CERTS = 43 in current mainline):

Value	Name	Cause
0	`KVM_EXIT_UNKNOWN`	Hardware exit reason not recognized by KVM
2	`KVM_EXIT_IO`	Guest executed `IN` or `OUT`
3	`KVM_EXIT_HYPERCALL`	Guest executed `VMCALL` or `VMMCALL`
5	`KVM_EXIT_HLT`	Guest executed `HLT`
6	`KVM_EXIT_MMIO`	Guest accessed an MMIO region
8	`KVM_EXIT_SHUTDOWN`	Guest triple-faulted or issued a CPU reset
9	`KVM_EXIT_FAIL_ENTRY`	Hardware refused VM entry
17	`KVM_EXIT_INTERNAL_ERROR`	KVM internal error
24	`KVM_EXIT_SYSTEM_EVENT`	Guest requested shutdown or reset
29	`KVM_EXIT_X86_RDMSR`	Guest read an MSR with no in-kernel handler
30	`KVM_EXIT_X86_WRMSR`	Guest wrote an MSR with no in-kernel handler
37	`KVM_EXIT_NOTIFY`	VM notification event
39	`KVM_EXIT_MEMORY_FAULT`	Memory access fault

When exit_reason is KVM_EXIT_IO (2), the io union arm carries the port number, direction, size, and data_offset. That last field is relative to the start of the kvm_run struct — the I/O data lives inline in the shared page, not in a separate buffer.
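Putting the sequence together, the sketch below boots a seven-byte real-mode guest that writes one character to port 0x3f8 and halts. It is modeled on the well-known LWN "Using the KVM API" walkthrough rather than on any production VMM. The function name, the guest-physical load address 0x1000, and the single 4 KiB slot are illustrative choices. It is x86-only and returns -1 wherever KVM is unavailable.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/kvm.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

#if defined(__x86_64__) || defined(__i386__)
/* 16-bit real-mode guest: write 'A' to port 0x3f8, then halt. */
static const uint8_t guest_code[] = {
    0xba, 0xf8, 0x03,   /* mov $0x3f8, %dx */
    0xb0, 'A',          /* mov $'A', %al   */
    0xee,               /* out %al, (%dx)  */
    0xf4,               /* hlt             */
};

/* Runs the guest and copies bytes written to port 0x3f8 into `out`.
 * Returns the number of bytes captured, or -1 on any failure. */
static int run_tiny_guest(char *out, int cap)
{
    int n = -1, vm = -1, vcpu = -1, run_size = 0;
    void *mem = MAP_FAILED;
    struct kvm_run *run = MAP_FAILED;
    struct kvm_userspace_memory_region region;
    struct kvm_sregs sregs;
    struct kvm_regs regs;

    int kvm = open("/dev/kvm", O_RDWR | O_CLOEXEC);
    if (kvm < 0)
        return -1;
    if ((vm = ioctl(kvm, KVM_CREATE_VM, 0)) < 0)
        goto done;

    /* One page of guest memory, placed at guest-physical 0x1000. */
    mem = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
               MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        goto done;
    memcpy(mem, guest_code, sizeof guest_code);
    memset(&region, 0, sizeof region);
    region.guest_phys_addr = 0x1000;
    region.memory_size = 0x1000;
    region.userspace_addr = (uintptr_t)mem;
    if (ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region) < 0)
        goto done;

    if ((vcpu = ioctl(vm, KVM_CREATE_VCPU, 0)) < 0)
        goto done;
    run_size = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0);
    if (run_size < (int)sizeof *run)
        goto done;
    run = mmap(NULL, run_size, PROT_READ | PROT_WRITE, MAP_SHARED, vcpu, 0);
    if (run == MAP_FAILED)
        goto done;

    /* Flat code segment; begin execution at guest-physical 0x1000. */
    if (ioctl(vcpu, KVM_GET_SREGS, &sregs) < 0)
        goto done;
    sregs.cs.base = 0;
    sregs.cs.selector = 0;
    memset(&regs, 0, sizeof regs);
    regs.rip = 0x1000;
    regs.rflags = 0x2;  /* bit 1 of RFLAGS is reserved and always set */
    if (ioctl(vcpu, KVM_SET_SREGS, &sregs) < 0 ||
        ioctl(vcpu, KVM_SET_REGS, &regs) < 0)
        goto done;

    for (n = 0;;) {
        if (ioctl(vcpu, KVM_RUN, 0) < 0) { n = -1; break; }
        if (run->exit_reason == KVM_EXIT_HLT)
            break;
        if (run->exit_reason == KVM_EXIT_IO &&
            run->io.direction == KVM_EXIT_IO_OUT &&
            run->io.port == 0x3f8 && run->io.size == 1) {
            /* The data sits inside the shared page at data_offset. */
            const char *data = (const char *)run + run->io.data_offset;
            for (uint32_t i = 0; i < run->io.count && n < cap; i++)
                out[n++] = data[i];
            continue;
        }
        n = -1;  /* any other exit is unexpected for this guest */
        break;
    }
done:
    if (run != MAP_FAILED) munmap(run, (size_t)run_size);
    if (mem != MAP_FAILED) munmap(mem, 0x1000);
    if (vcpu >= 0) close(vcpu);
    if (vm >= 0) close(vm);
    close(kvm);
    return n;
}
#else
static int run_tiny_guest(char *out, int cap) { (void)out; (void)cap; return -1; }
#endif
```

On a host with KVM available, the function is expected to capture the single byte 'A'. A real VMM would retry KVM_RUN on EINTR and dispatch each port to a device model instead of matching one port inline.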

KVM_RUN returns to userspace only when KVM cannot handle an exit internally. KVM processes many exits entirely in the kernel: EPT violations that can be satisfied by existing mappings, in-kernel APIC accesses, in-kernel PIC and PIT emulation, and MSR reads/writes it handles itself. Only exits that genuinely require device emulation or policy decisions surface to the VMM userspace process. The performance cost of each round-trip through the VM exit path (the VMCS save/restore, the ring transition, the jump to the KVM exit handler, and the return) is non-trivial. When Spectre v2 mitigations introduced retpolines on the exit path, KVM developers restructured the exit-handler dispatch to eliminate the indirect calls and recovered double-digit percentage performance improvements on exit-heavy workloads. Firecracker's answer is the smallest possible device model: only the exits that serverless workloads require.

One Kernel or Two

The mechanical difference between a container and a VM reduces to one question: how many kernel instances are running?

A container is a process or process tree whose system calls go directly into the host kernel's dispatch table. Linux namespaces partition the kernel's exported interfaces into isolated views — one set for mount points, another for the PID space, another for the network stack — and cgroups constrain the resources each group can consume. But no namespace or cgroup puts silicon between a container process and the kernel's system-call table. Every container on a host shares one running kernel, differentiated only by namespace membership. There is no second OS image. A vulnerability reachable via any system call available inside the container is a host-wide concern.

A VM carries a complete second kernel binary. That binary (on x86-64, a compressed bzImage following the x86 Linux boot protocol documented in Documentation/arch/x86/boot.rst) is loaded by the VMM into guest-physical memory, its entry point is written into the VMCS guest-state area at GUEST_RIP, and VMLAUNCH transfers control. The guest kernel boots, builds its own process table, installs its own interrupt handlers, manages its own memory map, and runs its own init. When a process inside the VM issues a system call, the syscall enters the guest kernel at non-root ring 0. The host kernel never sees it as a system call. The host sees only VM exits (I/O port accesses, MMIO accesses, MSR reads), which KVM or the userspace VMM handles before re-entering the guest. The guest's syscall dispatch table, its kernel configuration, its module set, and its kernel CVE surface are all independent of the host.

flowchart TB
  subgraph container["Container (one kernel)"]
    cp["Container process"]
    hk["Host kernel syscall table"]
    cp -- "syscall" --> hk
  end
  subgraph vm["VM (two kernels)"]
    gp["Guest process"]
    gk["Guest kernel (bzImage)"]
    vmm["VMM + KVM"]
    hk2["Host kernel"]
    gp -- "syscall" --> gk
    gk -- "VM exit (I/O, MMIO, MSR)" --> vmm
    vmm -- "ioctl(vcpu_fd, KVM_RUN)" --> hk2
  end

A guest kernel crash surfaces as KVM_EXIT_SHUTDOWN (value 8) in the host VMM's kvm_run.exit_reason; the VMM tears down the VM, and the host kernel and all other VMs are unaffected. Escaping the VM requires exploiting the VMM or a hardware flaw in the virtualization extensions, not a guest kernel CVE. On the other side of the ledger: boot time is non-trivial even for a stripped guest kernel — tens of milliseconds to reach the first userspace process before counting VMM startup — and every VM carries its own kernel memory overhead, on the order of tens of MiB for a minimal configuration.

A container's startup time is proportional to clone(2) plus execve(2), measured in milliseconds, with no kernel boot. The isolation boundary is the host kernel's software enforcement of namespaces and cgroups, not a hardware mode transition.

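Closing the gap between those two profiles, VM-grade isolation at something nearer container-grade startup cost, means understanding, and then ruthlessly minimizing, what a VMM is actually required to do.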

Native Execution and Emulation Inside a VM

Within a running VM, two execution regimes alternate. Most of the time the guest runs in native execution: the CPU is in VMX non-root mode, guest ring 0 and ring 3 code executes directly on physical silicon at full hardware speed, and neither KVM nor the VMM interprets any instructions. This is what the efficiency property in Popek and Goldberg's theorem demands, and hardware-assisted virtualization delivers it: most guest instructions never involve the VMM at all.

Emulation happens when a VM exit occurs and the VMM must act before re-entering the guest. When the guest's kernel driver issues an OUT instruction to a port that maps to a virtual UART, the hardware exits to VMX root mode and KVM delivers KVM_EXIT_IO (value 2) to the VMM userspace process. The VMM reads the port, direction, and size from kvm_run.io, performs whatever logic the device model requires (in Firecracker's case, forwarding bytes to the 16550A serial emulation), and calls ioctl(vcpu_fd, KVM_RUN, 0) again. The guest never knows the UART is a data structure in the VMM's heap.

sequenceDiagram
  participant G as "Guest (VMX non-root)"
  participant K as "KVM (kernel)"
  participant V as "VMM userspace (firecracker)"

  G->>K: OUT to serial port (VM exit)
  K->>K: Check exit reason
  alt KVM handles in-kernel
    K->>G: VMRESUME
  else requires userspace
    K->>V: KVM_RUN returns, exit_reason = KVM_EXIT_IO
    V->>V: 16550A device model
    V->>K: ioctl(vcpu_fd, KVM_RUN, 0)
    K->>G: VMRESUME
  end

EPT violations that KVM can resolve, such as a first access to a mapped but not yet faulted-in page, are handled in the host page-fault path without the VMM process being involved. Each exit that does surface to userspace carries the full cost of the VMM round-trip: VMCS save/restore, ring transition, exit handler dispatch, and re-entry.

The exits that do reach userspace are the ones that constitute the device model. A traditional VMM like qemu-system-x86_64 emulates dozens of virtual devices, each with its own I/O port ranges and MMIO regions; every device access by the guest is a VM exit, a round-trip through the exit path, and a return to the guest. Firecracker reduces this to the smallest device set that serverless workloads require — virtio block, virtio net, virtio vsock, virtio rng, a 16550A serial console, an i8042 keyboard controller stub, and a balloon device — and devices that are not present cannot generate exits. The device count is a directly tunable multiplier on VMM overhead. Chapter 3 returns to what that overhead actually costs.

Sources And Further Reading

Popek, G. J., and Goldberg, R. P., "Formal Requirements for Virtualizable Third Generation Architectures," Communications of the ACM 17(7), pp. 412–421, July 1974. The paper establishing the three VMM properties and the trap-and-emulate theorem. https://dl.acm.org/doi/10.1145/361011.361073
Wikipedia treatment of the Popek-Goldberg requirements with original citation and theorem text: https://en.wikipedia.org/wiki/Popek_and_Goldberg_virtualization_requirements
"The Morning Paper" summary of the Popek-Goldberg paper: https://blog.acolyer.org/2016/02/19/formal-requirements-for-virtualizable-third-generation-architectures/
Bugnion, E., Devine, S., Govil, K., and Rosenblum, M., "Bringing Virtualization to the x86 Architecture with the Original VMware Workstation," ACM Transactions on Computer Systems, 2012. The definitive primary source on binary translation and the 17 sensitive-but-not-privileged IA-32 instructions. https://dl.acm.org/doi/pdf/10.1145/2382553.2382554
Barham, P., et al., "Xen and the Art of Virtualization," ACM SOSP 2003. Introduces paravirtualization and the hypercall model. https://www.cl.cam.ac.uk/research/srg/netos/papers/2003-xensosp.pdf
Xen hypercall table reference (__HYPERVISOR_* constants from xen.h): https://xenbits.xen.org/docs/unstable/hypercall/x86_64/include,public,xen.h.html
POPF/PUSHF behavior by privilege level: https://www.felixcloutier.com/x86/popf:popfd:popfq
LWN article on UMIP (CR4.UMIP): https://lwn.net/Articles/721957/
SGDT/SIDT information-leak discussion: http://hypervsir.blogspot.com/2014/10/kernel-information-leak-with.html
x86 virtualization history (VT-x ship date November 14, 2005; AMD-V ship date May 23, 2006): https://en.wikipedia.org/wiki/X86_virtualization
Intel SDM Volume 3C (VMX): https://cdrdv2-public.intel.com/671506/326019-sdm-vol-3c.pdf
VMLAUNCH/VMRESUME instruction reference: https://www.felixcloutier.com/x86/vmlaunch:vmresume
IA32_VMX_BASIC MSR field layout (revision identifier, region size, memory type): https://scrapbox.io/vmm/IA32_VMX_BASIC
VMX instruction opcode table: https://en.wikipedia.org/wiki/List_of_x86_virtualization_instructions
Linux kernel VMCS field encodings (arch/x86/include/asm/vmx.h): https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/vmx.h
Linux kernel VMX exit reason codes (arch/x86/include/uapi/asm/vmx.h): https://github.com/torvalds/linux/blob/master/arch/x86/include/uapi/asm/vmx.h
AMD APM Volume 2 (doc 24593) — the authoritative SVM / VMCB reference: https://www.amd.com/system/files/TechDocs/24593.pdf
Second Level Address Translation (EPT / NPT): https://en.wikipedia.org/wiki/Second_Level_Address_Translation
Avi Kivity, "KVM: the Linux Virtual Machine Monitor," OLS 2007: https://www.kernel.org/doc/ols/2007/ols2007v1-pages-225-230.pdf
KVM merged into Linux 2.6.20 — LWN history: https://lwn.net/Articles/705160/
KVM API documentation (canonical): https://docs.kernel.org/virt/kvm/api.html
"Using the KVM API," LWN.net — walkthrough of the three-level fd hierarchy with annotated C code: https://lwn.net/Articles/658511/
Linux UAPI kvm.h (exit reason values, kvm_run struct): https://github.com/torvalds/linux/blob/master/include/uapi/linux/kvm.h
KVM paravirt MSR documentation (MSR_KVM_STEAL_TIME, MSR_KVM_SYSTEM_TIME_NEW, CPUID leaf 0x40000001): https://www.kernel.org/doc/html/latest/virt/kvm/x86/msr.html
QEMU blog — micro-optimizing KVM VM exits (Spectre retpoline overhead): https://www.qemu.org/2019/11/15/micro-optimizing-kvm-vmexits/
Linux x86 boot protocol (Documentation/arch/x86/boot.rst): https://www.kernel.org/doc/html/latest/arch/x86/boot.html
Firecracker design document: https://github.com/firecracker-microvm/firecracker/blob/main/docs/design.md
OSDev VMX reference: https://wiki.osdev.org/VMX