Appendix D: Glossary

The terms below are ordered by where they live in the stack — hardware extensions first, then address spaces, then device interfaces, then Firecracker-specific mechanisms. Within each entry, the claim about a register, field, ioctl, or version matches the primary source cited at the entry's end.

VMX and SVM: The Hardware Foundation

The CPU has to solve a specific problem before any of the rest of the stack makes sense: how does the host let guest code run directly on hardware, at full speed, without trusting it? The answer is a new CPU operating mode, introduced independently by Intel and AMD.

VMX (Virtual Machine Extensions) is Intel's hardware virtualization ISA extension, also marketed as Intel VT-x. A processor supports VMX when CPUID.1:ECX.VMX[bit 5] returns 1. To activate it, the system software sets CR4.VMXE (bit 13 of CR4) to 1 and executes the VMXON instruction with a pointer to a 4 KB-aligned VMXON region whose first four bytes hold the VMCS revision identifier, read from IA32_VMX_BASIC[30:0] (MSR 0x480). From that point, the processor operates in VMX operation and the root/non-root distinction becomes meaningful.

SVM (Secure Virtual Machine) is AMD's equivalent, also called AMD-V. Detection: CPUID Fn8000_0001_ECX[bit 2] = 1. Enabling: set EFER.SVME (bit 12 of EFER, MSR 0xC0000080) to 1, then write the host save area physical address to MSR_VM_HSAVE_PA (MSR 0xC0010117). SVM uses the VMRUN instruction where VMX uses VMLAUNCH/VMRESUME, and #VMEXIT where VMX uses "VM exit."

The Linux KVM module's x86 backend handles both. arch/x86/kvm/vmx/vmx.c contains the VMX path; arch/x86/kvm/svm/svm.c contains the SVM path. At boot, kvm_intel or kvm_amd is loaded depending on what CPUID reports.
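
The two detection bits can be checked with plain bit arithmetic once the CPUID output is in hand. The sketch below assumes the caller has already obtained the ECX values (Python cannot execute CPUID itself); the function names are illustrative and are not KVM symbols.

```python
VMX_BIT = 5  # CPUID.1:ECX, bit 5
SVM_BIT = 2  # CPUID Fn8000_0001:ECX, bit 2


def has_vmx(cpuid_1_ecx: int) -> bool:
    """True if the ECX value from CPUID leaf 1 advertises VMX."""
    return bool((cpuid_1_ecx >> VMX_BIT) & 1)


def has_svm(cpuid_8000_0001_ecx: int) -> bool:
    """True if the ECX value from CPUID leaf 0x80000001 advertises SVM."""
    return bool((cpuid_8000_0001_ecx >> SVM_BIT) & 1)


def pick_kvm_module(cpuid_1_ecx: int, cpuid_8000_0001_ecx: int):
    """Return the vendor module name that matches the advertised extension."""
    if has_vmx(cpuid_1_ecx):
        return "kvm_intel"
    if has_svm(cpuid_8000_0001_ecx):
        return "kvm_amd"
    return None
```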

Sources: Intel SDM Vol. 3C; AMD APM Vol. 2 (doc 24593); Linux arch/x86/kvm/svm/svm.c.

Root Mode and Non-Root Mode

VMX doesn't just add instructions; it splits the processor's entire execution context into two operating modes.

VMX root operation (root mode) is where the hypervisor runs. It is entered by VMXON and returned to by every VM exit. All VMX instructions are available in root mode; the processor behaves like a conventional x86 processor with some additional VMX-specific instructions added.

VMX non-root operation (non-root mode, guest mode) is where the guest OS and its applications run. Entered by VMLAUNCH (first entry on a cleared VMCS) or VMRESUME (all subsequent entries). In non-root mode, certain instructions cause unconditional VM exits — they exit regardless of any control settings. These include every VMX instruction, CPUID, GETSEC, INVD, and XSETBV. Other instructions exit only when the corresponding VM-execution control bit is set in the VMCS (for example, HLT if "HLT exiting" is set, or IN/OUT if the I/O bitmap marks that port).

The key insight is that most guest instructions execute identically in root and non-root mode — the hardware enforces boundaries only at the edges where the guest could interfere with host state, not on every instruction.

AMD SVM names the two contexts differently: the code that executes VMRUN, and that regains control on #VMEXIT, runs in host mode; everything the processor executes between VMRUN and #VMEXIT runs in guest mode.

Sources: Intel SDM Vol. 3C; https://docs.kernel.org/virt/kvm/x86/nested-vmx.html.

VMCS: The Control Nerve Center (Intel)

For each logical processor running a guest, Intel needs somewhere to record what the guest is allowed to do, what state the guest is in, and why the most recent VM exit happened. That somewhere is the VMCS, the Virtual Machine Control Structure.

A VMCS is a per-vCPU, hardware-managed data structure at most 4 KB in size. The exact region size is implementation-defined and is reported in IA32_VMX_BASIC[bits 44:32]; the same MSR supplies the VMCS revision identifier in bits 30:0. The VMCS is never addressed via memory offsets the way a normal struct would be. The only architectural access mechanism is VMREAD and VMWRITE, which take 32-bit field encodings defined in the SDM.

The structure has six logical groups:

Group	Role
Guest-state area	Processor state saved on VM exits and loaded on VM entries
Host-state area	Processor state loaded on VM exits (the VMM's landing pad)
VM-execution control fields	What the guest is allowed to do in non-root mode
VM-exit control fields	How exits are handled
VM-entry control fields	How entries are performed
VM-exit information fields	Read-only; records the cause of the most recent exit

The first four bytes of the VMCS in memory hold the VMCS revision identifier (bits 30:0); bit 31 = 1 marks a shadow VMCS. Bytes 4–7 hold the VMX-abort indicator, which is non-zero if a VMX abort (a catastrophic hypervisor error during a VMX transition) occurred.

To make a VMCS active on a logical processor, the VMM issues VMPTRLD with the VMCS physical address. VMPTRLD fails if the revision identifier does not match IA32_VMX_BASIC[30:0]. Once loaded, the VMCS is the "current VMCS" on that processor until another VMPTRLD or VMCLEAR replaces it.
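
The revision check can be modeled in a few lines. This is a simplified sketch: it decodes only the two IA32_VMX_BASIC fields named above and compares the revision identifier the way VMPTRLD is described as doing, ignoring every other failure condition. The MSR value in the usage note is made up for illustration.

```python
import struct


def decode_vmx_basic(msr: int) -> dict:
    """Extract the revision identifier (bits 30:0) and region size (bits 44:32)."""
    return {
        "revision_id": msr & 0x7FFF_FFFF,
        "region_size": (msr >> 32) & 0x1FFF,
    }


def make_vmcs_header(revision_id: int, shadow: bool = False) -> bytes:
    """Build the first 8 bytes of a VMCS region: revision word, then abort indicator."""
    word0 = (revision_id & 0x7FFF_FFFF) | (0x8000_0000 if shadow else 0)
    return struct.pack("<II", word0, 0)  # abort indicator starts at zero


def revision_matches(region: bytes, msr: int) -> bool:
    """Model only the revision-identifier comparison, not the full VMPTRLD checks."""
    word0, _abort = struct.unpack_from("<II", region, 0)
    return (word0 & 0x7FFF_FFFF) == decode_vmx_basic(msr)["revision_id"]
```

With a hypothetical MSR value of `(0x1000 << 32) | 0x12`, `decode_vmx_basic` reports a 4096-byte region and revision 0x12, and a header built with `make_vmcs_header(0x12)` passes `revision_matches`.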

KVM uses a notation worth knowing for nested virtualization: vmcs01 is L0's (the real hypervisor's) VMCS for running L1 (the guest hypervisor); vmcs12 is the VMCS structure that L1 constructs for its own guest L2; vmcs02 is the VMCS that L0 synthesizes to actually run L2 on the hardware.

Sources: Intel SDM Vol. 3C; https://docs.kernel.org/virt/kvm/x86/nested-vmx.html.

VMCB: The AMD Equivalent

AMD's per-guest-vCPU structure is the VMCB, Virtual Machine Control Block. Like the VMCS it is 4 KB aligned; unlike the VMCS it is an ordinary memory structure addressable by offset. VMRUN takes the VMCB's physical address as its operand, loads guest state from it, and executes the guest. On #VMEXIT, the processor saves guest state back to the VMCB and restores host state from the physical address in VM_HSAVE_PA.

The VMCB has two major areas:

Area	Offsets	Contents
Control area	`0x000`–`0x3FF`	Intercept bits, event injection, NPT enable, `nCR3`, ASID, TLB control
State save area	`0x400`–`0xFFF`	Guest register state (CS, SS, GDTR, IDTR, CR0–CR4, EFER, RAX, RSP, RIP, RFLAGS, and more)

The intercept bits in the control area are AMD's equivalent of Intel's VM-execution control fields: each bit makes the corresponding instruction or event cause a #VMEXIT. The EXITCODE field in the control area (at offset 0x070 within the VMCB) records the cause of the most recent #VMEXIT; AMD prefixes these with VMEXIT_ (for example, VMEXIT_NPF for a nested-paging fault).

Sources: AMD APM Vol. 2 (doc 24593); https://www.0x04.net/doc/amd/33047.pdf.

VM-Entry and VM-Exit

A VM-entry is the hardware transition from root operation (or from host mode in SVM) to non-root operation (guest mode). On Intel, the first entry on a fresh VMCS uses VMLAUNCH; all subsequent ones use VMRESUME. The hardware does not save host state at this point: the VMM has already written the VMCS host-state area with VMWRITE, and those values are consumed only on the next VM exit. The entry loads guest state from the VMCS guest-state area and begins executing at the guest RIP recorded there. On AMD, VMRUN performs entry with no first/subsequent distinction, and it does save host state, to the host save area named by VM_HSAVE_PA.

A VM-exit is the reverse: the hardware transitions from non-root operation back to the VMM. Every exit saves guest state into the VMCS guest-state area (or VMCB state save area), loads host state, and resumes the VMM at the host RIP recorded in the host-state area. The VMM then inspects the exit reason to decide what to do.

Intel records the exit reason in the VMCS "VM-exit reason" field, which is 32 bits wide with the basic exit reason in bits 15:0. A representative set from Intel SDM Table C-1:

Basic Reason	Value	Trigger
Exception or NMI	0	Vectored exception matching the exception bitmap
External interrupt	1	Physical interrupt, if "external-interrupt exiting" is set
CPUID	10	Guest executes `CPUID` (unconditional in non-root mode)
HLT	12	Guest executes `HLT`, if "HLT exiting" is set
CR access	28	Guest accesses a control register under interception
I/O instruction	30	`IN`/`OUT` matching the I/O bitmap
RDMSR	31	Guest `RDMSR` matching the MSR bitmap
WRMSR	32	Guest `WRMSR` matching the MSR bitmap
EPT violation	48	GPA access with no valid EPT mapping or a permission violation
EPT misconfiguration	49	EPT entry with an architecturally invalid encoding
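
Decoding the field is a mask and a lookup. The sketch below covers only the rows in the table above; the symbolic names are labels chosen for this example rather than SDM or KVM identifiers. Bit 31 of the field, which the SDM defines as the VM-entry failure flag, is reported separately.

```python
BASIC_EXIT_REASONS = {
    0: "EXCEPTION_OR_NMI",
    1: "EXTERNAL_INTERRUPT",
    10: "CPUID",
    12: "HLT",
    28: "CR_ACCESS",
    30: "IO_INSTRUCTION",
    31: "RDMSR",
    32: "WRMSR",
    48: "EPT_VIOLATION",
    49: "EPT_MISCONFIGURATION",
}


def decode_exit_reason(field: int):
    """Split the 32-bit exit-reason field into (basic reason, name, entry-failure flag)."""
    basic = field & 0xFFFF                    # bits 15:0
    entry_failure = bool((field >> 31) & 1)   # bit 31
    return basic, BASIC_EXIT_REASONS.get(basic, "UNLISTED"), entry_failure
```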

KVM surfaces these to userspace via the kvm_run.exit_reason field of struct kvm_run. After KVM_RUN returns to the VMM, exit_reason holds one of the KVM_EXIT_* constants (KVM_EXIT_IO, KVM_EXIT_MMIO, KVM_EXIT_HLT, and so on), and the rest of struct kvm_run carries the exit-specific data.

sequenceDiagram
    participant G as Guest (non-root mode)
    participant H as Hardware
    participant K as KVM (root mode)
    participant F as VMM userspace

    F->>K: KVM_RUN ioctl on vcpu fd
    K->>H: VMLAUNCH / VMRESUME
    H->>G: execute guest instructions
    G->>H: controlled instruction / event
    H->>K: VM-exit (saves guest state, loads host state)
    K->>K: handle exit in kernel (e.g. emulate MSR)
    alt KVM can handle in kernel
        K->>H: VMRESUME (re-enter guest)
        H->>G: continue guest
    else KVM returns to userspace
        K->>F: KVM_RUN returns, kvm_run.exit_reason set
        F->>F: emulate device, inject interrupt, etc.
        F->>K: KVM_RUN again
    end

AMD uses the VMCB EXITCODE field and the #VMEXIT notation. The NPT fault exit code is VMEXIT_NPF, the AMD equivalent of Intel's EPT violation (reason 48).

Sources: Intel SDM Vol. 3C, Table C-1; AMD APM Vol. 2; https://docs.kernel.org/virt/kvm/api.html.

EPT and NPT: Two-Level Address Translation

A guest OS that thinks it controls physical memory is actually manipulating guest-physical addresses, not the real machine's physical addresses. Without hardware help, the hypervisor would need to intercept every guest page-table write to maintain a shadow mapping — that is expensive and fragile. Both Intel and AMD solved this by adding a second level of hardware page tables that the CPU walks automatically.

EPT (Extended Page Tables) is Intel's second-level address translation (SLAT) mechanism, also called two-dimensional paging (2D paging). Activating it requires setting the "Enable EPT" bit in the secondary processor-based VM-execution controls in the VMCS. The EPTP field in the VMCS (Extended-Page-Table Pointer) points to the EPT PML4 root. EPTP[bits 5:3] encodes (levels − 1): value 3 means 4-level EPT; value 4 means 5-level EPT (5-level EPT support is reported in IA32_VMX_EPT_VPID_CAP[bit 7]; 4-level in [bit 6]). An access to a guest-physical address that has no valid EPT mapping, or that violates the entry's permission bits, causes an EPT violation VM-exit (reason 48).
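
The EPTP encoding can be illustrated with a small builder and decoder. Beyond the walk-length field described above, this sketch also sets the memory type in bits 2:0 (6 = write-back) and the accessed/dirty enable in bit 6, which come from the SDM's EPTP layout rather than from this entry. The root address is a made-up value.

```python
def make_eptp(root_hpa: int, levels: int = 4, memtype: int = 6, ad_flags: bool = False) -> int:
    """Compose an EPTP value from a 4 KB-aligned root table address."""
    if root_hpa & 0xFFF:
        raise ValueError("EPT root must be 4 KB aligned")
    if levels not in (4, 5):
        raise ValueError("only 4-level and 5-level EPT are defined")
    eptp = root_hpa
    eptp |= memtype & 0x7        # bits 2:0, paging-structure memory type
    eptp |= (levels - 1) << 3    # bits 5:3, page-walk length minus one
    if ad_flags:
        eptp |= 1 << 6           # bit 6, enable accessed/dirty flags
    return eptp


def decode_eptp(eptp: int) -> dict:
    """Recover the fields written by make_eptp."""
    return {
        "memtype": eptp & 0x7,
        "levels": ((eptp >> 3) & 0x7) + 1,
        "ad_flags": bool((eptp >> 6) & 1),
        "root_hpa": eptp & 0x000F_FFFF_FFFF_F000,
    }
```

For a hypothetical root at 0x1234000, `make_eptp(0x1234000)` yields 0x123401E: write-back memory type (6) plus the value 3 in bits 5:3 for a 4-level walk.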

NPT (Nested Page Tables) is AMD's equivalent, also called AMD-RVI (Rapid Virtualization Indexing). Enable it by setting the "Enable Nested Paging" bit in the VMCB control area and supplying a valid nCR3 pointer in that same control area. An access that lacks a valid NPT mapping causes a #VMEXIT with exit code VMEXIT_NPF. One capability difference from Intel EPT: NPT does not support execute-only page permissions.

KVM calls both mechanisms tdp (two-dimensional paging). When tdp is active, KVM's MMU operates in "direct mode," meaning KVM's own page tables directly represent the GPA-to-HPA mapping rather than shadowing the guest's GVA-to-GPA tables.

Sources: Intel SDM Vol. 3C; Intel 5-Level Paging white paper (doc 335252-002); AMD APM Vol. 2; https://kernel.org/doc/Documentation/virtual/kvm/mmu.txt.

vCPU: The Guest's Logical Processor

A vCPU (virtual CPU) is the software abstraction of one logical processor for a guest VM. In KVM, each vCPU is backed by a struct kvm_vcpu (with an architecture-specific extension at vcpu->arch) and is driven by a dedicated host thread. This is a constraint on which thread issues the ioctls, not on which physical CPU that thread runs on. The KVM documentation is explicit: "vcpu ioctls should be issued from the same thread that was used to create the vcpu."

Creating a usable vCPU takes one ioctl followed by one mmap:

KVM_CREATE_VCPU = _IO(KVMIO, 0x41), where KVMIO = 0xAE. Issued on the VM file descriptor; the argument is the vCPU ID (which maps to the APIC ID on x86). Returns a new vCPU file descriptor.
mmap() on the vCPU file descriptor at offset 0, with the length returned by the KVM_GET_VCPU_MMAP_SIZE ioctl (issued on the /dev/kvm system file descriptor), to obtain the struct kvm_run region: the shared memory area through which KVM and the VMM exchange exit information and guest state after each KVM_RUN.

Running the vCPU: KVM_RUN = _IO(KVMIO, 0x80), issued on the vCPU file descriptor with no arguments. The call blocks until a VM exit that KVM cannot handle in the kernel, then returns with kvm_run.exit_reason set.

Firecracker spawns exactly one OS thread per vCPU. The threads start in a Paused state and are resumed via resume_vm() in src/vmm/src/builder.rs once the full VM configuration is complete.

Sources: https://docs.kernel.org/virt/kvm/api.html; Linux include/uapi/linux/kvm.h; Firecracker src/vmm/src/builder.rs.

GPA, HPA, and HVA: Three Address Spaces

Any KVM-based virtual machine involves three distinct address spaces, and keeping them straight is the prerequisite for understanding everything that follows.

GPA (Guest Physical Address) is the address in the guest's physical address space — what the guest OS's page tables ultimately resolve to, and what the guest's memory controller bus would see. The guest believes these are hardware physical addresses.

HPA (Host Physical Address) is the actual machine physical address — what appears on the memory bus of the real hardware. Managed by the host kernel via its own page tables.

HVA (Host Virtual Address) is an address in the VMM userspace process's virtual address space. KVM maps a guest's GPA range to an HVA range via memslots; the host kernel's page tables then resolve HVA to HPA.

There is a fourth space worth naming once: GVA (Guest Virtual Address), the guest linear address before the guest's own page-table walk. The complete translation chain with EPT or NPT active is:

GVA --[guest PT walk]--> GPA --[EPT/NPT hardware walk]--> HPA

KVM's memslot mechanism is what connects GPA to HVA. A memslot is registered with:

KVM_SET_USER_MEMORY_REGION = _IOW(KVMIO, 0x46, struct kvm_userspace_memory_region)

The struct's guest_phys_addr field gives the GPA start; userspace_addr gives the HVA start; memory_size gives the region length. KVM uses this table to translate faulting GPAs (arriving in EPT/NPT violations) to the HVA that the VMM process has backed with mmap.
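
A memslot table and its GPA-to-HVA lookup can be modeled directly. In this sketch the field order and the 32-byte size follow struct kvm_userspace_memory_region in include/uapi/linux/kvm.h; the `Memslot` class, the linear search, and the addresses in the usage note are illustrative and say nothing about how KVM organizes slots internally.

```python
import struct
from typing import List, NamedTuple, Optional


class Memslot(NamedTuple):
    slot: int
    flags: int
    guest_phys_addr: int  # GPA start
    memory_size: int      # length in bytes
    userspace_addr: int   # HVA start

    def pack(self) -> bytes:
        """Serialize in the uapi field order: u32, u32, u64, u64, u64."""
        return struct.pack("=IIQQQ", *self)


def gpa_to_hva(slots: List[Memslot], gpa: int) -> Optional[int]:
    """Return the HVA backing a GPA, or None if no slot covers it."""
    for s in slots:
        offset = gpa - s.guest_phys_addr
        if 0 <= offset < s.memory_size:
            return s.userspace_addr + offset
    return None
```

With a single hypothetical slot `Memslot(0, 0, 0x0, 0x1000_0000, 0x7F00_0000_0000)`, GPA 0x2000 resolves to HVA 0x7F00_0000_2000, and a GPA at or beyond 0x1000_0000 resolves to None, which is the case a VMM would treat as MMIO.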

The genuine confusion point: when KVM documentation or Firecracker source says "the guest's physical memory," it means memory the VMM allocated with mmap in its own process (HVA), registered with KVM_SET_USER_MEMORY_REGION, and accessible to the guest at the GPA range the VMM chose. The guest never sees HVAs. The VMM never sees HPAs. Only the host kernel, which fills in the EPT/NPT entries from the memslot table and its own page tables, works with all three.

Sources: https://kernel.org/doc/Documentation/virtual/kvm/mmu.txt; https://docs.kernel.org/virt/kvm/api.html; https://docs.kernel.org/virt/kvm/x86/mmu.html.

Virtqueue: The Data Path Between Guest and Device

A virtqueue is the data-path abstraction between a virtio driver running in the guest and a virtio device implemented in the VMM (or in the host kernel). Every virtio device exposes one or more virtqueues; each queue is an independent bidirectional buffer-passing channel. The network device (virtio-net) uses two: one for transmit, one for receive. The block device (virtio-blk) uses one for request/response pairs.

The canonical format is the split virtqueue (virtio 1.2 spec section 2.7). The guest driver allocates three memory regions at GPA addresses and registers them with the device:

Region	Written by	Alignment
Descriptor Table	Driver	16 bytes
Available Ring	Driver	2 bytes
Used Ring	Device	4 bytes

Each descriptor is a struct virtq_desc (16 bytes): le64 addr (GPA of the buffer), le32 len, le16 flags, le16 next. The flags control chaining (VIRTQ_DESC_F_NEXT = 1), write direction (VIRTQ_DESC_F_WRITE = 2, meaning the device writes to this buffer), and indirection (VIRTQ_DESC_F_INDIRECT = 4).
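
A device-side walk over a descriptor chain might look like the following sketch. It treats the descriptor table as a plain byte buffer, ignores indirect descriptors, and bounds the walk by the queue size to guard against looping chains; the function name and return shape are choices made for this example.

```python
import struct

VIRTQ_DESC_F_NEXT = 1
VIRTQ_DESC_F_WRITE = 2
VIRTQ_DESC_F_INDIRECT = 4

# le64 addr, le32 len, le16 flags, le16 next: 16 bytes per descriptor
DESC = struct.Struct("<QIHH")


def walk_chain(table: bytes, head: int, queue_size: int):
    """Return [(gpa, length, device_writable), ...] for the chain starting at head."""
    chain = []
    idx = head
    for _ in range(queue_size):  # a valid chain has at most queue_size links
        addr, length, flags, nxt = DESC.unpack_from(table, idx * DESC.size)
        chain.append((addr, length, bool(flags & VIRTQ_DESC_F_WRITE)))
        if not flags & VIRTQ_DESC_F_NEXT:
            return chain
        if nxt >= queue_size:
            raise ValueError("next index outside the descriptor table")
        idx = nxt
    raise ValueError("descriptor chain does not terminate")
```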

The available ring (struct virtq_avail) carries head indices of descriptor chains the driver has prepared for the device. The used ring (struct virtq_used) carries (id, len) pairs back from the device when it finishes a chain. Both idx fields are 16-bit unsigned and wrap naturally at 2^16, so there is no end-of-ring sentinel: the driver compares its local shadow of used.idx against the ring's current value to find new completions.

Queue size must be a power of two; the maximum is 32,768 (virtio 1.2 spec section 2.7.1). Firecracker caps every queue at 256.
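
The index arithmetic is easy to get wrong at the wrap point, so a worked sketch helps. The helpers below are illustrative: they count new completions with 16-bit modular subtraction, map the free-running indices onto ring slots, and check a queue size against the power-of-two rule and a cap.

```python
MAX_QUEUE_SIZE = 32768


def valid_queue_size(n: int, cap: int = MAX_QUEUE_SIZE) -> bool:
    """A queue size must be a non-zero power of two no larger than the cap."""
    return 0 < n <= cap and (n & (n - 1)) == 0


def pending_completions(last_seen: int, used_idx: int) -> int:
    """Entries added to the used ring since last_seen, modulo 2^16."""
    return (used_idx - last_seen) & 0xFFFF


def completed_slots(last_seen: int, used_idx: int, queue_size: int):
    """Ring slots holding the new used elements, in order."""
    count = pending_completions(last_seen, used_idx)
    return [((last_seen + i) & 0xFFFF) % queue_size for i in range(count)]
```

If the driver last saw index 65534 and the device has advanced used.idx to 2, four entries are pending, and in a 256-entry ring they sit in slots 254, 255, 0, and 1.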

Sources: OASIS virtio 1.2 CS01, sections 2.7.1, 2.7.5, 2.7.6, 2.7.8; https://docs.oasis-open.org/virtio/virtio/v1.2/cs01/virtio-v1.2-cs01.html.

Virtio-MMIO: The Bus-Free Transport

Virtio-MMIO is the memory-mapped I/O transport for virtio devices. Where virtio-PCI sits behind a PCI bus (with all the discovery, enumeration, and interrupt routing that entails), virtio-MMIO uses a flat set of 32-bit registers starting at a device-specific base address in guest-physical space. Firecracker uses virtio-MMIO exclusively — there is no PCI bus in a Firecracker microVM by default.

Because MMIO has no self-describing discovery mechanism, Firecracker tells the guest about each device by appending virtio_mmio.device=<size>@<baseaddr>:<irq> entries to the kernel command line. The guest kernel must have CONFIG_VIRTIO_MMIO=y and CONFIG_VIRTIO_MMIO_CMDLINE_DEVICES=y compiled in.
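
Building those command-line entries is simple string formatting. The helper below is a sketch with a hypothetical name, and the size, base address, and IRQ values in the usage note are placeholders, not Firecracker's actual memory layout.

```python
def virtio_mmio_cmdline(devices) -> str:
    """Format (size, base_addr, irq) tuples as virtio_mmio.device= entries.

    size is a string the kernel's size parser accepts, such as "4K".
    """
    return " ".join(
        f"virtio_mmio.device={size}@{base:#x}:{irq}"
        for size, base, irq in devices
    )
```

For example, `virtio_mmio_cmdline([("4K", 0xD0000000, 5), ("4K", 0xD0001000, 6)])` produces `virtio_mmio.device=4K@0xd0000000:5 virtio_mmio.device=4K@0xd0001000:6`.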

The essential registers, from virtio 1.2 spec section 4.2.2:

Offset	Register	Dir	Key detail
`0x000`	MagicValue	RO	Must read `0x74726976` ("virt" in little-endian ASCII)
`0x004`	Version	RO	1 = legacy; 2 = modern (virtio 1.x)
`0x008`	DeviceID	RO	Device type: 1=net, 2=blk, 4=rng, 5=balloon, 19=vsock
`0x010`	DeviceFeatures	RO	32-bit feature page (paged via `DeviceFeaturesSel` at `0x014`)
`0x020`	DriverFeatures	WO	Feature bits accepted by the driver
`0x030`	QueueSel	WO	Selects which virtqueue to configure
`0x034`	QueueNumMax	RO	Maximum queue size for the selected queue
`0x038`	QueueNum	WO	Actual queue size chosen by driver
`0x044`	QueueReady	RW	Write 1 to mark queue live
`0x050`	QueueNotify	WO	Write queue index to notify device of new buffers
`0x060`	InterruptStatus	RO	Bit 0: used-buffer notification; bit 1: config change
`0x064`	InterruptACK	WO	Driver writes same bits to acknowledge
`0x070`	Status	RW	Device status byte
`0x080`/`0x084`	QueueDescLow/High	WO	Descriptor Table GPA (64-bit, split across two 32-bit writes)
`0x090`/`0x094`	QueueDriverLow/High	WO	Available Ring GPA
`0x0a0`/`0x0a4`	QueueDeviceLow/High	WO	Used Ring GPA
`0x0fc`	ConfigGeneration	RO	Incremented on every config space change
`0x100+`	Config	RW	Device-specific configuration space

Firecracker hardcodes Version = 2 (modern). The guest must negotiate VIRTIO_F_VERSION_1 (feature bit 32) during the feature negotiation handshake or the device will refuse to become live.
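
Two details of the register block lend themselves to a quick check: the magic value really is the ASCII string "virt" when stored little-endian, and feature bit 32 is bit 0 of the second 32-bit page (selector 1). The sketch below models the register window as a byte buffer; `probe` and `feature_page` are illustrative names.

```python
import struct

VIRTIO_MMIO_MAGIC = 0x74726976
VIRTIO_F_VERSION_1 = 32


def probe(regs: bytes) -> int:
    """Validate MagicValue and Version, then return DeviceID."""
    magic, version, device_id = struct.unpack_from("<III", regs, 0x000)
    if magic != VIRTIO_MMIO_MAGIC:
        raise ValueError("not a virtio-mmio device")
    if version != 2:
        raise ValueError("only the modern (version 2) interface is handled here")
    return device_id


def feature_page(features: int, sel: int) -> int:
    """The 32-bit window the device exposes for a given DeviceFeaturesSel."""
    return (features >> (32 * sel)) & 0xFFFF_FFFF
```

A driver reading features therefore needs two passes: selector 0 for bits 31:0 and selector 1 for bits 63:32, where VIRTIO_F_VERSION_1 appears as bit 0.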

Guest writes to QueueNotify are how the driver tells the device that new descriptor chains are available. That write hits the virtio-MMIO register range, which Firecracker has registered as an MMIO trap, producing a KVM_EXIT_MMIO exit to userspace. The exception is when an ioeventfd is registered for that address, in which case KVM handles the write in the kernel without a userspace round-trip.

Sources: OASIS virtio 1.2 CS01, section 4.2.2; Linux include/uapi/linux/virtio_mmio.h; Firecracker src/vmm/src/devices/virtio/transport/mmio.rs.

irqfd and ioeventfd: Eliminating the Round-Trip

The naive virtio implementation has a VM exit on every queue notification and every interrupt injection. For a network device handling thousands of packets per second, that cost is substantial. KVM provides two mechanisms that short-circuit the most common paths.

ioeventfd binds a guest MMIO or PIO write to a host eventfd. The VMM registers the binding with KVM_IOEVENTFD (issued on the VM file descriptor), specifying the GPA (or port) and the eventfd file descriptor. When the guest writes to QueueNotify at offset 0x050, KVM signals the eventfd directly — still in the kernel, without returning to userspace. The VMM's worker thread, sleeping in epoll_wait on that eventfd, wakes up and processes the virtqueue. No KVM_RUN return, no context switch to the VMM, no kvm_run decoding.

irqfd is the reverse: it binds a host eventfd to a guest interrupt (GSI). The VMM registers the binding with KVM_IRQFD (also on the VM file descriptor). When a host component — say, the TAP device's receive path after draining a packet into a virtqueue used ring — writes to that eventfd, KVM injects the corresponding interrupt into the guest without any VMM userspace involvement. The eventfd write, the interrupt injection, and the VM re-entry all happen in kernel space.

sequenceDiagram
    participant G as Guest driver
    participant K as KVM kernel
    participant W as VMM worker thread
    participant T as TAP / host network

    G->>K: write QueueNotify (MMIO at 0x050)
    K->>K: ioeventfd: signal eventfd (no VM exit to userspace)
    K->>W: eventfd readable (epoll wakeup)
    W->>W: process virtqueue descriptors
    W->>T: send packets via TAP write
    T->>K: receive path: signal irqfd eventfd
    K->>K: inject virtio-net interrupt into guest
    K->>G: guest interrupt handler runs

The VMM userspace thread is in the fast path for packet processing but never for the notification signaling. That asymmetry is what makes virtio usable at network speeds.
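
The two registrations described above each pass a small fixed-layout struct to KVM. The sketch below packs them according to the layouts in include/uapi/linux/kvm.h (struct kvm_ioeventfd is 64 bytes, struct kvm_irqfd is 32 bytes) and derives the request numbers; it does not issue the ioctls, and the device base address, file descriptors, and GSI in the usage note are placeholders.

```python
import struct

KVMIO = 0xAE


def _IOW(type_: int, nr: int, size: int) -> int:
    """Generic Linux encoding for an ioctl that copies data into the kernel."""
    return (1 << 30) | (size << 16) | (type_ << 8) | nr


# u64 datamatch, u64 addr, u32 len, s32 fd, u32 flags, 36 bytes of padding
IOEVENTFD = struct.Struct("=QQIiI36x")
# u32 fd, u32 gsi, u32 flags, u32 resamplefd, 16 bytes of padding
IRQFD = struct.Struct("=IIII16x")

KVM_IRQFD = _IOW(KVMIO, 0x76, IRQFD.size)
KVM_IOEVENTFD = _IOW(KVMIO, 0x79, IOEVENTFD.size)

QUEUE_NOTIFY_OFFSET = 0x050


def pack_ioeventfd(device_base: int, eventfd: int) -> bytes:
    """Bind any 4-byte write to the device's QueueNotify register to eventfd."""
    return IOEVENTFD.pack(0, device_base + QUEUE_NOTIFY_OFFSET, 4, eventfd, 0)


def pack_irqfd(eventfd: int, gsi: int) -> bytes:
    """Bind eventfd to a guest interrupt line."""
    return IRQFD.pack(eventfd, gsi, 0, 0)
```

A VMM would pass each packed buffer to ioctl on the VM file descriptor, for instance `pack_ioeventfd(0xD0000000, 7)` with KVM_IOEVENTFD and `pack_irqfd(8, 5)` with KVM_IRQFD.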

Sources: https://docs.kernel.org/virt/kvm/api.html (KVM_IRQFD, KVM_IOEVENTFD); Linux include/uapi/linux/kvm.h.

MMIO and PIO: Two I/O Address Spaces

x86 exposes device registers through two distinct address spaces, and a VMM needs to handle both.

MMIO (Memory-Mapped I/O) maps device registers into the guest-physical address space. A guest load or store to an MMIO address produces an EPT/NPT fault (or a shadow-paging page-not-present fault) and a VM exit with reason 48 (EPT violation) or its AMD equivalent. KVM decodes the access and sets kvm_run.exit_reason = KVM_EXIT_MMIO, with kvm_run.mmio.phys_addr and kvm_run.mmio.data filled in. The VMM decodes the GPA into a device register and emulates the read or write. Firecracker uses MMIO for all virtio device registers via the virtio-MMIO transport.

PIO (Port I/O, x86-only) uses a 16-bit I/O port address space separate from physical memory, accessed by the guest with IN and OUT instructions. A guest IN/OUT to a port listed in the VMCS I/O bitmap causes a VM exit with reason 30. KVM returns KVM_EXIT_IO with kvm_run.io.port, kvm_run.io.direction (KVM_EXIT_IO_IN or KVM_EXIT_IO_OUT), and kvm_run.io.size. The VMM dispatches the port to the appropriate emulated register. Firecracker uses PIO for the serial console at port 0x3f8 (COM1) via 8250 UART emulation.

In Firecracker, both MMIO and PIO emulation dispatch through the BusDevice trait in src/vmm/src/devices/bus.rs, which presents a uniform read() / write() interface. The VMM event loop checks kvm_run.exit_reason after each KVM_RUN return and routes KVM_EXIT_IO and KVM_EXIT_MMIO exits to the appropriate BusDevice handler.
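
The dispatch pattern can be modeled in a few lines. This is a Python sketch of the idea only, not a translation of Firecracker's Rust code: `Bus`, `Scratch`, and the all-ones reply for unclaimed reads are assumptions made for the example, and overlapping ranges are not checked.

```python
import bisect


class Scratch:
    """Toy device: a block of byte-addressable registers."""

    def __init__(self, size: int):
        self.regs = bytearray(size)

    def read(self, offset: int, size: int) -> bytes:
        return bytes(self.regs[offset:offset + size])

    def write(self, offset: int, data: bytes) -> None:
        self.regs[offset:offset + len(data)] = data


class Bus:
    """Route an address (MMIO GPA or PIO port) to the device that owns it."""

    def __init__(self):
        self._bases = []   # sorted range start addresses
        self._ranges = []  # (base, size, device), same order as _bases

    def register(self, base: int, size: int, device) -> None:
        i = bisect.bisect_left(self._bases, base)
        self._bases.insert(i, base)
        self._ranges.insert(i, (base, size, device))

    def _lookup(self, addr: int):
        i = bisect.bisect_right(self._bases, addr) - 1
        if i >= 0:
            base, size, device = self._ranges[i]
            if addr < base + size:
                return device, addr - base
        return None

    def read(self, addr: int, size: int) -> bytes:
        hit = self._lookup(addr)
        if hit is None:
            return b"\xff" * size  # unclaimed address
        device, offset = hit
        return device.read(offset, size)

    def write(self, addr: int, data: bytes) -> bool:
        hit = self._lookup(addr)
        if hit is None:
            return False
        device, offset = hit
        device.write(offset, data)
        return True
```

A VMM loop built on this model would keep one `Bus` for port numbers and one for guest-physical addresses, calling `read` or `write` with the address and data taken from the exit.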

Sources: Intel SDM Vol. 3C (I/O exit handling, EPT violation); Firecracker src/vmm/src/devices/bus.rs.

MicroVM

A microVM is a hardware virtual machine paired with a minimal VMM: it provides the strong isolation guarantees of a VM — its own guest-physical memory, its own vCPUs, its own virtual devices — while deliberately omitting the firmware stack, PCI bus, ACPI interpreter, option ROMs, BIOS, and full device model that conventional hypervisors supply. The deliberate omissions are not compromises; they are the design. Each removed component is attack surface that does not need to be defended and initialization work that does not need to be done.

Firecracker's definition of microVM is specific: a fixed device model (virtio-net, virtio-blk, virtio-vsock, virtio-balloon, virtio-rng, a serial console, and a real-time clock), no PCI bus by default, no BIOS, and a boot path that uses the Linux boot protocol to jump directly into a kernel image supplied by the operator. The guest kernel must include CONFIG_VIRTIO_MMIO=y and CONFIG_VIRTIO_MMIO_CMDLINE_DEVICES=y. There is no firmware phase at all: the VMM writes boot parameters into guest memory and sets the vCPU's initial registers to the entry point. Time to a running Linux kernel is measured in tens of milliseconds on current hardware.

Other Rust VMMs draw the line differently: Cloud Hypervisor adds optional PCIe support; crosvm targets Chrome OS and adds GPU passthrough.

VMM: Virtual Machine Monitor

The VMM (Virtual Machine Monitor, also called hypervisor) is the software that creates and manages virtual machines. In a KVM-based system the VMM is a userspace process. It opens /dev/kvm to get the system file descriptor, issues KVM_CREATE_VM to get a per-VM file descriptor, maps guest memory with KVM_SET_USER_MEMORY_REGION, creates vCPUs with KVM_CREATE_VCPU, loads the guest kernel image, and then drives the event loop by issuing KVM_RUN on each vCPU file descriptor and handling the exits that KVM returns to userspace.

KVM provides the hardware-assisted execution path; the VMM owns device emulation, memory layout, interrupt routing, and the VM lifecycle (create, pause, snapshot, restore, destroy). Neither alone constitutes a working virtual machine; together they split the responsibilities cleanly.

Firecracker is a VMM. It is a single-purpose Rust process that exposes a REST API over a Unix socket and manages exactly one microVM per process instance. Its device model is fixed at compile time; it does not support runtime device plugins. The REST API surface (PUT /machine-config, PUT /drives, PUT /network-interfaces, PUT /boot-source, PUT /actions {"action_type": "InstanceStart"}) is how an orchestrator builds the machine before it starts.
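
As an illustration of how an orchestrator might sequence those calls, the sketch below builds the request list and renders each request as raw HTTP text, ready to be written to the API's Unix socket. The helper names, file paths, boot arguments, and resource sizes are placeholders; the JSON field names are the ones used by the Firecracker API for these endpoints, and drive and interface resources carry their ID in the path.

```python
import json


def boot_sequence(kernel: str, rootfs: str, vcpus: int = 1, mem_mib: int = 128):
    """Ordered (method, path, body) tuples that configure and start one microVM."""
    return [
        ("PUT", "/machine-config", {"vcpu_count": vcpus, "mem_size_mib": mem_mib}),
        ("PUT", "/boot-source", {"kernel_image_path": kernel,
                                 "boot_args": "console=ttyS0 reboot=k panic=1"}),
        ("PUT", "/drives/rootfs", {"drive_id": "rootfs", "path_on_host": rootfs,
                                   "is_root_device": True, "is_read_only": False}),
        ("PUT", "/actions", {"action_type": "InstanceStart"}),
    ]


def to_http(method: str, path: str, body: dict) -> str:
    """Render one request as HTTP/1.1 text with a JSON body."""
    payload = json.dumps(body)
    return (
        f"{method} {path} HTTP/1.1\r\n"
        "Host: localhost\r\n"
        "Content-Type: application/json\r\n"
        f"Content-Length: {len(payload.encode('utf-8'))}\r\n"
        "\r\n"
        f"{payload}"
    )
```

The ordering matters only in that InstanceStart comes last; everything before it describes the machine.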

The Type-1 (bare-metal) / Type-2 (hosted) hypervisor taxonomy does not map cleanly onto KVM. KVM is conventionally labeled Type-2 because it runs inside a general-purpose Linux host, but it executes in kernel space and drives the virtualization hardware directly, so guests run at hardware speed. For that reason it is sometimes described as Type-1.5.

Sources: https://docs.kernel.org/virt/kvm/api.html; Firecracker design document: https://github.com/firecracker-microvm/firecracker/blob/main/docs/design.md.

Jailer: Hardening the Launch Environment

Safety note: the jailer binary requires CAP_SYS_ADMIN and root-level access to cgroup hierarchies and namespace facilities. Running it changes the mount namespace and cgroup membership of the child process. Review the flags carefully before use in a production environment.

The jailer (jailer binary, shipped alongside firecracker) is the recommended production launcher for Firecracker. It applies a sequence of privilege-reduction and isolation steps before it execs into the Firecracker binary, so that Firecracker itself runs in an already-hardened environment. Running Firecracker directly, without the jailer, is supported but leaves all of these mitigations to the operator.

The jailer performs its setup in this order:

1. Cgroup isolation: Writes the process into a cgroup hierarchy at <cgroup_base>/<parent_cgroup>/<id>/ using v1 controllers (cpu, cpuset, memory, pids) or cgroup v2 via --cgroup-version. This happens before the chroot, while the cgroup filesystem is still reachable. Resource limits applied by the cgroup constrain the Firecracker process, not just the guest.
2. Chroot via pivot_root: Creates a new mount namespace with unshare(), then calls pivot_root() to confine the process to <chroot_base>/<exec_file_name>/<id>/root (the default base is /srv/jailer). All file paths the Firecracker process can reach are then relative to that root.
3. Network namespace: If --netns <path> is supplied, the jailer calls setns(fd, CLONE_NEWNET) to join a pre-created network namespace, isolating the process's network view to whatever interfaces exist in that namespace.
4. PID namespace: If --new-pid-ns is passed, the jailer creates a new PID namespace via clone(CLONE_NEWPID), making the Firecracker process PID 1 in its own namespace.
5. Privilege drop: Calls setgid(gid) and then setuid(uid) to drop to the caller-supplied GID and UID; the group must be changed first, because the process can no longer change it once it has given up root. Both --uid and --gid are mandatory; the jailer will not proceed without them.
6. Exec: exec()s the Firecracker binary, automatically appending --id, --start-time-us, and --start-time-cpu-us to the argument list, plus any operator-supplied arguments passed after --.

After step 6, jailer is gone from the process tree. The process is now firecracker, running under the requested UID/GID, confined to the chroot, and constrained by the cgroup.
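
A launcher script typically assembles the jailer command line and needs to know where the jail root will end up. The sketch below does both using only flags named in this entry plus --exec-file and --chroot-base-dir from the jailer documentation; the helper names and all values in the usage note are placeholders, and nothing here invokes the jailer.

```python
import posixpath


def jail_root(chroot_base: str, exec_file: str, vm_id: str) -> str:
    """<chroot_base>/<exec_file_name>/<id>/root"""
    return posixpath.join(chroot_base, posixpath.basename(exec_file), vm_id, "root")


def jailer_argv(vm_id, exec_file, uid, gid, chroot_base="/srv/jailer",
                netns=None, new_pid_ns=False, cgroup_version=None, extra=()):
    """Build an argv list for the jailer; extra args go to Firecracker after --."""
    argv = ["jailer", "--id", vm_id, "--exec-file", exec_file,
            "--uid", str(uid), "--gid", str(gid),
            "--chroot-base-dir", chroot_base]
    if netns:
        argv += ["--netns", netns]
    if new_pid_ns:
        argv.append("--new-pid-ns")
    if cgroup_version is not None:
        argv += ["--cgroup-version", str(cgroup_version)]
    if extra:
        argv += ["--", *extra]
    return argv
```

For a hypothetical VM ID "vm1" and binary /usr/bin/firecracker, `jail_root("/srv/jailer", "/usr/bin/firecracker", "vm1")` gives /srv/jailer/firecracker/vm1/root, which is where the orchestrator must place the kernel and rootfs files before launch.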

Source: https://github.com/firecracker-microvm/firecracker/blob/main/docs/jailer.md.

Seccomp: Limiting the System Call Surface

Safety note: installing a seccomp-BPF filter with SECCOMP_SET_MODE_FILTER and a KillProcess default action is irreversible for the duration of the process. The filter takes effect as soon as the installing seccomp(2) call (or the equivalent prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, ...) call) returns; a mistake in the filter terminates the process on the next unallowed syscall.

Seccomp (secure computing mode) is a Linux kernel mechanism that restricts which system calls a process (or thread) may execute. In Firecracker, per-thread seccomp-BPF filters are installed with SECCOMP_SET_MODE_FILTER on three categories of threads, each with its own allowlist:

Thread	When the filter is installed
`api`	Right before the HTTP server starts
`vmm`	Right before vCPU threads begin executing guest code
`vcpu`	Right before executing guest code

Filters are expressed as JSON files mapping thread category names ("vmm", "api", "vcpu") to filter objects, each with fields default_action, filter_action, and filter (an array of syscall rule objects with optional argument comparators). The seccompiler-bin tool compiles these JSON files into serialized BPF bytecode at Firecracker build time; the bytecode is embedded in the binary. Operators who need to extend the allowlist can supply a custom filter at launch with --seccomp-filter.

Supported default_action and filter_action values: Allow, Errno(u32), Kill (kill thread), KillProcess, Log, Trace(u32), Trap. Argument comparators available in rule objects: eq, ge, gt, le, lt, masked_eq, ne. Each rule matches a syscall number and optional per-argument conditions.
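
The matching semantics can be modeled without any BPF. The sketch below is a plain-Python evaluator, not seccompiler's JSON schema or its generated bytecode: the dictionary shape is invented for this example, and masked_eq is assumed to compare `arg & mask` with `val & mask`.

```python
import operator

_OPS = {
    "eq": operator.eq, "ne": operator.ne,
    "ge": operator.ge, "gt": operator.gt,
    "le": operator.le, "lt": operator.lt,
}


def arg_matches(op: str, arg: int, val: int, mask: int = None) -> bool:
    """Evaluate one argument comparator."""
    if op == "masked_eq":
        return (arg & mask) == (val & mask)
    return _OPS[op](arg, val)


def rule_matches(rule: dict, syscall: str, args) -> bool:
    """A rule matches when the syscall matches and every condition holds."""
    if rule["syscall"] != syscall:
        return False
    return all(
        arg_matches(c["op"], args[c["index"]], c["val"], c.get("mask"))
        for c in rule.get("args", [])
    )


def decide(thread_filter: dict, syscall: str, args):
    """filter_action if any rule matches, otherwise default_action."""
    if any(rule_matches(r, syscall, args) for r in thread_filter["filter"]):
        return thread_filter["filter_action"]
    return thread_filter["default_action"]
```

A vcpu-style allowlist in this model might permit `ioctl` only when argument 1 equals 0xAE80 (KVM_RUN); any other ioctl request, or any syscall not listed, falls through to the default action.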

The net effect is that each Firecracker thread type can only call the system calls it actually needs. A guest that achieves code execution in the VMM through a device-emulation bug still cannot call open, execve, or socket from the vmm thread if those syscalls are not on the vmm allowlist.

Sources: https://github.com/firecracker-microvm/firecracker/blob/main/docs/seccomp.md; https://github.com/firecracker-microvm/firecracker/blob/main/docs/seccompiler.md.

MMDS: Metadata Without a Network Path

MMDS (microVM Metadata Service) is Firecracker's analogue of the EC2 Instance Metadata Service. An operator populates a key-value store in the VMM (via PUT /mmds on the API socket); the guest retrieves it over HTTP without any host network path, TAP device involvement, or external routing.

The transport is a designated virtio-net device. The guest sends normal HTTP/TCP/IPv4 packets addressed to the MMDS IP; Firecracker intercepts them on the designated interface before they reach the TAP device and answers them directly from the in-process store.

Default endpoint: 169.254.169.254 (link-local; reconfigurable via ipv4_address in PUT /mmds/config). The MAC address of the MMDS interface is hardcoded as 06:01:23:45:67:01.

Two protocol versions are supported:

Version	Protocol
MMDS v1 (deprecated)	Stateless; accepts `GET` and `PUT`; returns HTTP 200 even without a token
MMDS v2	Session-based, modeled on AWS IMDSv2; guest first issues `PUT /latest/api/token` with an `X-metadata-token-ttl-seconds` header (value 1–21600 seconds); subsequent `GET` requests must include the returned token in `X-metadata-token`; a missing or invalid token returns HTTP 401
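
From the guest's side, the v2 exchange is two HTTP requests. The sketch below only formats them and enforces the TTL range from the table; it opens no socket, the function names are illustrative, and the metadata path in the usage note is a placeholder.

```python
MMDS_IP = "169.254.169.254"
TTL_MIN, TTL_MAX = 1, 21600


def token_request(ttl_seconds: int) -> str:
    """Format the session-token request that opens an MMDS v2 session."""
    if not TTL_MIN <= ttl_seconds <= TTL_MAX:
        raise ValueError("TTL must be between 1 and 21600 seconds")
    return (
        "PUT /latest/api/token HTTP/1.1\r\n"
        f"Host: {MMDS_IP}\r\n"
        f"X-metadata-token-ttl-seconds: {ttl_seconds}\r\n"
        "\r\n"
    )


def metadata_request(path: str, token: str) -> str:
    """Format a metadata read that presents the previously issued token."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {MMDS_IP}\r\n"
        f"X-metadata-token: {token}\r\n"
        "\r\n"
    )
```

A guest agent would send `token_request(60)`, read the token from the response body, and then send `metadata_request("/latest/meta-data", token)` until the TTL lapses.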

Snapshot behavior: the MMDS version, the designated interface binding, and the configured IP address are preserved across snapshot/restore. The data store contents are not preserved — by design, to prevent per-VM secrets from leaking into snapshot-derived clones.

Sources: https://github.com/firecracker-microvm/firecracker/blob/main/docs/mmds/mmds-user-guide.md; https://github.com/firecracker-microvm/firecracker/blob/main/docs/mmds/mmds-design.md.

Snapshot and Restore: Suspending a Running VM

A Firecracker snapshot is a point-in-time capture of all microVM state sufficient to resume execution on a compatible host without rebooting. The guest resumes at the exact instruction it was about to execute when the snapshot was taken, with the same memory contents, device state, and vCPU register file.

Taking a snapshot requires a PUT /snapshot/create request to the API socket. It always produces exactly two output files:

File	Field	Contents
State file	`snapshot_path`	Device emulation state, KVM VM state, and per-vCPU register state; serialized as `bitcode` inside a `Snapshot<MicrovmState>` wrapper with a 64-bit CRC appended
Memory file	`mem_file_path`	Guest memory pages: all of them (full snapshot) or only pages dirtied since the last snapshot (diff snapshot)
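
A body for PUT /snapshot/create can be sketched as below. The snapshot_path and mem_file_path fields come from the table; the snapshot_type field and its "Full" and "Diff" values reflect one reading of the Firecracker API and are assumptions here, as are the example file paths.

```python
import json

VALID_TYPES = ("Full", "Diff")  # assumed spellings of the two snapshot types


def snapshot_create_body(snapshot_path, mem_file_path, snapshot_type="Full"):
    """Build a JSON body naming the two output files of a snapshot."""
    if snapshot_type not in VALID_TYPES:
        raise ValueError(f"snapshot_type must be one of {VALID_TYPES}")
    return json.dumps(
        {
            "snapshot_type": snapshot_type,  # assumed field name
            "snapshot_path": snapshot_path,  # state file
            "mem_file_path": mem_file_path,  # memory file
        }
    )
```

For instance, snapshot_create_body("/tmp/vm.state", "/tmp/vm.mem") requests a full snapshot written to two hypothetical paths.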

Two snapshot types exist:

Full: The memory file receives a complete copy of all guest memory. The state file is always a full state dump regardless of snapshot type.
Diff (incremental, developer preview): The memory file receives only pages dirtied since the last full or diff snapshot. Firecracker determines dirty pages by unioning KVM's own dirty bitmap (retrieved via KVM_GET_DIRTY_LOG) with its internal AtomicBitmap tracking virtio queue memory. If track_dirty_pages was false when the VM was created, the fallback is mincore(2), which requires swap to be disabled on the host.
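
The union step for diff snapshots is a bitwise OR of two per-page bitmaps. The sketch below models each bitmap as a bytes object with one bit per guest page, least significant bit first; the function names are hypothetical, and Firecracker's own data structures differ.

```python
def union_bitmaps(kvm_bitmap: bytes, vmm_bitmap: bytes) -> bytes:
    """OR two equal-length dirty bitmaps: a page is dirty if either says so."""
    if len(kvm_bitmap) != len(vmm_bitmap):
        raise ValueError("bitmaps must cover the same number of pages")
    return bytes(a | b for a, b in zip(kvm_bitmap, vmm_bitmap))


def dirty_pages(bitmap: bytes) -> list:
    """List the page indices whose bit is set (bit 0 of byte 0 is page 0)."""
    return [
        byte_index * 8 + bit
        for byte_index, byte in enumerate(bitmap)
        for bit in range(8)
        if (byte >> bit) & 1
    ]
```

Only the pages returned by dirty_pages(union_bitmaps(...)) would need to be written to the memory file of a diff snapshot.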

Restoring from a snapshot requires PUT /snapshot/load. Guest memory can be backed in one of two ways: a MAP_PRIVATE file mapping (copy-on-write; pages are faulted in from the memory file on demand), or the UFFD backend, in which a userfaultfd handler serves guest page faults. The UFFD route lets the operator supply pages from tiered memory backends, which suits clone-on-demand patterns such as serverless cold starts.
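
The copy-on-write property of a MAP_PRIVATE mapping is what keeps the memory file intact while a restored guest writes to its memory. The following self-contained demonstration uses a temporary file in place of a real memory file and assumes a Unix host; it illustrates the mmap semantics only, not Firecracker's restore code.

```python
import mmap
import tempfile


def private_mapping_demo():
    """Write through a MAP_PRIVATE mapping and show the file is unchanged."""
    page = mmap.PAGESIZE
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"A" * page)  # stand-in for one page of guest memory
        f.flush()
        m = mmap.mmap(
            f.fileno(),
            page,
            flags=mmap.MAP_PRIVATE,
            prot=mmap.PROT_READ | mmap.PROT_WRITE,
        )
        try:
            m[0:1] = b"B"        # the write triggers a private page copy
            in_memory = m[0:1]   # the mapping sees the new value
        finally:
            m.close()
        f.seek(0)
        on_disk = f.read(1)      # the backing file still holds the old value
    return in_memory, on_disk
```

Because the write lands on a private copy of the page, several clones can map the same memory file without interfering with one another.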

What snapshot/restore does not preserve: active network packet flow, open vsock connections, MMDS data store contents, metrics and logging configuration, and wall-clock time. By default, Firecracker does not advance wall-clock time on restore; set clock_realtime: true in PUT /snapshot/load (added in v1.16.0) to advance the guest clock to match elapsed real time at restore. Without it, the guest must re-synchronize the clock via NTP or equivalent.

Compatibility constraints: the restoring host must have the same CPU architecture as the host that took the snapshot, and the documentation explicitly warns against cross-vendor restore (an Intel snapshot restored on AMD). The snapshot format's MAJOR version must match the one supported by the running Firecracker binary.

Sources: https://github.com/firecracker-microvm/firecracker/blob/main/docs/snapshotting/snapshot-support.md; Firecracker src/vmm/src/persist.rs; src/vmm/src/vstate/memory.rs.

Sources and Further Reading

Intel SDM Vol. 3C (VMX, VMCS, EPT, VM-entry/exit, I/O exit handling): https://cdrdv2-public.intel.com/671506/326019-sdm-vol-3c.pdf
Intel 5-Level Paging white paper, doc 335252-002 (EPT levels, EPTP encoding): https://cdrdv2-public.intel.com/671442/5-level-paging-white-paper.pdf
Intel HAXM vmx.h (VM-exit reason code reference): https://github.com/intel/haxm/blob/master/core/include/vmx.h
AMD APM Vol. 2, doc 24593 (SVM, VMCB layout, NPT, #VMEXIT codes): https://www.0x04.net/doc/amd/33047.pdf
Linux KVM API documentation (KVM_CREATE_VM, KVM_CREATE_VCPU, KVM_RUN, KVM_SET_USER_MEMORY_REGION, KVM_IRQFD, KVM_IOEVENTFD, KVM_GET_DIRTY_LOG): https://docs.kernel.org/virt/kvm/api.html
Linux KVM MMU documentation (GPA/HPA/HVA, memslots, shadow vs. tdp mode): https://kernel.org/doc/Documentation/virtual/kvm/mmu.txt
Linux KVM x86 MMU: https://docs.kernel.org/virt/kvm/x86/mmu.html
Linux KVM nested VMX (vmcs01/vmcs12/vmcs02 notation): https://docs.kernel.org/virt/kvm/x86/nested-vmx.html
Linux include/uapi/linux/kvm.h (ioctl numbers, struct kvm_run, KVM_EXIT_* constants): https://raw.githubusercontent.com/torvalds/linux/master/include/uapi/linux/kvm.h
Linux include/uapi/linux/virtio_mmio.h (register offsets): https://github.com/torvalds/linux/blob/master/include/uapi/linux/virtio_mmio.h
Linux include/uapi/linux/virtio_config.h (feature bits including VIRTIO_F_VERSION_1): https://github.com/torvalds/linux/blob/master/include/uapi/linux/virtio_config.h
OASIS virtio 1.2 Committee Specification CS01 (split virtqueue, MMIO transport, feature negotiation): https://docs.oasis-open.org/virtio/virtio/v1.2/cs01/virtio-v1.2-cs01.html
Firecracker design document: https://github.com/firecracker-microvm/firecracker/blob/main/docs/design.md
Firecracker jailer documentation: https://github.com/firecracker-microvm/firecracker/blob/main/docs/jailer.md
Firecracker seccomp documentation: https://github.com/firecracker-microvm/firecracker/blob/main/docs/seccomp.md
Firecracker seccompiler documentation: https://github.com/firecracker-microvm/firecracker/blob/main/docs/seccompiler.md
Firecracker MMDS user guide: https://github.com/firecracker-microvm/firecracker/blob/main/docs/mmds/mmds-user-guide.md
Firecracker MMDS design document: https://github.com/firecracker-microvm/firecracker/blob/main/docs/mmds/mmds-design.md
Firecracker snapshot support documentation: https://github.com/firecracker-microvm/firecracker/blob/main/docs/snapshotting/snapshot-support.md
Firecracker src/vmm/src/builder.rs (vCPU thread model, resume_vm()): https://github.com/firecracker-microvm/firecracker/blob/main/src/vmm/src/builder.rs
Linux TUN/TAP documentation: https://docs.kernel.org/networking/tuntap.html