Chapter 13: Firecracker Architecture

What does one firecracker process contain, and how are its parts connected? The process must serve a REST API, run one or more guest vCPU loops, emulate a minimal device set, and do all of it while keeping the guest's influence over the host kernel as narrow as a handful of ioctls. This chapter traces the architecture that satisfies those constraints.

One Process, One MicroVM

Firecracker's design document states the invariant plainly: "Each Firecracker process encapsulates one and only one microVM." There is no multiplexing, no VM table, no daemon that routes requests to a pool. Running N microVMs means N firecracker processes. This is a deliberate constraint, not an implementation shortcut. A single process failure cannot cascade to a neighboring VM; the host kernel's process isolation is part of the security boundary, complementing the hardware VMX enforcement and the jailer's chroot.

Inside that one process, three categories of OS thread do distinct work.

flowchart TD
  proc["firecracker process"]
  api["fc_api thread\n(API server, control plane)"]
  vmm["VMM thread\n(event loop, device emulation)"]
  vcpu0["fc_vcpu 0\n(KVM_RUN loop)"]
  vcpu1["fc_vcpu 1\n(KVM_RUN loop)"]
  kvm["/dev/kvm"]

  proc --> api
  proc --> vmm
  proc --> vcpu0
  proc --> vcpu1
  api -->|"mpsc + EventFd"| vmm
  vmm -->|"mpsc VcpuEvent"| vcpu0
  vmm -->|"mpsc VcpuEvent"| vcpu1
  vcpu0 -->|"KVM_RUN ioctl"| kvm
  vcpu1 -->|"KVM_RUN ioctl"| kvm

The API thread is named fc_api via thread::Builder::new().name("fc_api"). It runs the HTTP server over a Unix domain socket and handles every configuration request, but it never touches guest execution directly. The VMM thread is the hub: it owns the KVM VmFd, all device backends, the MMDS metadata service, and the EventManager epoll loop. The vCPU threads — one per guest CPU core, named fc_vcpu 0, fc_vcpu 1, and so on using format!("fc_vcpu {}", self.kvm_vcpu.index) — do nothing but call KVM_RUN and service the exits that return from it. The vCPU threads are the part that runs untrusted code, and Firecracker's security design treats them accordingly: they receive the most restrictive seccomp-BPF filter of any thread in the process.
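
The naming scheme can be tried out with nothing but the standard library. The following is a minimal sketch, not Firecracker's code: the helper spawn_named_threads is hypothetical, the thread bodies are placeholders, and only the two name patterns quoted above are taken from the text.

```rust
use std::thread;

/// Spawns one placeholder thread per name and returns the name each thread
/// observed for itself. `spawn_named_threads` is a hypothetical helper.
fn spawn_named_threads(vcpu_count: u8) -> Vec<String> {
    let mut names = vec!["fc_api".to_string()];
    // vCPU thread names follow the format string quoted in the text.
    names.extend((0..vcpu_count).map(|index| format!("fc_vcpu {}", index)));

    let handles: Vec<_> = names
        .into_iter()
        .map(|name| {
            thread::Builder::new()
                .name(name)
                // The body only reports its own name; real threads do real work here.
                .spawn(|| thread::current().name().map(str::to_owned).unwrap_or_default())
                .expect("failed to spawn thread")
        })
        .collect();

    handles
        .into_iter()
        .map(|handle| handle.join().expect("thread panicked"))
        .collect()
}
```

Named threads show up under those names in tools such as ps -T and top -H, which makes it easy to tell the control plane from the vCPU loops on a live host.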

The REST API Over a Unix Socket

Every Firecracker configuration operation arrives through a single Unix domain socket. The socket is never TCP, always a SOCK_STREAM Unix socket. A common convention is /run/firecracker.socket; the --api-sock command-line flag sets the path explicitly. When the jailer is in use, the socket appears inside the chroot at <chroot_base>/<exec_file_name>/<id>/root/<api-sock>, which from the host looks like /srv/jailer/firecracker/<id>/root/run/firecracker.socket.
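
The host-side path can be derived mechanically from the jailer layout above. This is a small sketch under stated assumptions: jailed_socket_path is a hypothetical helper, and the id "vm-1" in the usage note is made up for illustration.

```rust
use std::path::{Path, PathBuf};

/// Builds <chroot_base>/<exec_file_name>/<id>/root/<api-sock> as seen from the host.
/// Hypothetical helper; the jailer itself does not expose such a function.
fn jailed_socket_path(
    chroot_base: &Path,
    exec_file_name: &str,
    id: &str,
    api_sock: &Path,
) -> PathBuf {
    let root = chroot_base.join(exec_file_name).join(id).join("root");
    // Path::join discards the base when given an absolute path,
    // so the leading "/" is stripped before joining.
    let relative = api_sock.strip_prefix("/").unwrap_or(api_sock);
    root.join(relative)
}
```

Calling it with /srv/jailer, firecracker, vm-1, and /run/firecracker.socket yields /srv/jailer/firecracker/vm-1/root/run/firecracker.socket, matching the layout described above.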

Firecracker does not use a standard Rust HTTP crate. It uses micro_http, an in-house crate maintained by the firecracker-microvm GitHub organization at https://github.com/firecracker-microvm/micro-http. It is referenced in Cargo.toml as a git dependency, not a crates.io package. micro_http implements HTTP/1.0 and HTTP/1.1 with no chunked transfer encoding and no compression; it enforces a configurable maximum request body size via server.set_payload_max_size(limit), returning HTTP 400 on violation. The API socket is available within at most 8 CPU-milliseconds of process start — wall-clock typically around 12 ms, with observed range 6–60 ms — because the API thread and socket setup complete before the VMM thread touches KVM.

The API conforms to an OpenAPI specification kept in-tree at src/firecracker/swagger/firecracker.yaml. SPECIFICATION.md says: "The API socket is always available and the API conforms to the in-tree Open API specification." All resources are stable across patch versions; the specification is the normative contract.
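
Because the transport is plain HTTP/1.1 over a Unix socket, a client needs no HTTP library at all. The sketch below uses only std; build_request and send are hypothetical helpers, the header set is an assumption rather than a documented minimum, and the single read() is a simplification that may truncate a large response.

```rust
use std::io::{Read, Write};
use std::os::unix::net::UnixStream;

/// Formats a minimal HTTP/1.1 request with a JSON body (which may be empty).
fn build_request(method: &str, path: &str, json_body: &str) -> String {
    format!(
        "{} {} HTTP/1.1\r\nHost: localhost\r\nContent-Type: application/json\r\nContent-Length: {}\r\n\r\n{}",
        method,
        path,
        json_body.len(),
        json_body
    )
}

/// Sends one request over the Unix socket and returns the first chunk of the reply.
/// Assumption: the whole response fits in one read; a robust client would loop.
fn send(socket_path: &str, request: &str) -> std::io::Result<String> {
    let mut stream = UnixStream::connect(socket_path)?;
    stream.write_all(request.as_bytes())?;
    let mut buf = [0u8; 4096];
    let n = stream.read(&mut buf)?;
    Ok(String::from_utf8_lossy(&buf[..n]).into_owned())
}
```

A GET / request built this way and sent to the path given by --api-sock should return the instance information as JSON; if no Firecracker process is listening, send returns an error instead.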

How the API Thread Talks to the VMM Thread

The two threads share two std::sync::mpsc channels and one EventFd:

API thread                             VMM thread
   |                                       |
   |--- Sender<ApiRequest> ------------->  |   (boxed VmmAction)
   |--- EventFd (api_event_fd).write() --> |   (wake-up signal)
   |<-- Sender<ApiResponse> -------------- |   (reply)

ApiServer::new() takes three arguments: an mpsc::Sender<ApiRequest> to forward requests into, an mpsc::Receiver<ApiResponse> to read replies from, and the EventFd that wakes the VMM's event loop. When a request arrives on the socket, the API thread sends the boxed action down the channel and writes to the EventFd. The VMM thread is blocked in its EventManager epoll loop; the EventFd write wakes it, and it drains the request receiver synchronously. The reply comes back on the reverse channel. From the API caller's perspective this looks like a synchronous RPC; under the hood, two decoupled threads produce request-response semantics through a channel pair.

A slow API request — configuring 32 block devices before boot, for example — does not pause a running vCPU. The two threads are independent.
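
The channel pair can be modelled with std alone. In this sketch the ApiRequest and ApiResponse enums are simplified stand-ins for the real types, and the EventFd is left out entirely: the blocking recv() plays the role that epoll plus the EventFd write play in Firecracker.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical stand-ins for the real request and response types.
#[derive(Debug, PartialEq)]
enum ApiRequest {
    GetInfo,
    Pause,
}

#[derive(Debug, PartialEq)]
enum ApiResponse {
    Info(&'static str),
    Ok,
}

/// Sends each request to a "VMM" thread and collects the replies in order.
fn round_trip(requests: Vec<ApiRequest>) -> Vec<ApiResponse> {
    let (to_vmm, from_api) = mpsc::channel::<ApiRequest>();
    let (to_api, from_vmm) = mpsc::channel::<ApiResponse>();

    let vmm = thread::spawn(move || {
        // recv() blocks until a request arrives; it stands in for epoll + EventFd.
        while let Ok(request) = from_api.recv() {
            let response = match request {
                ApiRequest::GetInfo => ApiResponse::Info("Running"),
                ApiRequest::Pause => ApiResponse::Ok,
            };
            if to_api.send(response).is_err() {
                break;
            }
        }
    });

    let mut replies = Vec::new();
    for request in requests {
        to_vmm.send(request).expect("VMM thread is gone");
        // Waiting for the reply here is what makes the exchange look synchronous.
        replies.push(from_vmm.recv().expect("no reply from VMM thread"));
    }
    drop(to_vmm); // closing the request channel ends the VMM loop
    vmm.join().expect("VMM thread panicked");
    replies
}
```

The caller blocks on the reply channel after every send, which is the whole trick: two independent threads, one outstanding request at a time.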

The VMM Thread: Events and Devices

The VMM thread owns the in-memory Vmm struct and the KvmVm it contains. Its central primitive is EventManager, an epoll-based event loop. Every device backend and the VMM itself register as event subscribers. When a vCPU exits with KVM_EXIT_MMIO — the exit reason is 6 in <linux/kvm.h> — the exit delivers the MMIO address and data through the kvm_run struct; the vCPU thread that took the exit dispatches directly to the device handler via mmio_bus.read() or mmio_bus.write() on the Peripherals struct attached to that vCPU, without waking the VMM thread.

The devices the VMM thread owns are the virtio backends (Net, Block, Vsock, Balloon, and Entropy, plus the pmem and virtio-mem devices that the boot sequence attaches), the Microvm Metadata Service, and the legacy device model. The legacy model is minimal: an i8042 PS/2 controller (used on x86-64 exclusively for CPU reset signaling) and a serial UART, both implemented via the vm-superio crate. The UART is the 16550A-compatible Serial struct; the i8042 is the I8042Device struct. Neither of these is in the hot path once the guest has booted; they exist to provide the reset mechanism and, when enabled, a serial console for debugging.

When a vCPU thread finishes its work — either because guest execution completed or because it received a shutdown event — it writes to an exit_evt: EventFd that was passed at construction time as Vcpu::new(..., exit_evt). The VMM's EventManager subscribes to this fd via vcpus_exit_evt(). That is how the VMM detects an unexpected vCPU exit without polling.

vCPU Threads: The KVM_RUN Loop

KvmVm::start_vcpus() calls vcpu.start_threaded(...) for each vCPU. Inside start_threaded, the spawn looks like:

thread::Builder::new()
    .name(format!("fc_vcpu {}", self.kvm_vcpu.index))
    .spawn(move || { ... })

vCPU threads start in a paused state. Vmm::resume_vm() must be called to release them into guest execution. That call is the hard boundary between the pre-boot phase and the running phase; before it, configuration is still mutable.

Inside the vCPU thread's loop, the call is self.kvm_vcpu.fd.run(), which issues the KVM_RUN vcpu ioctl — _IO(KVMIO, 0x80) — on the vcpu fd. The ioctl does not return until the guest causes a VM exit. The exit reason lives in the kvm_run struct that KVM maps into the VMM's address space via mmap on the vcpu fd; the mapping size comes from KVM_GET_VCPU_MMAP_SIZE (_IO(KVMIO, 0x04)). Firecracker reads this through kvm-ioctls's KvmRunWrapper, which exposes the struct fields safely from Rust.

Firecracker's handle_kvm_exit() dispatches on a VcpuExit enum (from kvm-ioctls) that covers the following exits:

| Exit | kvm.h constant | Action |
| --- | --- | --- |
| MmioRead(addr, data) | KVM_EXIT_MMIO (6) | Read device register, fill data buffer |
| MmioWrite(addr, data) | KVM_EXIT_MMIO (6) | Write device register |
| SystemEvent(RESET) | KVM_EXIT_SYSTEM_EVENT (24) | Return VcpuEmulation::Stopped |
| SystemEvent(SHUTDOWN) | KVM_EXIT_SYSTEM_EVENT (24) | Return VcpuEmulation::Stopped |
| FailEntry | KVM_EXIT_FAIL_ENTRY (9) | Log and stop |
| InternalError | KVM_EXIT_INTERNAL_ERROR (17) | Log and stop |
| Hlt | KVM_EXIT_HLT (5) | UnhandledKvmExit error |
| Shutdown | KVM_EXIT_SHUTDOWN (8) | UnhandledKvmExit error |

KVM_EXIT_HLT and KVM_EXIT_SHUTDOWN are treated as errors rather than graceful shutdowns because Firecracker expects the guest to use KVM_EXIT_SYSTEM_EVENT for clean power-off. A HLT that reaches the VMM means no device has consumed the halt — something unexpected happened.
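
The dispatch described above amounts to a single match. The sketch below is illustrative only: Exit is a hypothetical mirror of the exits listed for kvm-ioctls's VcpuExit, a four-byte array stands in for a device register, and the Fatal variant is an invented name for the "log and stop" outcome.

```rust
// Hypothetical mirror of the exits discussed above; the real enum is kvm_ioctls::VcpuExit.
#[derive(Debug)]
enum Exit<'a> {
    MmioRead(u64, &'a mut [u8]),
    MmioWrite(u64, &'a [u8]),
    SystemEventReset,
    SystemEventShutdown,
    FailEntry,
    InternalError,
    Hlt,
    Shutdown,
}

#[derive(Debug, PartialEq)]
enum Emulation {
    Handled,
    Stopped,
}

#[derive(Debug, PartialEq)]
enum ExitError {
    Fatal(&'static str), // invented name for "log and stop"
    UnhandledKvmExit(&'static str),
}

/// Dispatches one exit. `device_reg` stands in for whatever the MMIO bus would reach.
fn handle_exit(exit: Exit<'_>, device_reg: &mut [u8; 4]) -> Result<Emulation, ExitError> {
    match exit {
        Exit::MmioRead(_addr, data) => {
            let n = data.len().min(device_reg.len());
            data[..n].copy_from_slice(&device_reg[..n]);
            Ok(Emulation::Handled)
        }
        Exit::MmioWrite(_addr, data) => {
            let n = data.len().min(device_reg.len());
            device_reg[..n].copy_from_slice(&data[..n]);
            Ok(Emulation::Handled)
        }
        Exit::SystemEventReset | Exit::SystemEventShutdown => Ok(Emulation::Stopped),
        Exit::FailEntry => Err(ExitError::Fatal("FailEntry")),
        Exit::InternalError => Err(ExitError::Fatal("InternalError")),
        Exit::Hlt => Err(ExitError::UnhandledKvmExit("Hlt")),
        Exit::Shutdown => Err(ExitError::UnhandledKvmExit("Shutdown")),
    }
}
```

Only the two MMIO arms return to the loop for another KVM_RUN; every other arm ends the vCPU thread's run, either cleanly or with an error.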

To interrupt a running vCPU from another thread, Firecracker calls self.kvm_vcpu.fd.set_kvm_immediate_exit(1). This sets the immediate_exit field in the shared kvm_run struct to 1, which causes the next (or current) KVM_RUN to return with EINTR. The vCPU thread then drains its command channel — an mpsc::Receiver<VcpuEvent> — and acts on whatever the VMM thread sent (Pause, Resume, SaveState, and so on).

The KVM Ioctl Hierarchy

The three fd types and their key ioctls form a strict hierarchy. You cannot issue a vCPU ioctl without first having a VM fd; you cannot have a VM fd without first having opened /dev/kvm.

Note: Issuing the ioctls in the table below opens /dev/kvm and creates kernel-managed resources. Avoid experimenting with them on a shared machine. They require read-write access to /dev/kvm, which on most distributions means running as root or belonging to the kvm group.

| Level | fd | Key ioctl | Encoding | Purpose |
| --- | --- | --- | --- | --- |
| System | /dev/kvm | KVM_CREATE_VM | _IO(KVMIO, 0x01) | Returns a VM fd |
| VM | VM fd | KVM_SET_USER_MEMORY_REGION | _IOW(KVMIO, 0x46, ...) | Maps host memory as guest RAM |
| VM | VM fd | KVM_CREATE_VCPU | _IO(KVMIO, 0x41) | Returns a vcpu fd per core |
| vCPU | vcpu fd | KVM_RUN | _IO(KVMIO, 0x80) | Enters guest; returns on each exit |

KVM_GET_API_VERSION returns 12; this value has been stable since KVM's introduction and Firecracker checks it at startup.
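
The _IO and _IOW encodings used in this chapter expand to plain integers. The sketch below reproduces the generic Linux layout used on x86-64 and aarch64 (number in bits 0-7, type in bits 8-15, size in bits 16-29, direction in bits 30-31); a few other architectures use a different layout. KVMIO is 0xAE in <linux/kvm.h>; the function names are local to this example.

```rust
/// The ioctl "type" byte that <linux/kvm.h> assigns to KVM.
const KVMIO: u32 = 0xAE;

const IOC_NONE: u32 = 0;
const IOC_WRITE: u32 = 1;

/// Packs direction, type, number, and argument size into one request code
/// (generic Linux layout; some architectures differ).
const fn ioc(dir: u32, ty: u32, nr: u32, size: u32) -> u32 {
    (dir << 30) | (size << 16) | (ty << 8) | nr
}

/// _IO(type, nr): no argument is copied in either direction.
const fn io(ty: u32, nr: u32) -> u32 {
    ioc(IOC_NONE, ty, nr, 0)
}

/// _IOW(type, nr, T): userspace passes a struct of `size` bytes to the kernel.
const fn iow(ty: u32, nr: u32, size: u32) -> u32 {
    ioc(IOC_WRITE, ty, nr, size)
}

const KVM_GET_API_VERSION: u32 = io(KVMIO, 0x00);
const KVM_CREATE_VM: u32 = io(KVMIO, 0x01);
const KVM_GET_VCPU_MMAP_SIZE: u32 = io(KVMIO, 0x04);
const KVM_CREATE_VCPU: u32 = io(KVMIO, 0x41);
const KVM_RUN: u32 = io(KVMIO, 0x80);
// struct kvm_userspace_memory_region is 32 bytes long.
const KVM_SET_USER_MEMORY_REGION: u32 = iow(KVMIO, 0x46, 32);
```

These are the values that appear in strace output and in seccomp filter rules that match on the ioctl request argument, for example 0xAE80 for KVM_RUN.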

KVM_SET_USER_MEMORY_REGION takes a kvm_userspace_memory_region struct:

struct kvm_userspace_memory_region {
    __u32 slot;            /* bits 0-15: slot id; bits 16-31: address space id */
    __u32 flags;           /* KVM_MEM_LOG_DIRTY_PAGES = (1<<0), KVM_MEM_READONLY = (1<<1) */
    __u64 guest_phys_addr;
    __u64 memory_size;     /* bytes */
    __u64 userspace_addr;  /* start of the host mmap'd region */
};

Slots may not overlap in guest physical address space. Setting KVM_MEM_LOG_DIRTY_PAGES in flags enables the dirty-page bitmap that diff snapshots depend on — a point the state machine section returns to.
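
A Rust mirror of the struct makes the layout and the overlap rule concrete. This is a sketch, not the kvm-bindings definition: the type is declared locally so the example stands alone, and dram_region and overlaps are hypothetical helpers.

```rust
use std::mem::size_of;

const KVM_MEM_LOG_DIRTY_PAGES: u32 = 1 << 0;

/// Local mirror of struct kvm_userspace_memory_region (kvm-bindings provides the real one).
#[repr(C)]
#[derive(Clone, Copy, Debug, Default)]
struct KvmUserspaceMemoryRegion {
    slot: u32,
    flags: u32,
    guest_phys_addr: u64,
    memory_size: u64,
    userspace_addr: u64,
}

/// Describes one DRAM slot, optionally with dirty-page logging enabled.
fn dram_region(slot: u32, gpa: u64, size: u64, hva: u64, track_dirty: bool) -> KvmUserspaceMemoryRegion {
    KvmUserspaceMemoryRegion {
        slot,
        flags: if track_dirty { KVM_MEM_LOG_DIRTY_PAGES } else { 0 },
        guest_phys_addr: gpa,
        memory_size: size,
        userspace_addr: hva,
    }
}

/// True when two slots cover any common guest physical address.
fn overlaps(a: &KvmUserspaceMemoryRegion, b: &KvmUserspaceMemoryRegion) -> bool {
    let a_end = a.guest_phys_addr.saturating_add(a.memory_size);
    let b_end = b.guest_phys_addr.saturating_add(b.memory_size);
    a.guest_phys_addr < b_end && b.guest_phys_addr < a_end
}

/// The struct must be exactly 32 bytes for the ioctl encoding to match.
fn region_size() -> usize {
    size_of::<KvmUserspaceMemoryRegion>()
}
```

The 32-byte size is the same number that ends up in the size field of the _IOW encoding for KVM_SET_USER_MEMORY_REGION.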

The rust-vmm Crates

Firecracker is not a monolith built from scratch. It is assembled from a set of shared crates maintained under the rust-vmm GitHub organization, plus one in-house crate. The versions pinned in src/vmm/Cargo.toml are:

| Crate | Version | Role |
| --- | --- | --- |
| kvm-bindings | 0.14.0 | FFI structs from <linux/kvm.h>: kvm_run, kvm_regs, kvm_sregs, kvm_userspace_memory_region, generated by rust-bindgen |
| kvm-ioctls | 0.24.0 | Safe wrappers: Kvm, VmFd, VcpuFd, DeviceFd, KvmRunWrapper (mmap of kvm_run) |
| vm-memory | 0.17.1 | GuestAddress GPA newtype, GuestMemoryMmap, GuestRegionMmap; dirty bitmap tracking via backend-bitmap feature |
| vmm-sys-util | 0.15.0 | EventFd, epoll wrappers, ioctl macro families |
| linux-loader | 0.13.2 | Loads vmlinux ELF on x86-64 and Image PE on aarch64; writes boot params to the zero page |
| vm-allocator | 0.1.4 | AddressAllocator for MMIO ranges; IdAllocator for device IDs |
| vm-superio | 0.8.1 | Serial (16550A UART), I8042Device (PS/2 / CPU reset), Rtc (PL031, aarch64 only) |
| vhost | 0.15.0 | vhost-user frontend for offloading virtio-block to an external backend |
| vm-fdt | 0.3.0 | Flattened Device Tree blob generation (aarch64 only) |
| micro_http | git | In-house HTTP/1.x over Unix socket; not published to crates.io |

A few of these are worth examining in more detail because they encode design decisions that affect everything built on top of them.

kvm-ioctls and kvm-bindings

kvm-bindings is what you use when you need to pass a struct into a KVM ioctl: it provides the C types from <linux/kvm.h> as Rust FFI structs, generated by rust-bindgen from the kernel headers. kvm-ioctls wraps those into safe Rust: the Kvm struct wraps /dev/kvm; VmFd wraps the VM fd; VcpuFd wraps the vcpu fd. VcpuFd::run() is the KVM_RUN call. VcpuFd::get_regs() and set_regs() read and write kvm_regs; get_sregs() and set_sregs() handle kvm_sregs. VmFd::register_irqfd() and register_ioeventfd() wire up the kernel-side interrupt and I/O notification mechanisms without needing a VM exit.

vm-memory

GuestAddress is a newtype over u64 representing a guest physical address (GPA). The separation matters: the host never accidentally uses a GPA as a host virtual address (HVA). GuestMemoryMmap is the concrete backed-by-mmap type; it holds a collection of GuestRegionMmap objects, each mapping a contiguous GPA range to a MmapRegion on the host. Cross-region reads and writes — a buffer that straddles two memory regions — are handled transparently by the GuestMemory trait. The backend-bitmap Cargo feature (enabled by Firecracker) adds per-region dirty tracking; this is the userspace side of the dirty-page mechanism that KVM_MEM_LOG_DIRTY_PAGES enables in the kernel.
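
The newtype idea and the cross-region read can be shown in miniature. The types below are simplified stand-ins written for this example, not the vm-memory API: Region uses a Vec<u8> where the real crate uses an mmap, and read returns a bool where the real trait returns a Result.

```rust
/// Newtype for a guest physical address, so it cannot be mixed up with a host pointer.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct GuestAddress(u64);

/// Hypothetical region: a contiguous GPA range backed by host bytes.
struct Region {
    start: GuestAddress,
    data: Vec<u8>,
}

struct GuestMemory {
    regions: Vec<Region>,
}

impl GuestMemory {
    /// Finds the region containing `addr` and the offset of `addr` inside it.
    fn locate(&self, addr: GuestAddress) -> Option<(&Region, usize)> {
        self.regions.iter().find_map(|region| {
            let offset = addr.0.checked_sub(region.start.0)?;
            (offset < region.data.len() as u64).then(|| (region, offset as usize))
        })
    }

    /// Fills `buf` starting at `addr`, continuing into the next region when one ends.
    /// Returns false if any byte is unmapped; `buf` may then be partially filled.
    fn read(&self, mut addr: GuestAddress, buf: &mut [u8]) -> bool {
        let mut done = 0;
        while done < buf.len() {
            let Some((region, offset)) = self.locate(addr) else {
                return false;
            };
            let n = (region.data.len() - offset).min(buf.len() - done);
            buf[done..done + n].copy_from_slice(&region.data[offset..offset + n]);
            done += n;
            addr = GuestAddress(addr.0 + n as u64);
        }
        true
    }
}
```

Because GuestAddress is its own type, passing a raw host pointer or a plain u64 where a guest address is expected becomes a compile error rather than a latent memory-safety bug.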

The Virtio Queue Firecracker Does Not Borrow

One notable absence from the crate list: Firecracker does not use the virtio-queue crate from rust-vmm. It reimplements the VIRTIO 1.2 split virtqueue in src/vmm/src/devices/virtio/queue.rs, with the copyright notice acknowledging both Amazon and the original Chromium OS authors. The internal Queue struct conforms to OASIS VIRTIO 1.2 section 2.7:

| Part | Alignment | Size (bytes) |
| --- | --- | --- |
| Descriptor Table | 16 bytes | 16 × Queue Size |
| Available Ring | 2 bytes | 6 + 2 × Queue Size |
| Used Ring | 4 bytes | 6 + 8 × Queue Size |

Queue Size must be a power of 2; the maximum is 32,768. The virtq_desc struct is 16 bytes: an le64 guest-physical address, an le32 length, an le16 flags field (VIRTQ_DESC_F_NEXT=1, VIRTQ_DESC_F_WRITE=2, VIRTQ_DESC_F_INDIRECT=4), and an le16 next-descriptor index. Firecracker's internal implementation includes formal verification with the Kani model checker and custom helpers — prepare_kick() and try_enable_notification() — for interrupt suppression. The decision to maintain a fork rather than adopt the shared crate keeps those guarantees internal and avoids taking a dependency on a crate whose API surface Firecracker does not fully control.
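
The size formulas and the descriptor layout are easy to check in code. The sketch below is written for this chapter and is not taken from queue.rs: VirtqDesc and ring_sizes are local names, and the struct uses native integers, which matches the little-endian wire format only on little-endian hosts.

```rust
use std::mem::size_of;

const VIRTQ_DESC_F_NEXT: u16 = 1;
const VIRTQ_DESC_F_WRITE: u16 = 2;
const VIRTQ_DESC_F_INDIRECT: u16 = 4;
const MAX_QUEUE_SIZE: u16 = 32768;

/// One descriptor table entry: le64 addr, le32 len, le16 flags, le16 next.
#[allow(dead_code)]
#[repr(C)]
struct VirtqDesc {
    addr: u64,
    len: u32,
    flags: u16,
    next: u16,
}

/// Returns (descriptor table, available ring, used ring) sizes in bytes,
/// or None if the queue size is not a power of two or exceeds the maximum.
fn ring_sizes(queue_size: u16) -> Option<(usize, usize, usize)> {
    if !queue_size.is_power_of_two() || queue_size > MAX_QUEUE_SIZE {
        return None;
    }
    let n = queue_size as usize;
    Some((size_of::<VirtqDesc>() * n, 6 + 2 * n, 6 + 8 * n))
}

/// True when a descriptor is device-writable and chains to another descriptor.
fn is_chained_write(flags: u16) -> bool {
    flags & VIRTQ_DESC_F_NEXT != 0
        && flags & VIRTQ_DESC_F_WRITE != 0
        && flags & VIRTQ_DESC_F_INDIRECT == 0
}
```

For a queue of 256 entries this gives 4096 bytes of descriptors, 518 bytes of available ring, and 2054 bytes of used ring.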

The VIRTIO MMIO transport register map sets the magic value "virt" at offset 0x000, version 2 at offset 0x004, and the QueueNotify kick register at offset 0x050.
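
The magic value is the ASCII string "virt" read as a little-endian 32-bit integer. A minimal sketch of the three registers named above follows; the constant names and the read_register function are hypothetical, and a real device exposes many more registers.

```rust
const MAGIC_VALUE_OFFSET: u64 = 0x000;
const VERSION_OFFSET: u64 = 0x004;
const QUEUE_NOTIFY_OFFSET: u64 = 0x050;

/// Returns the value a driver would read at `offset`, for the two
/// read-only identification registers covered in this sketch.
fn read_register(offset: u64) -> Option<u32> {
    match offset {
        // "virt" as little-endian bytes: 0x76 0x69 0x72 0x74
        MAGIC_VALUE_OFFSET => Some(u32::from_le_bytes(*b"virt")),
        VERSION_OFFSET => Some(2),
        _ => None,
    }
}

/// True when a guest write to `offset` is a queue kick.
fn is_queue_notify(offset: u64) -> bool {
    offset == QUEUE_NOTIFY_OFFSET
}
```

A guest driver that reads anything other than 0x74726976 at offset 0x000 will not treat the MMIO window as a virtio device at all.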

The Pre-Boot / Running State Machine

A Firecracker process is either configuring a VM or running one. It cannot do both. This constraint is not enforced by convention; it is enforced by two distinct controller objects that replace each other at boot time.

The REST endpoint GET / returns an InstanceInfo struct whose state field is one of three strings: "Not started", "Running", or "Paused". These states determine which operations are legal.

stateDiagram-v2
  [*] --> NotStarted : process starts
  NotStarted --> Running : "PUT /actions InstanceStart"
  NotStarted --> Running : "PUT /snapshot/load LoadSnapshot"
  Running --> Paused : "PATCH /vm {state: Paused}"
  Paused --> Running : "PATCH /vm {state: Resumed}"
  Paused --> [*] : process exit
  Running --> [*] : guest shutdown / process exit
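
The diagram translates directly into a transition function. This sketch encodes only the edges drawn above and treats every other combination as illegal; VmState, Action, and transition are names chosen for the example, and whether the real implementation tolerates repeated Pause or Resume requests is not modelled.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum VmState {
    NotStarted,
    Running,
    Paused,
}

#[derive(Clone, Copy, Debug)]
enum Action {
    InstanceStart,
    LoadSnapshot,
    Pause,
    Resume,
}

impl VmState {
    /// The string reported in the `state` field of GET /.
    fn as_str(self) -> &'static str {
        match self {
            VmState::NotStarted => "Not started",
            VmState::Running => "Running",
            VmState::Paused => "Paused",
        }
    }
}

/// Returns the next state, or None when the diagram has no such edge.
fn transition(state: VmState, action: Action) -> Option<VmState> {
    match (state, action) {
        (VmState::NotStarted, Action::InstanceStart) => Some(VmState::Running),
        (VmState::NotStarted, Action::LoadSnapshot) => Some(VmState::Running),
        (VmState::Running, Action::Pause) => Some(VmState::Paused),
        (VmState::Paused, Action::Resume) => Some(VmState::Running),
        _ => None,
    }
}
```

Note that no edge leads back to NotStarted: reconfiguring a microVM means starting a new process.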

PrebootApiController

Before InstanceStart, the VMM thread handles requests through PrebootApiController. Every configuration mutation goes through this controller: ConfigureBootSource, InsertBlockDevice, InsertNetworkDevice, InsertPmemDevice, PutCpuConfiguration, SetMmdsConfiguration, SetVsockDevice, SetEntropyDevice, SetBalloonDevice, UpdateMachineConfiguration, and LoadSnapshot. The transition trigger is StartMicroVm (mapped to PUT /actions with action_type: InstanceStart). Once that action lands, PrebootApiController hands off to RuntimeApiController and is never consulted again.

The MachineConfig that the pre-boot controller accepts has these fields:

| Field | Type | Default | Constraint |
| --- | --- | --- | --- |
| vcpu_count | u8 | 1 | 1–32; must be 1 or even if SMT enabled; SMT unsupported on aarch64 |
| mem_size_mib | usize | 128 | > 0; must be a multiple of 2 if using 2 MiB huge pages; must be ≥ balloon target |
| smt | bool | false | x86-64 only |
| track_dirty_pages | bool | false | Required for diff snapshots |
| huge_pages | enum | None | None or Hugetlbfs2M |

track_dirty_pages: true causes Firecracker to pass KVM_MEM_LOG_DIRTY_PAGES in the flags field of kvm_userspace_memory_region when registering DRAM slots. It cannot be toggled after boot.
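
The constraints on MachineConfig can be expressed as one validation function. The sketch below is a simplified stand-in for the real struct: the validate method, its error strings, and the target_is_aarch64 parameter are assumptions made for illustration, and the balloon-target check is left out because it needs state from another device.

```rust
#[derive(Debug, PartialEq)]
enum HugePages {
    None,
    Hugetlbfs2M,
}

#[derive(Debug)]
struct MachineConfig {
    vcpu_count: u8,
    mem_size_mib: usize,
    smt: bool,
    track_dirty_pages: bool,
    huge_pages: HugePages,
}

impl Default for MachineConfig {
    fn default() -> Self {
        MachineConfig {
            vcpu_count: 1,
            mem_size_mib: 128,
            smt: false,
            track_dirty_pages: false,
            huge_pages: HugePages::None,
        }
    }
}

impl MachineConfig {
    /// Checks the listed constraints, except the balloon-target comparison.
    fn validate(&self, target_is_aarch64: bool) -> Result<(), &'static str> {
        if !(1..=32).contains(&self.vcpu_count) {
            return Err("vcpu_count must be between 1 and 32");
        }
        if self.smt && target_is_aarch64 {
            return Err("SMT is not supported on aarch64");
        }
        if self.smt && self.vcpu_count != 1 && self.vcpu_count % 2 != 0 {
            return Err("vcpu_count must be 1 or even when SMT is enabled");
        }
        if self.mem_size_mib == 0 {
            return Err("mem_size_mib must be greater than 0");
        }
        if self.huge_pages == HugePages::Hugetlbfs2M && self.mem_size_mib % 2 != 0 {
            return Err("mem_size_mib must be a multiple of 2 with 2 MiB huge pages");
        }
        Ok(())
    }
}
```

The defaults pass on both architectures; a three-vCPU SMT guest or a 129 MiB guest on 2 MiB huge pages is rejected before any KVM resource is created.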

RuntimeApiController

After InstanceStart, requests go to RuntimeApiController. The operations available here are narrower. The mutations are Pause, Resume, CreateSnapshot, UpdateBlockDevice, UpdateNetworkInterface (rate-limiter changes only), UpdateBalloon, UpdateBalloonStatistics, PatchMMDS, and SendCtrlAltDel (x86-64 only); the read operations are GetBalloonStats, GetFullVmConfig, and GetMMDS. Attempting a pre-boot-only action in this state returns VmmActionError::OperationNotSupportedPostBoot: "The requested operation is not supported after starting the microVM."
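
The one-way handoff between the two controllers can be sketched as a small enum that replaces itself. Everything here is a simplified model: the action set is cut down to four variants, Controller is a hypothetical type, and the OperationNotSupportedPreBoot variant is an assumption added for symmetry with the post-boot error.

```rust
#[derive(Debug, PartialEq)]
enum VmmAction {
    ConfigureBootSource,
    StartMicroVm,
    Pause,
    Resume,
}

#[derive(Debug, PartialEq)]
enum VmmActionError {
    OperationNotSupportedPostBoot,
    OperationNotSupportedPreBoot, // assumed counterpart, for illustration
}

/// Hypothetical model of "which controller is currently in charge".
#[derive(Clone, Copy, Debug, PartialEq)]
enum Controller {
    Preboot,
    Runtime,
}

impl Controller {
    fn handle(&mut self, action: VmmAction) -> Result<(), VmmActionError> {
        match (*self, action) {
            (Controller::Preboot, VmmAction::StartMicroVm) => {
                // The handoff is one-way: nothing ever sets Preboot again.
                *self = Controller::Runtime;
                Ok(())
            }
            (Controller::Preboot, VmmAction::ConfigureBootSource) => Ok(()),
            (Controller::Preboot, VmmAction::Pause | VmmAction::Resume) => {
                Err(VmmActionError::OperationNotSupportedPreBoot)
            }
            (Controller::Runtime, VmmAction::Pause | VmmAction::Resume) => Ok(()),
            (Controller::Runtime, VmmAction::ConfigureBootSource | VmmAction::StartMicroVm) => {
                Err(VmmActionError::OperationNotSupportedPostBoot)
            }
        }
    }
}
```

Modelling the phases as two distinct handlers, rather than one handler with a boolean flag, means an action that is illegal in the current phase has no code path to reach.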

CreateSnapshot is only valid in the Paused state. The caller first sends PATCH /vm with { "state": "Paused" }, which stops the vCPU threads from re-entering KVM_RUN so that the guest makes no further progress, and then sends PUT /snapshot/create. During the pause, the vCPU deadlock detection timeout is 30 seconds (RECV_TIMEOUT_SEC). A PATCH /vm with { "state": "Resumed" } lifts the pause.

Boot Sequence: builder.rs

The implementation of StartMicroVm lives in src/vmm/src/builder.rs, in build_microvm_for_boot. The sequence inside that function:

  1. Validate that a kernel config is present (else MissingKernelConfig error).
  2. allocate_guest_memory() — creates the GuestMemoryMmap, one region per DRAM slot.
  3. Kvm::new() opens /dev/kvm; Firecracker then creates the KvmVm and generates the VcpuFd objects via KVM_CREATE_VCPU.
  4. Register DRAM regions with KVM via KVM_SET_USER_MEMORY_REGION.
  5. Create DeviceManager.
  6. Load the kernel image via linux-loader. Firecracker requires an uncompressed vmlinux ELF on x86-64, not a bzImage. On x86-64, linux-loader jumps to the ELF's 64-bit entry point directly, bypassing the real-mode decompressor entirely. On aarch64, it loads an Image PE.
  7. Attach devices in order: boot timer, balloon, block (root first), network, pmem, vsock, entropy/RNG, virtio-mem hotplug, legacy devices (aarch64 only), VMGenID, VMClock.
  8. configure_system_for_boot() — architecture-specific: on x86-64, patches the kernel command line and zero page; on aarch64, constructs the Flattened Device Tree via vm-fdt. Firecracker uses MMIO throughout — there is no PCI bus on either architecture — so no PCI configuration is performed here.
  9. Spawn vCPU threads via KvmVm::start_vcpus(). Each thread applies its seccomp-BPF filter before doing any other work; a missing "vcpu" entry in the BpfThreadMap panics the thread with MissingSeccompFilters.
  10. Register the Vmm struct with EventManager.
  11. The VMM is now in the Paused state. Vmm::resume_vm() signals every vCPU thread to enter KVM_RUN.

Step 6 is where the uncompressed vmlinux requirement bites developers who try to use a bzImage directly: the ELF header check in linux-loader fails fast, rather than silently executing the real-mode decompressor stub and producing a boot that hangs without explanation.
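
A quick pre-flight check can catch the wrong image format before Firecracker is even started. The helper below is hypothetical and much weaker than the validation linux-loader performs: it only tests for the four-byte ELF magic at the start of the file.

```rust
/// The four bytes that open every ELF file: 0x7f 'E' 'L' 'F'.
const ELF_MAGIC: [u8; 4] = [0x7f, b'E', b'L', b'F'];

/// True when the image starts with the ELF magic. Hypothetical pre-flight
/// helper; it does not validate the rest of the header.
fn looks_like_elf(image: &[u8]) -> bool {
    image.starts_with(&ELF_MAGIC)
}
```

Reading the first four bytes of the kernel file with std::fs and passing them to looks_like_elf is enough to tell an uncompressed vmlinux from most other image formats.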

sequenceDiagram
  participant API as "API thread (fc_api)"
  participant VMM as "VMM thread"
  participant KVM as "/dev/kvm"
  participant vCPU as "fc_vcpu 0"
  participant Guest as "Guest kernel"

  API->>VMM: InstanceStart (via mpsc + EventFd)
  VMM->>KVM: "KVM_CREATE_VM → VmFd"
  VMM->>KVM: "KVM_SET_USER_MEMORY_REGION (DRAM slots)"
  VMM->>KVM: "KVM_CREATE_VCPU → VcpuFd"
  VMM->>vCPU: spawn fc_vcpu 0 (paused)
  VMM->>VMM: attach devices, configure system
  VMM->>vCPU: resume_vm() (VcpuEvent::Resume)
  vCPU->>KVM: "KVM_RUN"
  KVM-->>vCPU: "VM exit: KVM_EXIT_MMIO (virtio probe)"
  vCPU->>vCPU: "mmio_bus.read/write (inline, no VMM wake)"
  vCPU->>KVM: "KVM_RUN (re-enter)"
  KVM-->>Guest: guest executes at kernel entry
  Guest->>Guest: virtio device enumeration
  Guest->>Guest: "kernel_init execve /sbin/init"

The diagram above condenses the loop: after the first MMIO exit, the vCPU continues issuing KVM_RUN calls and handling exits until the guest is fully initialized. Each MMIO exit is handled inline in the vCPU thread — handle_kvm_exit() calls mmio_bus.read() or mmio_bus.write() directly on the Peripherals struct attached to that vCPU. The mpsc channel to the VMM thread carries only control events (Pause, Resume, SaveState).

The Seccomp Boundary

Each of the three thread types receives a distinct seccomp-BPF filter, applied inside the spawned thread before any other work. The filter map is keyed by the strings "api", "vmm", and "vcpu". Firecracker's design document states the threat model directly: "All vCPU threads are considered to be running malicious code as soon as they have been started; these malicious threads need to be contained." The vCPU threads have the narrowest allowlist for exactly this reason: they make a KVM_RUN ioctl, read from kvm_run, and do almost nothing else. The API thread and VMM thread need a wider set — socket operations, file I/O for block devices, epoll_wait — but each is still a small, auditable list.

The filters are compiled at build time from JSON by seccompiler-bin and embedded in the binary. A failure to install the filter in any thread class causes that thread to panic; the panic propagates through the exit_evt EventFd to the VMM's EventManager, which shuts the process down cleanly. There is no path through which a misconfigured filter silently degrades to permissive mode.
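
The fail-closed lookup can be sketched with a plain HashMap. The types below are placeholders: BpfProgram is modelled as a Vec<u64> rather than a list of real BPF instructions, and filter_for and FilterError are hypothetical names; only the three map keys and the MissingSeccompFilters name come from the text.

```rust
use std::collections::HashMap;

/// Placeholder for a compiled BPF program.
type BpfProgram = Vec<u64>;

#[derive(Debug, PartialEq)]
enum FilterError {
    MissingSeccompFilters(String),
}

/// Looks up the filter for a thread category ("api", "vmm", or "vcpu").
/// A missing entry is an error; there is no permissive fallback.
fn filter_for<'a>(
    map: &'a HashMap<String, BpfProgram>,
    thread_category: &str,
) -> Result<&'a BpfProgram, FilterError> {
    map.get(thread_category)
        .ok_or_else(|| FilterError::MissingSeccompFilters(thread_category.to_string()))
}
```

The important property is the absence of a default: a thread that cannot find its filter gets an error to act on, never an empty allowlist.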

Firecracker's design document describes a concentric trust model, nesting several zones from least trusted (guest vCPU threads) to most trusted (host). The hardware VMX boundary separates guest from VMM; the seccomp filter separates the VMM from most of the host kernel syscall table; the jailer's chroot and namespaces separate the Firecracker process from the rest of the host filesystem. Chapter 18 covers the jailer in detail; Chapter 19 covers the seccomp filters.

What One firecracker Process Holds Open

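After a guest is started, a single firecracker process holds the file descriptors introduced over the course of this chapter: the API socket, /dev/kvm, the VM fd, one vcpu fd per guest core, the epoll fd and EventFds behind the EventManager, and the fds backing each attached device. The total number scales with the device count. A minimal guest (one vCPU, one block device, one network interface) holds on the order of two dozen fds. The VMM memory overhead at this configuration, excluding guest RAM, is specified in SPECIFICATION.md as no more than 5 MiB; in practice it runs around 3 MiB.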

That ceiling, 5 MiB of overhead for a complete VMM process, is what makes packing thousands of microVMs onto a single host tractable. The Firecracker paper (USENIX NSDI 2020) reports QEMU's VMM overhead at roughly 131 MiB per VM (Table 1 of the paper); against Firecracker's 5 MiB ceiling the gap is more than 25×. The difference comes from the minimal device model and the absence of a general-purpose emulation layer. The next chapter opens on the device set that stays within that envelope.

Sources And Further Reading