Chapter 3: Why MicroVMs Exist

The serverless billing model promises that a user pays for exactly the CPU cycles their function consumed, nothing more. Delivering that promise at scale means packing thousands of independent workloads onto a single host while giving each one a security boundary it cannot escape. Containers are the obvious first answer, and for most uses they are sufficient. But in a multi-tenant serverless platform where every customer's code runs without review on shared hardware, the container model has a structural weakness: every tenant shares one kernel. That one fact, and the effort to work around it without sacrificing density, is where microVMs come from.

The Isolation-vs-Density Tradeoff

A Linux container is not a separate operating system instance. It is a collection of views over a shared kernel, carved out by clone(2), unshare(2), and setns(2). Eight namespace types govern these views. Each is listed here with the kernel release that introduced it and, where it differs, the release in which its /proc/[pid]/ns/ symlink appeared: mount tables (2.4.19; symlink in 3.8), PIDs (2.6.24; symlink in 3.8), UTS hostname (2.6.19; symlink in 3.0), System V IPC and POSIX message queues (2.6.19; symlink in 3.0), network stacks (2.6.24; symlink in 3.0), user and group IDs (begun in 2.6.23, completed in 3.8), cgroup roots (4.6), and monotonic and boot-time clock offsets (5.6). Everything inside those namespaces looks isolated. The kernel itself is not.

x86-64 Linux exposes more than 400 system calls. A process in any container reaches the same dispatch table as a process on the bare host. Any kernel vulnerability reachable through that interface is reachable from any container on the machine. Two CVEs make this concrete. CVE-2019-5736, published on 2019-02-11 with a CVSS score of 8.6, allowed a process running as uid 0 inside a container to race a file-descriptor open on /proc/self/exe and overwrite the host runc binary, achieving root code execution on the host. The /proc filesystem crosses namespace boundaries by design; the exploit required nothing more exotic than that. It was fixed in runc 1.0-rc7. CVE-2024-21626 ("Leaky Vessels"), disclosed in January 2024 with the same CVSS score of 8.6, exploited a file-descriptor leak in runc v1.0.0-rc93 through v1.1.11: a container process could obtain a working-directory file descriptor that pointed into the host filesystem namespace, then escape the container via runc run or runc exec. Fixed in runc 1.1.12. Both exploits share one root cause: the security-critical interface is the host kernel's shared syscall table. No namespace can close that gap — the gap is structural. Seccomp filters help, but they reduce the attack surface; they do not eliminate the shared kernel.

Traditional VMs address this differently. Intel VMX introduces root and non-root operation, enforced in silicon through the Virtual Machine Control Structure (VMCS). The guest executes entirely in non-root mode; any sensitive instruction or external event causes a VM exit that transfers control to the hypervisor in root mode. The security-critical boundary is not a software dispatch table but a hardware mode transition. An attacker inside the guest has to escape through that hardware boundary, not through a shared dispatch table. That is a fundamentally different threat model.

The cost is density. A conventional VM running under QEMU 4.2 carries roughly 131 MB of VMM process overhead, measured as the non-shared VMM process memory minus configured guest RAM on an m5d.metal instance (2x Intel Xeon Platinum 8175M, 384 GB RAM, Ubuntu 18.04.2, kernel 4.15.0-1044-aws, per NSDI 2020 Figure 7). QEMU emulates more than 40 virtual devices, drives a BIOS, runs a bootloader, and takes seconds to reach the first userspace process. At 131 MB per VM, the overhead of roughly 2,900 VMs would consume the entire 384 GB host before any guest RAM is allocated. If functions are as small as 128 MB (the minimum Lambda allocation), the VMM overhead alone costs a full function's worth of RAM on every slot.
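
The arithmetic behind those figures can be checked with a short sketch. It assumes decimal units (1 GB = 1000 MB) and ignores oversubscription; the function names are illustrative and not taken from any Firecracker tooling.

```python
# Back-of-the-envelope density arithmetic using the figures quoted above.
# Assumption: decimal units (1 GB = 1000 MB), no memory oversubscription.

QEMU_OVERHEAD_MB = 131      # per-VM VMM overhead quoted for QEMU 4.2
HOST_RAM_MB = 384 * 1000    # 384 GB evaluation host
FUNCTION_MB = 128           # smallest function size discussed in the text


def vms_until_overhead_fills_host(host_mb: int, overhead_mb: int) -> int:
    """Number of VMs whose VMM overhead alone would consume the host's RAM."""
    return host_mb // overhead_mb


def slots(host_mb: int, guest_mb: int, overhead_mb: int) -> int:
    """Slots that fit when each one costs guest RAM plus VMM overhead."""
    return host_mb // (guest_mb + overhead_mb)


if __name__ == "__main__":
    print(vms_until_overhead_fills_host(HOST_RAM_MB, QEMU_OVERHEAD_MB))  # overhead only
    print(slots(HOST_RAM_MB, FUNCTION_MB, QEMU_OVERHEAD_MB))             # 128 MB guest + QEMU
    print(slots(HOST_RAM_MB, FUNCTION_MB, 0))                            # hypothetical zero-overhead VMM
```

Under these assumptions, a 131 MB overhead roughly halves the number of 128 MB slots a host can hold compared with a VMM whose overhead is negligible.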

flowchart LR
    A["Linux container\n(namespaces + seccomp)"]
    B["Traditional VM\n(QEMU/KVM)"]
    C["MicroVM\n(Firecracker/KVM)"]
    A -->|"+ density\n- isolation surface"| D["Shared kernel\nsyscall table"]
    B -->|"+ isolation\n- density"| E["Hardware VMX/SVM\n~131 MB VMM overhead\nboot in seconds"]
    C -->|"+ isolation\n+ density"| F["Hardware VMX/SVM\n<5 MiB VMM overhead\nboot in <125 ms"]

The AWS Lambda Origin

Firecracker was built at Amazon to solve this problem for Lambda and Fargate. The architecture that preceded it used Linux containers to isolate individual functions within a customer's account, and separate EC2 VMs to isolate between customers. That two-tier arrangement imposed a structural inefficiency: each VM had to be sized before anyone knew which tenant mix would fill it. A VM provisioned for 128 MB functions wastes capacity if larger functions land on it; a VM sized for 1.5 GB functions underserves 128 MB slots. The outer VM boundary also meant that Lambda's per-function billing had a real VM lurking behind it, with real boot latency and real idle overhead.

Lambda's production constraints dictated the design requirements precisely. A host with 1 TB of RAM running 128 MB functions needs up to 8,000 function slots to be fully packed; VMM overhead must therefore be a negligible fraction of guest RAM rather than a cost on the order of the function itself. Lambda pre-boots a pool of MicroVMs to absorb burst traffic. By Little's Law (L = λW), with a creation time W of 125 ms, covering demand for λ new slots per second takes a pool of 0.125 × λ slots: one pooled slot for every 8 slot creations per second. Each slot lives at most 12 hours, then is recycled; the same slot handles many serial invocations of the same function.
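
The pool-sizing relationship is Little's Law applied to slot creation. The sketch below is illustrative only: the function names are hypothetical, and the 100-per-second arrival rate is an example value, not a Lambda figure.

```python
import math


def in_flight_creations(arrival_rate_per_s: float, creation_time_s: float) -> float:
    """Little's Law, L = lambda * W: average number of slot creations in progress."""
    return arrival_rate_per_s * creation_time_s


def pool_size(arrival_rate_per_s: float, creation_time_s: float) -> int:
    """Whole pre-booted slots needed to cover the creations in flight."""
    return math.ceil(in_flight_creations(arrival_rate_per_s, creation_time_s))


if __name__ == "__main__":
    # With W = 125 ms, 8 new slots per second keep one creation in flight.
    print(in_flight_creations(8, 0.125))
    # Example rate (an assumption): 100 new slots per second.
    print(pool_size(100, 0.125))
```

The pool scales linearly with creation time, which is why halving boot latency halves the idle capacity that has to be held in reserve.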

Firecracker entered internal Lambda production in 2018. The open-source release was announced on 2018-11-27 on the AWS open-source blog under the Apache 2.0 license. The NSDI 2020 paper reports that Firecracker handles "trillions of requests per month" across Lambda and Fargate (§1, §4.1). The paper's authors are Alexandru Agache and colleagues at Amazon; it appeared at USENIX NSDI 2020.

Firecracker's Six Design Goals

The NSDI 2020 paper states six explicit design criteria in §2. They define the axes along which the paper's benchmarks are organized and against which individual design choices are justified.

Isolation. The guest boundary must be enforced by the hardware VMX/SVM mechanism, not by software policy. Every vCPU thread is treated as potentially hostile from its first instruction. In practice, this means jailer, Firecracker's companion process, drops into a chroot, sets up Linux namespaces, installs its own seccomp-BPF filter covering exactly 24 syscalls (with argument filtering) and 30 ioctls, and then execs firecracker. Of those 30 ioctls, 22 are required by KVM's own ioctl-based API (KVM_CREATE_VM, KVM_SET_USER_MEMORY_REGION, KVM_RUN, and so on). The 24/30 figures are the jailer's own filter; Firecracker itself runs three thread types (API server, VMM, and vCPU threads), each with a separate, narrower seccomp-BPF filter compiled at build time from JSON by seccompiler-bin and embedded in the binary. A syscall not in the relevant thread's whitelist delivers SIGSYS to the offending thread; the filter rejects the call at kernel entry, before any syscall handler runs.

Overhead and density. SPECIFICATION.md requires that the Firecracker VMM process consume no more than 5 MiB of overhead — defined as non-shared VMM process memory minus configured guest RAM — for a single-vCPU guest with 128 MiB of RAM using the Firecracker-tuned kernel. In practice the measured overhead is approximately 3 MiB (NSDI 2020 Figure 7). This overhead does not scale with VM size; adding RAM to the guest adds nothing to the VMM footprint. In production Lambda and Fargate, Firecracker accounts for 3% of total RAM (§5.4), compared to the 131 MB QEMU would require for each slot.

Performance. The guest compute floor, defined in SPECIFICATION.md, targets better than 95% of bare-metal throughput. For I/O the paper reports that 4 kB read P99 latency inside a Firecracker guest adds only 49 µs over the native NVMe result on the same m5d.metal host. Network throughput is specified in SPECIFICATION.md as at least 14.5 Gbps at 80% CPU utilization, or 25 Gbps at 100% CPU. These targets govern the design of the virtio device emulators discussed in later chapters.

Soft allocation. Lambda and Fargate require memory and CPU oversubscription. SPECIFICATION.md and the paper specify that Firecracker must support oversubscription; the paper evaluates it at ratios exceeding 20x and reports production operation at up to 10x (§5.4). This goal has a direct consequence for the device model, discussed below.

Fast switching. A slot must be created fast enough that the pre-boot pool can be refilled as quickly as it drains. SPECIFICATION.md defines the boot time target as the wall-clock duration from the InstanceStart API call until the guest kernel forks /sbin/init, using a minimal init, no serial console, no networking, and a minimal kernel configuration. That target is under 125 ms. SPECIFICATION.md also specifies that the Firecracker process itself (the VMM before any guest is configured) must have its API socket ready within 8 CPU-milliseconds of process start; in practice the wall-clock time is 6–60 ms, most commonly around 12 ms. The design point in docs/design.md is 5 MicroVMs per host core per second, the only confirmed throughput figure in primary sources. The commonly cited headline figure of 150 MicroVMs per second per host appears on the Firecracker homepage but is not stated in SPECIFICATION.md or design.md.

Compatibility. Firecracker must run an unmodified Linux guest kernel and standard ELF binaries. No kernel patches. No guest agent. This is what makes Lambda transparently support existing code without recompilation.
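
The per-thread allowlist described under the isolation goal can be modelled in a few lines. This is a simplified sketch of allowlist evaluation with argument matching, not Firecracker's filter format or its actual rule set: the class names and the choice of allowed syscalls are hypothetical. The one real constant is KVM_RUN, whose ioctl request number is 0xAE80 in linux/kvm.h.

```python
from dataclasses import dataclass

KVM_RUN = 0xAE80  # _IO(KVMIO, 0x80) from linux/kvm.h


@dataclass(frozen=True)
class ArgRule:
    index: int  # which syscall argument to inspect
    value: int  # the only value accepted at that position


@dataclass
class ThreadFilter:
    # syscall name -> alternative rule sets; an empty list means "any arguments"
    allowed: dict

    def decide(self, syscall: str, args: tuple) -> str:
        rule_sets = self.allowed.get(syscall)
        if rule_sets is None:
            return "SIGSYS"  # syscall is not on the allowlist at all
        if not rule_sets:
            return "ALLOW"   # allowed regardless of arguments
        for rules in rule_sets:
            if all(r.index < len(args) and args[r.index] == r.value for r in rules):
                return "ALLOW"
        return "SIGSYS"      # syscall is listed, but not with these arguments


# Illustrative filter for a vCPU-like thread (an assumption, not the real list).
vcpu_filter = ThreadFilter(allowed={
    "read": [],
    "write": [],
    "ioctl": [[ArgRule(index=1, value=KVM_RUN)]],
})

if __name__ == "__main__":
    print(vcpu_filter.decide("ioctl", (3, KVM_RUN, 0)))
    print(vcpu_filter.decide("ioctl", (3, 0x1234, 0)))
    print(vcpu_filter.decide("openat", (0, 0, 0)))
```

Argument filtering is what makes an ioctl allowlist meaningful: permitting ioctl wholesale would expose every driver behind every open file descriptor.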

Why Rust

Firecracker began life as a fork of Google's crosvm, the ChromeOS VMM, which is itself written in Rust. The team deleted USB support, GPU passthrough, the 9p filesystem driver, and other components. At the time of the NSDI 2020 paper, Firecracker contained approximately 50,000 lines of Rust. The team had added more than 20,000 new lines and changed 30,000 lines since the fork, and the codebase was less than half the size of crosvm at that point (NSDI 2020 §2.1). QEMU 4.2, for comparison, has more than 1.4 million lines of C.

That ratio is not a vanity metric. A smaller codebase means a smaller set of syscalls needed from the host kernel. QEMU requires up to 270 unique syscalls during operation; KVM itself adds approximately 120,000 lines of kernel code to the host's trusted computing base (paper §2.1.3). Firecracker's jailer seccomp filter narrows the syscall surface exposed to the VMM from those 270 to 24. Rust was chosen specifically because the device emulators, which handle attacker-controlled input from the guest virtio queues, must be memory-safe without relying on a garbage collector. A garbage-collection pause in a VMM thread stalls a vCPU; that is unacceptable in a sub-millisecond latency path. Rust's ownership and borrowing rules, in particular the prohibition on data races across thread boundaries, are checked at build time rather than at runtime, so they add no overhead to the critical path.

What You Give Up

Everything above is bought with deliberate omissions. Firecracker is fast and dense because it refused to implement what Lambda and Fargate do not need.

The Minimal Device Model

At the time of the NSDI 2020 paper, Firecracker emulated exactly four device types: virtio-net (network), virtio-block (storage), a 16550A-compatible serial console (ttyS0), and a partial i8042 keyboard controller implemented in fewer than 50 lines of Rust, used only to signal reboot and shutdown. Since then the source tree has grown to include virtio-balloon (VIRTIO_ID_BALLOON = 5), virtio-vsock (VIRTIO_ID_VSOCK = 19), virtio-rng (VIRTIO_ID_RNG = 4), virtio-pmem (VIRTIO_ID_PMEM = 27), and virtio-mem (VIRTIO_ID_MEM = 24). The FAQ lists six devices; the current source tree may diverge from that count as development continues.

All of these use the virtio MMIO transport (virtio spec §4.2), not PCI (§4.1). Each device occupies a 4 KiB window in guest physical address space; the first device is mapped at 0xc0001000, the second at 0xc0002000, and so on. The guest kernel learns about each device through its command line, which includes a parameter of the form virtio_mmio.device=4K@0xc0001000:5, where the components are size@gpa:IRQ. There is no PCI bus, no config space, no MMIO discovery scan — the guest is told exactly where each device lives before it boots. This eliminates the probe timeouts that add up to 900 ms to an Ubuntu 18.04 kernel boot on the same hardware.
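
The command-line convention can be sketched as a small generator and parser. The base address, window size, and first IRQ come from the example above; numbering IRQs consecutively for later devices is an assumption made for illustration, and the function names are hypothetical.

```python
import re

MMIO_BASE = 0xC0001000  # first device address quoted in the text
MMIO_STRIDE = 0x1000    # one 4 KiB window per device
FIRST_IRQ = 5           # IRQ from the text's example; consecutive IRQs are an assumption


def virtio_mmio_params(device_count: int) -> list[str]:
    """Build one virtio_mmio.device=size@gpa:IRQ parameter per device."""
    return [
        f"virtio_mmio.device=4K@{MMIO_BASE + i * MMIO_STRIDE:#x}:{FIRST_IRQ + i}"
        for i in range(device_count)
    ]


_PARAM = re.compile(r"virtio_mmio\.device=(\d+)K@(0x[0-9a-fA-F]+):(\d+)$")


def parse_param(param: str) -> tuple[int, int, int]:
    """Return (size in bytes, guest physical address, IRQ) for one parameter."""
    match = _PARAM.match(param)
    if match is None:
        raise ValueError(f"not a virtio_mmio.device parameter: {param!r}")
    size_kib, gpa, irq = match.groups()
    return int(size_kib) * 1024, int(gpa, 16), int(irq)


if __name__ == "__main__":
    for p in virtio_mmio_params(2):
        print(p, parse_param(p))
```

Because every address and interrupt is spelled out on the command line, the guest has nothing to probe for; the set of devices is fixed before the first instruction runs.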

What is absent relative to QEMU's more than 40 emulated devices: GPU (VIRTIO_ID_GPU = 16), virtio-console (3), SCSI (8), virtio-input (18), crypto (20), sound (25), virtiofs (26), and all physical device passthrough. These gaps are features for Lambda; they are missing dependencies for workloads that need them.

No BIOS, No PCI, No USB

Firecracker does not present a BIOS. It does not emulate legacy ISA or PCI devices. It does not support USB. The VMM loads the Linux kernel directly via the x86 Linux boot protocol, whose setup header is identified by the four-byte magic 0x53726448 (the ASCII string "HdrS" when stored little-endian) at offset 0x202. On x86-64 the guest image must be an uncompressed ELF (vmlinux); on aarch64 it must be a PE-formatted Image. The kernel command line includes pci=off, suppressing PCI bus enumeration entirely. There is no ACPI power management (no S5 shutdown path) on x86, and guest reboot is not supported on any architecture (FAQ.md). Recent versions of Firecracker do generate ACPI DSDT tables for virtio-MMIO device enumeration, which is why CONFIG_ACPI=y is still required for block-device boot on x86-64: the kernel needs ACPI to parse the device table even though it will never send an S5 event.

This also means Firecracker cannot boot an arbitrary kernel binary from a disk image — not because of a policy restriction but because the entire guest setup assumes direct kernel loading. There is no stage to hand off to GRUB.
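
The boot-protocol magic mentioned above is easy to verify by hand. The sketch below checks a byte buffer for the value at offset 0x202; the helper name is hypothetical, and the buffer is synthetic rather than a real kernel header.

```python
import struct

HDRS_MAGIC = 0x53726448  # "HdrS" when stored as a little-endian 32-bit value
HEADER_OFFSET = 0x202    # offset of the magic field in the boot protocol header


def has_boot_protocol_magic(buf: bytes) -> bool:
    """Return True if the buffer carries the boot-protocol magic at 0x202."""
    if len(buf) < HEADER_OFFSET + 4:
        return False
    (magic,) = struct.unpack_from("<I", buf, HEADER_OFFSET)
    return magic == HDRS_MAGIC


if __name__ == "__main__":
    # The numeric constant and the ASCII string are the same four bytes.
    print(struct.pack("<I", HDRS_MAGIC))

    # Synthetic header for illustration only.
    fake_header = bytearray(0x300)
    fake_header[HEADER_OFFSET:HEADER_OFFSET + 4] = b"HdrS"
    print(has_boot_protocol_magic(bytes(fake_header)))
```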

A Curated Guest Kernel

The compatibility goal promises an unmodified guest, and that holds in the sense that nothing needs patching. The kernel's version and configuration are a different matter. SPECIFICATION.md specifies a minimum supported guest kernel of Linux 6.1 (requiring Firecracker v1.9.0 or later; support ends 2026-09-02). Linux 5.10 reached its minimum support commitment in Firecracker on 2024-01-31 (kernel-policy.md).

For a block-device boot on x86-64, SPECIFICATION.md requires the following kernel configuration: CONFIG_VIRTIO_MMIO=y, CONFIG_VIRTIO_BLK=y, CONFIG_ACPI=y, CONFIG_PCI=y, and CONFIG_KVM_GUEST=y. For initrd-only boot, the last three can be omitted. CONFIG_KVM_GUEST=y enables the kvm-clock paravirtualized clock source, through which the guest reads time from a memory page the host keeps up to date instead of from emulated timer hardware such as the HPET. That keeps the guest's sense of time accurate without a VM exit on every clock read. On aarch64 the equivalent requirements are CONFIG_ARM_AMBA=y and CONFIG_RTC_DRV_PL031=y.
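
A guest kernel's .config can be checked against that x86-64 list mechanically. The following is a minimal sketch, assuming the standard kernel .config text format; the function names and the sample config are made up for illustration.

```python
# Options the text lists for a block-device boot on x86-64.
REQUIRED_BLOCK_BOOT_X86_64 = (
    "CONFIG_VIRTIO_MMIO",
    "CONFIG_VIRTIO_BLK",
    "CONFIG_ACPI",
    "CONFIG_PCI",
    "CONFIG_KVM_GUEST",
)


def enabled_options(config_text: str) -> set[str]:
    """Collect options set to 'y', skipping blanks and '# ... is not set' lines."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        if value == "y":
            enabled.add(name)
    return enabled


def missing_options(config_text: str, required=REQUIRED_BLOCK_BOOT_X86_64) -> list[str]:
    """Return the required options that are not built in, in list order."""
    have = enabled_options(config_text)
    return [opt for opt in required if opt not in have]


if __name__ == "__main__":
    # Hypothetical config fragment with PCI left unset.
    sample = """\
CONFIG_VIRTIO_MMIO=y
CONFIG_VIRTIO_BLK=y
# CONFIG_PCI is not set
CONFIG_ACPI=y
CONFIG_KVM_GUEST=y
"""
    print(missing_options(sample))
```

Options built as modules (=m) are deliberately treated as missing here, since a kernel with no module loading at boot needs these drivers compiled in.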

The reference CI kernel configuration for x86-64 Linux 6.1 (available at resources/guest_configs/microvm-kernel-ci-x86_64-6.1.config in the Firecracker repository) disables all physical hardware drivers: CONFIG_USB, CONFIG_WIRELESS, CONFIG_BLUETOOTH, CONFIG_SOUND, CONFIG_DRM, CONFIG_FB, CONFIG_BLK_DEV_NVME, CONFIG_ATA, CONFIG_MD, and CONFIG_BLK_DEV_SD are all unset. The compressed kernel image is 4.0 MB with no modules; Ubuntu 18.04's kernel is 6.7 MB plus 44 MB of modules.

The GPU Question

VFIO/PCI passthrough — the standard Linux mechanism for assigning a physical PCI device to a guest — is not implemented. Firecracker GitHub issue #849, opened in January 2019, tracked the feature request; Discussion #4845 (February 2025) announced that the team was pausing the effort due to insufficient internal resources. The underlying conflict is structural: VFIO requires guest memory to be pinned (non-swappable by the host) for the duration of DMA operations. Pinned memory cannot be balloon-reclaimed or tracked for dirty-page migration. That directly contradicts the soft-allocation design goal, which depends on the balloon driver to reclaim idle guest memory across thousands of slots. A future implementation would have to choose one or the other; Firecracker has chosen soft allocation.

Discussion #4845 also describes a planned MVP for GPU support that explicitly excluded multiple GPUs, peer-to-peer GPU transfers, VF passthrough, GPU snapshot/resume, GPU-Direct, NVMe, and hotplugging. That scope description is itself a summary of what the minimal device model considers out of scope. For GPU compute workloads, a container or a traditional VM is the right tool.

The I/O Ceiling

The serial I/O path (one virtio queue, one worker thread, one host file-descriptor per device) is simple and predictable, but it is not the fastest possible path. On the m5d.metal evaluation host, bare-metal NVMe sustains more than 340,000 4 kB read IOPS (roughly 1.4 GB/s at 4 kB). Inside a Firecracker guest with virtio-block, the ceiling is approximately 13,000 IOPS (52 MB/s at 4 kB). The P99 latency penalty over native is only 49 µs; the throughput gap is the cost of serial queue handling (NSDI 2020 §5.3).
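
The bandwidth figures follow from the IOPS counts by unit conversion. The sketch below assumes 4 kB means 4,000 bytes, which is the reading that reproduces the 52 MB/s figure; the function name is illustrative.

```python
def throughput_mb_per_s(iops: int, block_bytes: int = 4000) -> float:
    """Convert an IOPS figure into MB/s (decimal megabytes)."""
    return iops * block_bytes / 1e6


if __name__ == "__main__":
    guest = throughput_mb_per_s(13_000)    # virtio-block ceiling quoted above
    native = throughput_mb_per_s(340_000)  # bare-metal NVMe figure quoted above
    print(f"guest:  {guest:.0f} MB/s")
    print(f"native: {native:.0f} MB/s")
    print(f"ratio:  {native / guest:.1f}x")
```

The ratio of roughly 26x is the throughput cost of the single-queue path, separate from the 49 µs latency penalty.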

Network throughput is bounded similarly. The host tap interface on m5d.metal reaches 44–46 Gb/s. A Firecracker guest peaks at approximately 15 Gb/s across all tested configurations (single stream, 10 concurrent streams). SPECIFICATION.md specifies a floor of 14.5 Gbps at 80% CPU or 25 Gbps at 100% CPU, considerably below what the hardware can deliver but also considerably above what a serverless function invocation requires. Aggregate throughput scales linearly to 16 concurrent MicroVMs, which is the production-relevant question: up to that count, adding slots adds aggregate bandwidth instead of dividing a fixed ceiling.

The virtio-PCI transport, which Firecracker does not support, would reduce per-operation overhead by replacing the MMIO transport's single per-device interrupt line, and the interrupt-status register read that each interrupt requires, with per-queue MSI-X interrupts. That is a structural advantage for latency-sensitive workloads. It is a separate line of work from device passthrough, and it would require adding PCI config space emulation to the VMM.

The Boot Time in Detail

The 125 ms target measures wall-clock time from InstanceStart to the first instruction of /sbin/init, under three conditions: serial console disabled, Firecracker-tuned kernel, minimal root filesystem.

Five hundred serial-boot samples on m5d.metal (NSDI 2020 §5.1) show that Firecracker in pre-configured mode ("FC-pre," where the API calls to set up the kernel and rootfs are completed before the InstanceStart call) boots about twice as fast as a comparable QEMU configuration. The dominant cost in a Linux kernel boot on that hardware is not CPU work — it is device probe timeouts. An Ubuntu 18.04 kernel adds approximately 900 ms compared to a minimal Firecracker kernel because it probes for hardware that does not exist; those probes time out rather than fail fast. Disabling the serial console (console= absent from the kernel command line) saves up to 70 ms. Adding a single statically configured network interface costs 20 ms (Firecracker) or 35 ms (QEMU).

In parallel-boot testing — 1,000 MicroVMs with 50 concurrent creation requests — FC-pre achieves a P99 boot time of 146 ms. At 100 concurrent: 153 ms. The 125 ms target is a serial-path SLA; the parallel path adds contention for KVM device nodes and host memory bandwidth. This is why Lambda keeps a pre-booted pool rather than creating VMs inline with incoming requests: 125 ms is fast for a VM, and it is too slow to absorb a cold burst without buffering.

sequenceDiagram
    participant C as Caller
    participant J as jailer
    participant F as firecracker VMM
    participant K as KVM
    participant G as Guest kernel
    C->>J: exec jailer (chroot, namespaces, seccomp)
    J->>F: exec firecracker
    F->>K: KVM_CREATE_VM
    K-->>F: vm_fd
    Note over F: API socket ready (~12 ms wall-clock)
    C->>F: PUT /boot-source, /drives, /network-interfaces
    C->>F: PUT /actions InstanceStart
    F->>K: KVM_SET_USER_MEMORY_REGION (map vmlinux at guest physical 0x100000)
    F->>K: KVM_CREATE_VCPU
    F->>K: KVM_RUN
    Note over G: Guest begins executing at kernel entry point
    G->>G: init subsystems
    G->>G: probe virtio-MMIO devices (no PCI scan)
    G->>G: fork /sbin/init
    Note over G: ~125 ms from InstanceStart

Sources And Further Reading

Agache, A. et al., "Firecracker: Lightweight Virtualization for Serverless Applications," USENIX NSDI 2020. PDF: https://www.usenix.org/system/files/nsdi20-paper-agache.pdf
Firecracker SPECIFICATION.md: https://github.com/firecracker-microvm/firecracker/blob/main/SPECIFICATION.md
Firecracker docs/design.md: https://github.com/firecracker-microvm/firecracker/blob/main/docs/design.md
Firecracker FAQ.md: https://github.com/firecracker-microvm/firecracker/blob/main/FAQ.md
Firecracker docs/kernel-policy.md: https://github.com/firecracker-microvm/firecracker/blob/main/docs/kernel-policy.md
Firecracker reference kernel config (x86-64, Linux 6.1): https://github.com/firecracker-microvm/firecracker/blob/main/resources/guest_configs/microvm-kernel-ci-x86_64-6.1.config
Firecracker GPU/VFIO Discussion #4845: https://github.com/firecracker-microvm/firecracker/discussions/4845
Firecracker GPU Issue #849: https://github.com/firecracker-microvm/firecracker/issues/849
AWS open-source launch blog (2018-11-27): https://aws.amazon.com/blogs/opensource/firecracker-open-source-secure-fast-microvm-serverless/
OASIS virtio 1.2 specification (CS01), §4.2 (MMIO transport): https://docs.oasis-open.org/virtio/virtio/v1.2/cs01/virtio-v1.2-cs01.html
Linux KVM API documentation: https://www.kernel.org/doc/html/latest/virt/kvm/api.html
Linux x86 boot protocol: https://www.kernel.org/doc/html/latest/arch/x86/boot.html
Linux namespaces(7) man page: https://man7.org/linux/man-pages/man7/namespaces.7.html
CVE-2019-5736 (runc container escape, CVSS 8.6): https://nvd.nist.gov/vuln/detail/CVE-2019-5736
CVE-2024-21626 ("Leaky Vessels", runc file-descriptor escape, CVSS 8.6): https://nvd.nist.gov/vuln/detail/CVE-2024-21626
linux/virtio_ids.h: https://raw.githubusercontent.com/torvalds/linux/master/include/uapi/linux/virtio_ids.h
linux/kvm.h: https://raw.githubusercontent.com/torvalds/linux/master/include/uapi/linux/kvm.h
Firecracker NSDI 2020 benchmark data: https://github.com/firecracker-microvm/nsdi2020-data