Chapter 20: The Threat Model

If you run untrusted code on a multi-tenant system, the central question is not whether isolation works in the happy path — it does. The question is what an adversary can reach after breaking the first barrier. Containers answer it one way: a process that escapes its namespace lands in the host kernel, which is also the kernel every other tenant is running on. MicroVMs answer it differently, with a sequence of barriers, each designed to contain what the previous one failed to stop.

This chapter maps that sequence in Firecracker's terms. The barriers are not informal defense-in-depth platitudes. They are hardware CPU modes, per-thread BPF filters, and a jailer binary that starts privileged and drops to an unprivileged user, each with a specific syscall, VMCS field, or kernel module parameter attached to the claim.

The Trust Axiom

Firecracker's design document states the baseline assumption plainly:

"all vCPU threads are considered to be running malicious code as soon as they have been started; these malicious threads need to be contained."

This is not a disclaimer hedging an improbable edge case. It is the load-bearing premise the entire architecture rests on. A guest OS that boots correctly and runs cooperative workloads is operationally convenient; the containment model is designed for a guest that has been taken over and is actively probing for weaknesses in every direction.

From that premise, the trust hierarchy partitions into three zones: an untrusted zone (all guest vCPU threads and all guest network traffic), a semi-trusted zone (the Firecracker VMM process itself), and a trusted zone (the host kernel, KVM module, the Unix socket API channel, snapshot files, and the physical hardware). "Semi-trusted" is precise: Firecracker is written in safe Rust with a deliberately small codebase, but it remains a userspace process, and a vulnerability in its virtio emulation paths is in scope for exploitation. What limits the damage is the containment imposed around that process — not the assumption that the process is correct.

Firecracker explicitly disclaims network traffic filtering. The design document states that all egress from a guest is untrusted and must be filtered at the host level by the operator — typically with nftables rules applied to the TAP interfaces. That delegation is not a weakness; it reflects where the right tool sits. The VMM is not a firewall.

flowchart TB
    subgraph untrusted["Untrusted zone"]
        vcpu["Guest vCPU threads"]
        gnet["Guest network egress"]
    end
    subgraph semitrusted["Semi-trusted zone"]
        vmm["Firecracker VMM process"]
    end
    subgraph trusted["Trusted zone"]
        hk["Host kernel / KVM"]
        api["Unix socket API"]
        snap["Snapshot files"]
    end
    vcpu -- "VM exit" --> vmm
    vmm -- "KVM ioctls" --> hk
    gnet -. "operator nftables" .-> hk

Layer 1: Hardware Virtualization

The outermost barrier is the one the CPU enforces without software assistance. Intel VMX introduces two orthogonal operating modes: VMX root operation, where the hypervisor and host OS run, and VMX non-root operation, where the guest runs. Each mode retains the usual CPL 0-3 ring hierarchy, but a guest OS running at CPL 0 in VMX non-root mode does not have full ring-0 privilege. Privileged actions the guest attempts (writing CR3, accessing an MSR, executing INVLPG) are governed by the VM-execution control fields in the VMCS (Virtual Machine Control Structure): depending on how the hypervisor programs those fields, an instruction either runs against guest-only state or causes a VM exit.

On a VM exit, the CPU saves guest state into the VMCS guest-state area and loads host state from the host-state area (CR0, CR3, CR4, segment selectors, and RIP and RSP from the HOST_RIP and HOST_RSP fields) as one atomic transition. Linux's arch/x86/kvm/vmx/vmenter.S notes this directly: "After a successful VMRESUME/VMLAUNCH, control flow 'magically' resumes below at vmx_vmexit due to the VMCS HOST_RIP setting." The guest did not transfer control voluntarily; the CPU forced the transition and simultaneously switched to a separate address space.

That vmenter.S path also zeroes all general-purpose registers except RSP and RBX before returning to host code, preventing speculative use of guest register values in host execution paths. RSB (Return Stack Buffer) clearing and SPEC_CTRL MSR handling are applied as post-exit mitigations for Spectre-class side channels — more on those below.

AMD SVM is structurally parallel. The VMCB (Virtual Machine Control Block) is divided into a control area (intercepts, ASID, ASID flush bits) and a save area (guest register state). VMRUN saves host state to the area pointed at by the HSAVE_PA MSR and enters the guest; #VMEXIT reverses it.

The KVM API exposes this hardware boundary through a three-scope ioctl hierarchy: system ioctls on /dev/kvm, VM ioctls on the file descriptor that KVM_CREATE_VM returns, and vCPU ioctls on the descriptor that KVM_CREATE_VCPU returns. Applications verify that KVM_GET_API_VERSION returns 12, create a VM with KVM_CREATE_VM (_IO(KVMIO, 0x01)) on /dev/kvm, register guest memory on the VM descriptor with KVM_SET_USER_MEMORY_REGION using struct kvm_userspace_memory_region (slot, flags, guest_phys_addr, memory_size, userspace_addr), and run a vCPU with KVM_RUN (_IO(KVMIO, 0x80), decimal 44672) on the vCPU descriptor. When a guest action requires VMM intervention, KVM sets kvm_run->exit_reason in the shared mmap region and returns. Common exit reasons include KVM_EXIT_IO (2) for port I/O, KVM_EXIT_MMIO (6) for MMIO, and KVM_EXIT_SHUTDOWN (8) for guest shutdown. Operations the host kernel can handle entirely in KVM (local APIC, IOAPIC, PIT) never cross the KVM_RUN boundary to userspace at all.
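The request numbers quoted above follow mechanically from the Linux _IO() encoding, which is easy to check by hand. The sketch below recomputes them in Python; it assumes KVMIO is 0xAE (the type byte defined in linux/kvm.h) and the x86-64 bit layout where the command number sits in bits 0-7 and the type byte in bits 8-15. The helper names io and describe_exit are illustrative and not part of any KVM binding.

```python
"""Recompute the KVM ioctl request numbers quoted in the text from the _IO() encoding."""

# Bit positions of the two fields _IO() uses (asm-generic/ioctl.h on x86-64):
# bits 0-7 hold the command number, bits 8-15 hold the type ("magic") byte.
IOC_NRSHIFT = 0
IOC_TYPESHIFT = 8

KVMIO = 0xAE  # KVM's ioctl type byte, per <linux/kvm.h>


def io(ioc_type: int, nr: int) -> int:
    """Equivalent of the C macro _IO(type, nr): no direction bits, zero size."""
    return (ioc_type << IOC_TYPESHIFT) | (nr << IOC_NRSHIFT)


KVM_GET_API_VERSION = io(KVMIO, 0x00)
KVM_CREATE_VM = io(KVMIO, 0x01)
KVM_RUN = io(KVMIO, 0x80)

# The three exit reasons named in the text, numbered as in <linux/kvm.h>.
EXIT_REASONS = {2: "KVM_EXIT_IO", 6: "KVM_EXIT_MMIO", 8: "KVM_EXIT_SHUTDOWN"}


def describe_exit(reason: int) -> str:
    """Map a kvm_run->exit_reason value to its name if it is one of the three above."""
    return EXIT_REASONS.get(reason, f"unhandled exit reason {reason}")


if __name__ == "__main__":
    print(hex(KVM_CREATE_VM), hex(KVM_RUN), KVM_RUN)
    print(describe_exit(6))
```

Because _IO() carries no size or direction bits, the whole request number is just the type byte shifted left by eight plus the command number, which is why KVM_RUN comes out as 0xAE80.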

The guarantee this layer provides is direct: guest code cannot read or write host memory, cannot execute privileged host instructions, and cannot modify host page tables. What it does not guarantee is that KVM's own kernel-mode code is bug-free — and that caveat is exactly where CVE-2021-29657 sits.

Layer 2: Seccomp BPF Filters

Chapter 19 walked through the seccomp(2) mechanism and Firecracker's filter allow-lists in detail. Here the relevant frame is what the filters contribute to the layered barrier.

Suppose a guest has gotten past the hardware boundary, for instance through a flaw in device emulation, and is now executing arbitrary code inside a vCPU thread of the Firecracker process on the host. Without any further containment, that thread can call every syscall the process is permitted to call: socket, execve, fork, mount. The host kernel evaluates each one against the Firecracker process's credentials and the host's network configuration. A guest that can issue arbitrary syscalls on the host has escaped.

Seccomp BPF filters answer this by restricting which syscalls each thread in the Firecracker process can reach, before the question of whether any individual call is malicious even arises. The filter policy is not system-wide; it is per-thread, defined in a JSON policy that seccompiler-bin compiles at build time into BPF bytecode embedded in the firecracker binary. On x86_64-unknown-linux-musl (main branch) that policy defines three filters, one per thread category (API, VMM, and vCPU), each with its own allow-list.

The default_action across all three filters is "trap", which maps to SECCOMP_RET_TRAP and delivers SIGSYS to the offending thread. An unlisted call does not return an error: Firecracker treats the signal as fatal and the process exits. An operator can supply a custom pre-compiled filter via --seccomp-filter, but the default posture is deny-by-default.

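Beyond the syscall number, Firecracker uses the argument-evaluation capability of seccomp BPF: the filter receives struct seccomp_data, whose args array holds the six syscall arguments, and can compare those values, though it cannot dereference pointers. In an escape scenario this narrows even the permitted syscalls to the specific uses Firecracker needs; socket(), for example, is allowed only with AF_UNIX as its first argument.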

The combined effect: a compromised VMM process cannot open a network socket, cannot JIT executable code, and cannot escalate through signal tricks. It is constrained to the exact ioctls and syscalls Firecracker itself needs to run.
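To make the allow-list semantics concrete, here is a toy model of how such a policy decides. It is not BPF and not Firecracker's code: the Policy, Rule, and ArgEq classes are hypothetical, and the three-entry allow-list is a deliberately tiny stand-in for the real per-thread lists. AF_UNIX = 1 comes from the text; AF_PACKET = 17 is the Linux constant, used only to show a rejected call.

```python
"""Toy model of an allow-list seccomp policy with argument matching (not real BPF)."""
from dataclasses import dataclass, field

AF_UNIX = 1     # value given in the text
AF_PACKET = 17  # Linux constant, used here only as an example of a rejected family


@dataclass(frozen=True)
class ArgEq:
    """Require syscall argument `index` (0-5) to equal `value`."""
    index: int
    value: int


@dataclass(frozen=True)
class Rule:
    """Allow `syscall` when every condition on its arguments holds."""
    syscall: str
    conditions: tuple = ()


@dataclass
class Policy:
    rules: list = field(default_factory=list)
    default_action: str = "trap"  # mirrors default_action = "trap"

    def evaluate(self, syscall: str, args: tuple = ()) -> str:
        # Pad to six arguments, the number seccomp_data exposes to a filter.
        padded = tuple(args) + (0,) * (6 - len(args))
        for rule in self.rules:
            if rule.syscall != syscall:
                continue
            if all(padded[c.index] == c.value for c in rule.conditions):
                return "allow"
        # Nothing matched: fall through to the default action.
        return self.default_action


# Hypothetical, heavily shortened allow-list for illustration only.
EXAMPLE_POLICY = Policy(rules=[
    Rule("read"),
    Rule("write"),
    Rule("socket", (ArgEq(0, AF_UNIX),)),
])

if __name__ == "__main__":
    print(EXAMPLE_POLICY.evaluate("socket", (AF_UNIX, 1, 0)))
    print(EXAMPLE_POLICY.evaluate("socket", (AF_PACKET, 3, 0)))
    print(EXAMPLE_POLICY.evaluate("execve"))
```

The point of the model is the fall-through: a syscall that appears on the list with the wrong argument value is treated exactly like one that is not listed at all.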

Installing a seccomp BPF filter requires either CAP_SYS_ADMIN or a prior prctl(PR_SET_NO_NEW_PRIVS, 1) call; the kernel returns -EACCES otherwise. Firecracker uses PR_SET_NO_NEW_PRIVS — which also prevents execve from granting the child more privileges than the parent, closing an escalation path before any filter is in place.

Layer 3: The Jailer

The jailer is a separate binary that the operator starts with root privileges. Its job is to set up every privileged resource Firecracker needs, drop those privileges, then exec() into the firecracker binary. After that handoff, firecracker can only access resources the jailer explicitly created inside the jail before transferring control.

The sequence of operations the jailer binary performs, in order:

  1. Places the process into a cgroup (v1: by writing to the tasks file under one of cpuset, cpu,cpuacct, memory, net_cls,net_prio, or pids; v2: by writing to cgroup.procs). The --cgroup-version flag selects which (default: v1).
  2. Creates a new mount namespace with unshare(), then calls pivot_root() — not the older chroot() — to establish a jail root at <chroot_base>/<exec_file_name>/<id>/root.
  3. Creates only the device nodes Firecracker actually needs inside the chroot: /dev/net/tun and /dev/kvm. Nothing else is mknod()'d.
  4. Optionally creates a PID namespace via clone(CLONE_NEWPID) (the --new-pid-ns flag) and joins an existing network namespace via setns(fd, CLONE_NEWNET) (the --netns flag).
  5. Closes all file descriptors that were not explicitly inherited, using close_range(3, UINT_MAX, CLOSE_RANGE_UNSHARE) (close_range syscall, requires kernel 5.9+).
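  6. Drops privilege with setgid(gid) followed by setuid(uid), switching to a unique non-privileged uid/gid per instance. The group must change first, because a process that has already given up root can no longer call setgid().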
  7. Sets resource limits: setrlimit(RLIMIT_FSIZE) and setrlimit(RLIMIT_NOFILE), the latter defaulting to 2048 file descriptors.
  8. exec()s into firecracker.
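The jail root in step 2 is derived from three operator-supplied inputs, and it can help to see that derivation spelled out. The sketch below builds the path with pathlib; the function name jail_root, the example base directory /srv/jailer, and the id check are assumptions made for illustration, not the jailer's own validation logic.

```python
"""Sketch: derive the jail root <chroot_base>/<exec_file_name>/<id>/root."""
from pathlib import PurePosixPath


def jail_root(chroot_base: str, exec_file: str, vm_id: str) -> PurePosixPath:
    """Return the directory that would become "/" for the jailed process."""
    # Only the file name of the executable is used, not its directory.
    exec_name = PurePosixPath(exec_file).name
    if not exec_name:
        raise ValueError("exec_file must name a file")
    # Illustrative check (an assumption, not the jailer's rule set): refuse ids
    # that would change the shape of the path.
    if not vm_id or "/" in vm_id or vm_id in (".", ".."):
        raise ValueError("vm_id must be a single path component")
    return PurePosixPath(chroot_base) / exec_name / vm_id / "root"


if __name__ == "__main__":
    # Example inputs; the base directory is an assumed value.
    print(jail_root("/srv/jailer", "/usr/bin/firecracker", "foo"))
```

Because the id becomes a path component under a directory that a privileged process writes to, an orchestrator generating ids from tenant-supplied data would want a check at least this strict before invoking the jailer.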

The use of pivot_root() rather than chroot() is meaningful. chroot() only changes the root directory for path resolution; a process with CAP_SYS_CHROOT can break out of a chroot() jail. pivot_root() replaces the entire mount namespace root, so the old filesystem tree is genuinely unreachable unless the jailer explicitly binds it in — which it does not.

The jailer's own inputs are treated as trusted. The jailer documentation is explicit: it is the operator's responsibility to ensure that jailer input paths cannot be tampered with by unprivileged local users. The jailer does not defend against a malicious operator; it defends against a compromised firecracker process trying to reach host resources.

sequenceDiagram
    participant Op as "Operator / orchestrator"
    participant J as "jailer (privileged)"
    participant FC as "firecracker"
    participant KVM as "KVM / host kernel"
    Op->>J: exec jailer --id foo --uid 123 --gid 123
    J->>KVM: cgroup assignment, unshare, pivot_root
    J->>KVM: mknod /dev/kvm, /dev/net/tun inside chroot
    J->>J: close_range, setgid(123), setuid(123)
    J->>FC: exec firecracker (inside jail, unprivileged)
    FC->>KVM: KVM_CREATE_VM on /dev/kvm
    Note over FC: seccomp filters applied before guest starts
    FC->>KVM: KVM_RUN (guest vCPU executes)

Device Model as Attack Surface Reduction

The device model bounds what the attacker has to aim at in the first place — not by filtering, but by the code not being present.

Firecracker's emulated device set is: VirtIO Net, VirtIO Block, and VirtIO Vsock (all over a virtio-mmio transport with I/O rate limiting); a serial console (8250 UART); a partial i8042 keyboard controller used only for reboot signaling; and the PIC, IOAPIC, and PIT handled entirely within KVM's in-kernel emulation. That is the complete list. There is no PCI bus, no BIOS or firmware, no ACPI, no USB controller, no GPU, no floppy disk controller, no sound device, no legacy ISA devices beyond i8042 and 8250. The guest boots via a direct-boot protocol straight to the Linux kernel, with no firmware layer in the path.

This absence is a security property. VENOM (CVE-2015-3456) exploited the floppy disk controller emulation in QEMU: fdctrl_handle_drive_specification_command() wrote into the controller's 512-byte FIFO buffer, but a missing data_pos reset in one branch of the fifth-parameter handling allowed the write pointer to advance past the buffer boundary on every subsequent I/O byte. A privileged guest user writing to the FDC I/O port could overflow the heap region immediately following that FIFO. CVSS v2 base score: 7.7 (HIGH), fixed via commit e907746266721f305d67bc0718795fedee2e824c, released in QEMU after 2.3.0. In Firecracker, the code path does not exist. You cannot exploit emulation for a device the VMM did not implement.

The same logic applies to the virtio descriptor table, which Firecracker does parse. The virtio 1.2 specification defines struct virtq_desc as 16 bytes: le64 addr (guest-physical), le32 len, le16 flags, le16 next. The flags field carries VIRTQ_DESC_F_NEXT (bit 0, value 1) for descriptor chaining and VIRTQ_DESC_F_WRITE (bit 1, value 2) for write-only device buffers. Maximum queue size is 32,768 entries. The spec places a MUST-level obligation on the VMM at section 2.7.5.1: "A device MUST NOT write to any descriptor table entry." Equally, the VMM must validate all descriptor fields before acting on them — the addr, len, and next fields are all guest-controlled and all could be crafted to manipulate host memory if bounds checks are missing.
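A minimal sketch of the validation described above, assuming a flat guest memory of guest_mem_size bytes starting at guest-physical address 0. The field layout and flag values are the ones quoted from the specification; the function names and the simplified memory model are illustrative and do not reflect how Firecracker structures its own checks.

```python
"""Sketch: parse 16-byte virtq_desc entries and validate a guest-supplied chain."""
import struct
from typing import List, NamedTuple

VIRTQ_DESC_F_NEXT = 1   # descriptor continues via the `next` field
VIRTQ_DESC_F_WRITE = 2  # buffer is write-only for the device

DESC_FORMAT = "<QIHH"  # le64 addr, le32 len, le16 flags, le16 next
DESC_SIZE = struct.calcsize(DESC_FORMAT)  # 16 bytes


class Desc(NamedTuple):
    addr: int
    length: int
    flags: int
    next_index: int


def parse_desc(table: bytes, index: int) -> Desc:
    """Decode descriptor `index` from the raw descriptor table."""
    offset = index * DESC_SIZE
    if index < 0 or offset + DESC_SIZE > len(table):
        raise IndexError("descriptor index outside table")
    return Desc(*struct.unpack_from(DESC_FORMAT, table, offset))


def validate_chain(table: bytes, head: int, queue_size: int,
                   guest_mem_size: int) -> List[Desc]:
    """Walk a chain, rejecting every guest-controlled field that is out of bounds."""
    chain: List[Desc] = []
    index = head
    while True:
        if index >= queue_size:
            raise ValueError("next index outside queue")
        if len(chain) >= queue_size:
            # A valid chain cannot hold more descriptors than the queue has.
            raise ValueError("descriptor loop")
        desc = parse_desc(table, index)
        # Python integers do not wrap, so addr + length is an exact comparison.
        if desc.addr + desc.length > guest_mem_size:
            raise ValueError("buffer outside guest memory")
        chain.append(desc)
        if not desc.flags & VIRTQ_DESC_F_NEXT:
            return chain
        index = desc.next_index
```

In a language with fixed-width integers the addr + len sum must additionally be guarded against wraparound, which is one of the ways such a check goes missing in practice.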

CVE-2019-14835 is the canonical example of what happens when that validation is absent. The get_indirect() function in the vhost-net kernel driver (drivers/vhost/vhost.c) iterated up to USHRT_MAX + 1 (65,536) times writing to a log buffer during live migration, without checking that *log_num stayed within the actual buffer size. A guest could craft descriptor tables with large len values to trigger the overflow during a migration event, achieving kernel heap overflow. CVSS 3.1: 7.8 HIGH. Introduced in Linux 2.6.34, fixed in Linux 5.3, commit 060423bfdee3f8bc6e2c1bac97de24d5415e2bc4. This was a kernel-mode virtio backend, not a userspace VMM, but the attack surface is the same: guest-controlled descriptor table content reaching code that fails to validate it.

Comparison to Container Isolation

The contrast between container isolation and microVM isolation is not a matter of degree. It is a categorical difference in what the attacker reaches if the first barrier breaks.

A container shares the host kernel. Every syscall a containerized process issues goes directly to the same kernel all other containers on the host are running on. Docker's default seccomp profile blocks approximately 44 syscalls out of 300-plus, leaving roughly 256 reachable to the host kernel by default. Research from 2025 (arxiv:2510.03720) showed that optimized per-container syscall limiting can reduce the average allowed set to roughly 87 syscalls, which would statically prevent exploitation of 87 CVEs in the study's dataset — a meaningful improvement, but still operating entirely within the shared-kernel model.

Consider CVE-2022-0847, Dirty Pipe. The root cause was a flags field in struct pipe_buffer that the splice paths left uninitialized; commit f6dd975583bd ("pipe: merge anon_pipe_buf*_ops"), introduced in Linux 5.8, made a stale PIPE_BUF_FLAG_CAN_MERGE flag exploitable. An unprivileged process could fill all pipe ring slots (setting the flag on each), drain the pipe so the slots were free but the flags remained, then splice() a read-only file's page into the pipe (inheriting the flag from the ring), then write() to append into the page cache, silently overwriting read-only file content without write permission. Fixed in Linux 5.16.11, 5.15.25, and 5.10.102 on 2022-02-23.

From a container, this attack calls pipe(2), splice(2), and write(2) — all syscalls Docker's default profile permits — directly into the host kernel's syscall handler. The result is host page cache overwrite. From a microVM guest, those same calls go to the guest OS kernel. The host kernel does not see them. Reaching the host would require first escaping the hardware VMX/SVM boundary, surviving the seccomp filter, and escaping the jailer's pivot_root() jail — three barriers that are not present in the container model.

Similarly, CVE-2017-7308 let an unprivileged user reach privilege escalation by crafting setsockopt() calls via AF_PACKET sockets. From a container, socket(AF_PACKET, ...) may reach the host kernel depending on the container's seccomp profile. From a Firecracker VMM process, socket() is allowed by the seccomp filter only for AF_UNIX (value 1) — AF_PACKET is not on the list, and the default action is SECCOMP_RET_TRAP.

| Property | Container (default Docker) | Firecracker microVM |
|---|---|---|
| Kernel boundary | Namespace + cgroup (shared kernel) | Hardware VMX/SVM + KVM |
| Syscall surface to host | ~256 of 300+ reachable | vCPU thread: ~47 via seccomp |
| Default seccomp posture | ~44 syscalls blocked | All threads: default_action=trap |
| Device attack surface | Full host driver stack | 8 emulated devices; no PCI/BIOS/USB |
| Escape path | Single shared-kernel bug sufficient | KVM escape + VMM exploit + seccomp bypass + jailer escape |

The KVM Boundary Is Not Inviolable

The hardware virtualization boundary stops the vast majority of guest-originating attacks because it is enforced by CPUs that AMD and Intel have spent decades hardening. But the boundary is not a proof — it is an engineering artifact, and KVM's kernel-mode code is in scope for bugs.

The Project Zero analysis of CVE-2021-29657 was, by its author's account, the first public writeup of a KVM guest-to-host breakout that did not rely on any bug in QEMU or another userspace VMM. Affected kernels: v5.10 through v5.12-rc6, patched in March 2021. The attack targeted KVM's AMD SVM-specific kernel-mode code directly from a guest vCPU, without requiring the guest to first manipulate the VMM process; Intel VMX users were not affected. A comparable Intel-VMX-specific guest-to-host breakout has not been publicly demonstrated. The CVE is nonetheless a proof of concept that the "KVM escape" scenario the rest of the containment model is designed for is not purely theoretical.

The practical implication for the threat model is this: the seccomp filters and the jailer are not fallback measures deployed on the assumption that the hardware boundary works. They are independent containment layers designed for the scenario where the hardware boundary has already failed.

Microarchitectural Side Channels

Software barriers are not the only threat surface. Several CPU microarchitectural vulnerabilities allow cross-tenant information leakage through shared hardware state that the VMM cannot observe or block in software.

CVE-2018-3646 (L1TF / Foreshadow-VMM) is the defining example. When an x86 PTE has its Present bit cleared, the CPU can speculatively treat the physical-address bits of that PTE as valid and load matching data from the L1D cache before the page fault is raised and before the permission check that would stop it. With Hyper-Threading enabled, a guest vCPU running on one logical processor can speculatively read L1D contents populated by the host on the sibling logical processor of the same physical core. The mitigation MSR is IA32_FLUSH_CMD at address 0x10B, which is write-only; writing bit 0 (L1D_FLUSH, value 1) flushes and invalidates the L1D on the executing physical core. Support is enumerated via CPUID.(EAX=07H,ECX=0):EDX[28]. Susceptibility can be checked via the IA32_ARCH_CAPABILITIES MSR at 0x10A; bit 0 (RDCL_NO, value 1) indicates the processor is not vulnerable. KVM exposes the mitigation via /sys/module/kvm_intel/parameters/vmentry_l1d_flush: cond (flush only after non-audited code paths, default) or always (unconditional, with 1-50% performance overhead depending on VM exit rate).
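The bit positions in the paragraph above can be turned into a small decoder. The sketch below uses the MSR addresses and bit numbers from the text; the function names are illustrative. The read_msr helper assumes the Linux msr driver is loaded and the caller is root, and it should only be pointed at readable MSRs such as IA32_ARCH_CAPABILITIES, since IA32_FLUSH_CMD is write-only.

```python
"""Sketch: decode the L1TF-related enumeration bits described in the text."""
import os
import struct

IA32_ARCH_CAPABILITIES = 0x10A
IA32_FLUSH_CMD = 0x10B  # write-only; listed for reference, never read here

RDCL_NO = 1 << 0                    # IA32_ARCH_CAPABILITIES bit 0
L1D_FLUSH = 1 << 0                  # IA32_FLUSH_CMD bit 0
CPUID_7_0_EDX_L1D_FLUSH = 1 << 28   # CPUID.(EAX=07H,ECX=0):EDX[28]


def reports_rdcl_no(arch_capabilities: int) -> bool:
    """True if the RDCL_NO bit is set in an IA32_ARCH_CAPABILITIES value."""
    return bool(arch_capabilities & RDCL_NO)


def l1d_flush_supported(cpuid_7_0_edx: int) -> bool:
    """True if the CPUID leaf 7 EDX value enumerates IA32_FLUSH_CMD support."""
    return bool(cpuid_7_0_edx & CPUID_7_0_EDX_L1D_FLUSH)


def read_msr(cpu: int, msr: int) -> int:
    """Read a 64-bit MSR through /dev/cpu/<n>/msr (needs root and the msr driver)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        # The msr device uses the file offset as the MSR address.
        return struct.unpack("<Q", os.pread(fd, 8, msr))[0]
    finally:
        os.close(fd)


if __name__ == "__main__":
    try:
        caps = read_msr(0, IA32_ARCH_CAPABILITIES)
        print("RDCL_NO:", reports_rdcl_no(caps))
    except OSError as exc:
        print("could not read MSR:", exc)
```

On most hosts the kernel's own summary under /sys/devices/system/cpu/vulnerabilities is the more convenient source; decoding the raw bits is mainly useful when auditing what that summary is based on.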

CVE-2017-5715 (Spectre v2) targets the branch predictor. The mitigation MSRs are IA32_SPEC_CTRL at 0x48 (IBRS = bit 0, STIBP = bit 1) and IA32_PRED_CMD at 0x49 (IBPB = bit 0, write-only). KVM exposes these to guests and handles the host-side save/restore: on VM exit it saves the guest's IA32_SPEC_CTRL value and puts the host's value back, and on VM entry it reloads the guest's saved value. With classic IBRS that means rewriting the MSR around guest execution; CPUs with enhanced IBRS (eIBRS, widely available since ~2019) let the host set IBRS once and leave it set, so no per-exit toggle is needed for host protection. The RSB is flushed on every VM exit. Current mitigation status is visible at /sys/devices/system/cpu/vulnerabilities/spectre_v2.

Neither of these is a VMM bug in the conventional sense. They are properties of the physical hardware, and the software mitigation for both converges on the same recommendation in Firecracker's production host setup guide: disable SMT (Hyper-Threading) entirely. With SMT disabled, no sibling logical processor can speculatively read L1D contents belonging to another tenant. The L1D flush MSR then becomes a belt-and-suspenders measure rather than the primary defense.

A complete production host mitigation table, drawn from Firecracker's prod-host-setup.md:

| Mitigation | Mechanism | Threat addressed |
|---|---|---|
| Disable SMT | Kernel boot parameter or BIOS | Cross-tenant L1D leakage via sibling threads (L1TF) |
| Disable KSM | echo 0 > /sys/kernel/mm/ksm/run | Cross-tenant timing attacks via page deduplication |
| Disable swap | swapoff -a | Guest memory remanence on storage media |
| DDR4 with TRR + ECC | Hardware selection | Rowhammer |
| kvm.nx_huge_pages=never | Kernel module parameter | iTLB multihit regression (Linux 6.1, x86-64) |
| Updated CPU microcode | Distribution security updates | All speculative execution side channels |

Note: Disabling KSM (/sys/kernel/mm/ksm/run) and swap require root on the host. Consult your platform's hardening guide before making these changes in production.
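Three of the rows in the table can be checked read-only from procfs and sysfs. The sketch below is one way to do that, assuming the standard Linux paths /sys/kernel/mm/ksm/run, /sys/devices/system/cpu/smt/active, and /proc/swaps; the function name check_host and its root parameter (which exists so the check can be pointed at a test directory) are illustrative. A value of None means the file was absent or unreadable and nothing can be concluded.

```python
"""Sketch: read-only check of KSM, SMT, and swap state on a Linux host."""
from pathlib import Path
from typing import Dict, Optional


def _read(path: Path) -> Optional[str]:
    """Return the stripped file content, or None if it cannot be read."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def check_host(root: str = "/") -> Dict[str, Optional[bool]]:
    """Report whether KSM is off, SMT is inactive, and no swap is configured."""
    base = Path(root)
    ksm = _read(base / "sys/kernel/mm/ksm/run")
    smt = _read(base / "sys/devices/system/cpu/smt/active")
    swaps = _read(base / "proc/swaps")
    return {
        # ksm/run holds 0 when KSM is stopped.
        "ksm_disabled": None if ksm is None else ksm == "0",
        # smt/active holds 0 when no sibling threads are online.
        "smt_inactive": None if smt is None else smt == "0",
        # /proc/swaps has a header line plus one line per active swap area.
        "swap_disabled": None if swaps is None else len(swaps.splitlines()) <= 1,
    }


if __name__ == "__main__":
    for name, ok in check_host().items():
        print(f"{name}: {ok}")
```

The check only reads state; it reports what the host currently does and leaves the decision to change it with the operator.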

On ARM, Firecracker resets the CNTPCT physical counter at VM boot only when KVM_CAP_COUNTER_OFFSET is available, which requires kernel 6.4 or later. On older host kernels the guest reads the host's physical counter value directly.

Snapshot Trust and Operational Hazards

Snapshot files — the VM state file, the memory snapshot, and the disk image — are classified as trusted in Firecracker's threat model. This is not a strong guarantee. Firecracker applies a 64-bit CRC to the VM state file for partial corruption detection; it does not authenticate or encrypt snapshot content, and the CRC covers only the state file, not the memory snapshot or the disk image. All three files must be independently secured by the operator — an attacker who can modify a snapshot file can inject arbitrary guest state.
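Since the CRC detects corruption but not tampering, an operator who wants tamper evidence has to add it outside Firecracker. One possible approach, sketched below, is to compute an HMAC-SHA256 tag over each snapshot file when it is written and verify the tags before resume. This is an operator-side assumption, not a Firecracker feature; the function names are hypothetical, and the sketch presumes the key and the resulting manifest are stored where the party able to modify snapshot files cannot reach them.

```python
"""Sketch: operator-side HMAC tags for snapshot files (not a Firecracker feature)."""
import hashlib
import hmac
from pathlib import Path
from typing import Dict, Iterable


def file_tag(key: bytes, path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex HMAC-SHA256 of a file, read in chunks to bound memory use."""
    mac = hmac.new(key, digestmod=hashlib.sha256)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            mac.update(chunk)
    return mac.hexdigest()


def seal_snapshot(key: bytes, paths: Iterable[Path]) -> Dict[str, str]:
    """Build a manifest mapping each file path to its tag."""
    return {str(path): file_tag(key, Path(path)) for path in paths}


def verify_snapshot(key: bytes, manifest: Dict[str, str]) -> bool:
    """True only if every file in the manifest still matches its recorded tag."""
    return all(
        hmac.compare_digest(file_tag(key, Path(path)), tag)
        for path, tag in manifest.items()
    )
```

A resume path would call verify_snapshot on the state file, the memory file, and the disk image, and refuse to load the snapshot if it returns False. The tags say nothing about confidentiality; encryption at rest is a separate control.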

Resuming identical snapshots multiple times creates a subtler hazard: UUID collisions, reuse of entropy pool state, repeated cryptographic tokens, and reused RNG seeds across multiple resumed instances. If snapshot triggering is exposed to customers, operators must enforce disk quotas to prevent DoS via unbounded snapshot files.

There are two configuration hazards worth naming before a deployment reaches production. The 8250 serial device can cause unbounded memory and storage usage on the host if guest output is not rate-limited; the production guidance is to disable it with the kernel command line argument 8250.nr_uarts=0. The MMDS (MicroVM Metadata Service) is accessible from the guest at 169.254.169.254 by default; operators must block it at the host with an nftables rule targeting TAP interfaces:

Note: The commands below modify the host firewall. The firecracker table and filter chain must exist before adding the rule; create them once if they do not (see prod-host-setup.md for the full setup). Verify the rule does not conflict with existing nftables rulesets before applying.

nft add table ip firecracker
nft add chain ip firecracker filter { type filter hook forward priority 0 \; }
nft add rule ip firecracker filter iifname "tap*" ip daddr 169.254.169.254 counter drop

The threat model fixes the adversary's position and the defender's posture; Chapter 21 covers how to validate both in practice, using automated policy checks and runtime attestation to confirm that the barriers described here are actually in place on a production host.

Sources And Further Reading