Chapter 19: Seccomp In Firecracker

Imagine a guest has found a bug in KVM. The hypervisor boundary has failed, and the guest is now executing arbitrary code inside a vCPU thread on the host. That thread is a regular Linux thread. Without further containment it can call any syscall the process is permitted to call: socket, execve, fork, mount. The kernel will evaluate each call on the host, in the context of the Firecracker process, with the firewall rules and network interfaces the host has configured. A guest that can issue syscalls on the host is a guest that has escaped.

This is the problem seccomp solves in Firecracker. Not by blocking the guest from breaking the KVM barrier — that is VT-x and SVM's job — but by making the prize on the other side of the barrier worthless. If a compromised vCPU thread can only call the twenty-four syscalls it actually needs, the host's attack surface from that thread shrinks to those twenty-four calls and nothing else.

Firecracker's design document makes this explicit: all vCPU threads are treated as running malicious code from the moment they start. The seccomp filter is not a defense of last resort. It is the expected containment path for a thread that has been taken over by a guest.

The Kernel Mechanism: seccomp(2) BPF

The tool Firecracker uses is seccomp(2) in filter mode (SECCOMP_SET_MODE_FILTER), available since Linux 3.17 as syscall number 317 on x86-64. A caller passes a classic BPF (cBPF) program to the kernel via this syscall; from that point on, every syscall entry for that thread runs the BPF program first. The program's return value decides the call's fate before the kernel dispatches it to the syscall's handler.

The BPF program receives a single argument: a pointer to a struct seccomp_data in the kernel's address space:

struct seccomp_data {
    int   nr;                  /* syscall number */
    __u32 arch;                /* AUDIT_ARCH_* value */
    __u64 instruction_pointer; /* RIP at time of syscall */
    __u64 args[6];             /* up to 6 syscall arguments */
};

The arch field is not cosmetic. On x86-64, a 32-bit i386 compat call with nr == 0 resolves to restart_syscall, whereas the same nr on the x86-64 native ABI resolves to read. A filter that checks nr without first confirming arch == AUDIT_ARCH_X86_64 can be bypassed by routing a dangerous call through the wrong ABI. Firecracker's compiled filters always check the architecture first.
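A minimal sketch of why the architecture check comes first. The struct mirrors the kernel layout shown above, and the two AUDIT_ARCH_* constants are the standard values from the kernel's audit headers. The Verdict enum and the classify and call functions are hypothetical names used only for illustration; this is ordinary Rust standing in for a filter's logic, not Firecracker's compiled BPF.

```rust
use std::mem::size_of;

/// Mirrors the kernel's `struct seccomp_data` (64 bytes on x86-64).
#[repr(C)]
#[allow(dead_code)]
struct SeccompData {
    nr: i32,                  // offset 0
    arch: u32,                // offset 4
    instruction_pointer: u64, // offset 8
    args: [u64; 6],           // offset 16
}

const AUDIT_ARCH_X86_64: u32 = 0xc000_003e;
const AUDIT_ARCH_I386: u32 = 0x4000_0003;

/// Hypothetical verdicts, for illustration only.
#[derive(Debug, PartialEq)]
enum Verdict {
    Allow,
    Trap,
}

/// Check the ABI first; only then is `nr` meaningful.
fn classify(data: &SeccompData) -> Verdict {
    if data.arch != AUDIT_ARCH_X86_64 {
        return Verdict::Trap; // foreign ABI: nr 0 is not `read` here
    }
    match data.nr {
        0 => Verdict::Allow, // `read` on the native x86-64 ABI
        _ => Verdict::Trap,
    }
}

/// Helper that builds a record with only `nr` and `arch` set.
fn call(nr: i32, arch: u32) -> SeccompData {
    SeccompData { nr, arch, instruction_pointer: 0, args: [0; 6] }
}

fn main() {
    println!("sizeof(seccomp_data) = {}", size_of::<SeccompData>());
    println!("native nr=0: {:?}", classify(&call(0, AUDIT_ARCH_X86_64)));
    println!("compat nr=0: {:?}", classify(&call(0, AUDIT_ARCH_I386)));
}
```

If the arch comparison were dropped, the compat call with nr == 0 would take the same branch as a native read, which is exactly the bypass described above.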

The return value from the BPF program is one of eight actions, in priority order from highest to lowest:

| Action | Hex value | Effect |
| --- | --- | --- |
| SECCOMP_RET_KILL_PROCESS | 0x80000000 | Kill entire process (Linux 4.14+) |
| SECCOMP_RET_KILL_THREAD | 0x00000000 | Kill calling thread |
| SECCOMP_RET_TRAP | 0x00030000 | Deliver SIGSYS |
| SECCOMP_RET_ERRNO | 0x00050000 | Return errno (low 16 bits) |
| SECCOMP_RET_USER_NOTIF | 0x7fc00000 | Notify userspace fd (Linux 5.0+) |
| SECCOMP_RET_TRACE | 0x7ff00000 | Notify ptrace tracer |
| SECCOMP_RET_LOG | 0x7ffc0000 | Allow and log (Linux 4.14+) |
| SECCOMP_RET_ALLOW | 0x7fff0000 | Permit syscall |

When a thread has multiple chained filters, because seccomp(2) was called more than once, the kernel runs all of them and uses the action with the highest priority. The kernel enforces a hard cap of 4,096 BPF instructions per individual program (BPF_MAXINSNS) and 32,768 instructions across all chained filters for a single thread (MAX_INSNS_PER_PATH).
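The priority order falls out of the numeric values: if the upper 16 action bits are read as a signed 32-bit integer, KILL_PROCESS becomes the most negative value and ALLOW the largest, so "highest priority" is simply "smallest signed value". The sketch below models that comparison in plain Rust. The function names action_rank and combine are illustrative, and the code is a model of the selection rule, not kernel source.

```rust
const SECCOMP_RET_ACTION_FULL: u32 = 0xffff_0000; // action bits
const SECCOMP_RET_DATA: u32 = 0x0000_ffff; // e.g. the errno value

const SECCOMP_RET_KILL_PROCESS: u32 = 0x8000_0000;
const SECCOMP_RET_KILL_THREAD: u32 = 0x0000_0000;
const SECCOMP_RET_TRAP: u32 = 0x0003_0000;
const SECCOMP_RET_ERRNO: u32 = 0x0005_0000;
const SECCOMP_RET_ALLOW: u32 = 0x7fff_0000;

/// Action bits reinterpreted as signed; smaller means higher priority.
fn action_rank(ret: u32) -> i32 {
    (ret & SECCOMP_RET_ACTION_FULL) as i32
}

/// Pick the winning return value among the results of chained filters.
fn combine(results: &[u32]) -> Option<u32> {
    results.iter().copied().min_by_key(|r| action_rank(*r))
}

fn main() {
    // Three filters disagree: the most restrictive action wins.
    let eperm = SECCOMP_RET_ERRNO | 1;
    let winner = combine(&[SECCOMP_RET_ALLOW, eperm, SECCOMP_RET_TRAP]);
    println!("winner = {:#010x}", winner.unwrap());

    // KILL_PROCESS outranks KILL_THREAD despite the larger unsigned value.
    let kill = combine(&[SECCOMP_RET_KILL_THREAD, SECCOMP_RET_KILL_PROCESS]);
    println!("kill   = {:#010x}", kill.unwrap());

    // The low 16 bits carry the errno for SECCOMP_RET_ERRNO.
    println!("errno  = {}", eperm & SECCOMP_RET_DATA);
}
```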

Before installing a filter, the calling thread must either hold CAP_SYS_ADMIN in its user namespace or have called prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0). The no-new-privs flag prevents the thread and any process it exec-s from gaining privileges, so there is no privilege-escalation path the filter needs to worry about. Firecracker always takes this path; it never requires CAP_SYS_ADMIN to install its filters.

Three Threads, Three Filters

Firecracker runs as three categories of threads, each with a fundamentally different job and a fundamentally different syscall requirement:

flowchart TD
    FC["firecracker process"]
    VMM["VMM thread\n~44 allowed syscalls\nKVM VM setup, virtio I/O"]
    API["API thread\n~26 allowed syscalls\nHTTP listener"]
    VCPU["vCPU thread (one per vCPU)\n~24 allowed syscalls\nKVM_RUN loop"]
    FC --> VMM
    FC --> API
    FC --> VCPU

The filter key for each thread — "vmm", "api", "vcpu" — maps to a separate BPF program compiled from the policy file at resources/seccomp/x86_64-unknown-linux-musl.json. None of the three programs is a subset of another; each is independently constructed to match what that thread actually calls.

The timing of filter installation matters. The VMM's filter goes in right before the event loop starts. The API thread's filter is installed right before the HTTP listener starts. The vCPU filter has the tightest constraint: the exact call chain is start_threaded() spawning a thread that calls Vcpu::run(filter), which calls apply_filter(filter) before StateMachine::run() and therefore before any KVM_RUN ioctl reaches the kernel. The filter is in place before the first guest instruction executes.

If apply_filter fails on a vCPU thread, Firecracker panics immediately. There is no fallback and no soft failure. A vCPU that cannot install its seccomp filter is a vCPU that must not run.

From JSON Policy to Embedded BPF

The compilation pipeline runs at build time, not at Firecracker startup. The policy file is resources/seccomp/x86_64-unknown-linux-musl.json; the compiled output is an embedded binary blob in the Firecracker executable.

The JSON root is an object keyed on thread name. Each value is a filter configuration with three fields: default_action, which applies when no rule matches; filter_action, which applies when a rule matches; and filter, the array of allowed syscall rules. All three filters in the x86-64 default policy use "default_action": "trap" and "filter_action": "allow". An unmatched syscall delivers SIGSYS to the offending thread; Firecracker's signal handler catches it, logs it, and converts it to a controlled shutdown. That is more useful for post-incident analysis than SECCOMP_RET_KILL, which terminates the thread or process before Firecracker can log anything itself.

A rule entry is a JSON object with a "syscall" name and an optional "args" array:

{
  "vcpu": {
    "default_action": "trap",
    "filter_action": "allow",
    "filter": [
      { "syscall": "read" },
      {
        "syscall": "ioctl",
        "args": [
          { "index": 1, "type": "dword", "op": "eq", "val": 44672,
            "comment": "KVM_RUN" }
        ]
      }
    ]
  }
}

A rule without "args" is a name-only match: any invocation of that syscall passes. A rule with "args" is an argument-level match: every condition in the array must be true simultaneously (AND semantics), and multiple rules for the same syscall are ORed.
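The AND-within-a-rule, OR-across-rules semantics can be modelled in a few lines. This is a sketch in ordinary Rust, not the seccompiler's data model: the Op, Cond, and Rule types and the allowed and sample_rules functions are hypothetical names, and the comparison is done on the full 64-bit argument for simplicity.

```rust
/// Comparison operators, mirroring a subset of the policy's "op" values.
#[derive(Clone, Copy)]
#[allow(dead_code)]
enum Op { Eq, Ne, Lt, Le, Gt, Ge }

/// One argument condition: compare args[index] against val.
struct Cond {
    index: usize,
    op: Op,
    val: u64,
}

/// One rule: a syscall name plus zero or more conditions.
struct Rule {
    syscall: &'static str,
    args: Vec<Cond>,
}

fn cond_holds(c: &Cond, args: &[u64; 6]) -> bool {
    let a = args[c.index];
    match c.op {
        Op::Eq => a == c.val,
        Op::Ne => a != c.val,
        Op::Lt => a < c.val,
        Op::Le => a <= c.val,
        Op::Gt => a > c.val,
        Op::Ge => a >= c.val,
    }
}

/// Allowed if ANY rule for the syscall has ALL of its conditions true.
/// A rule with no conditions matches every call (`all` over an empty
/// iterator is true), which is the name-only case.
fn allowed(rules: &[Rule], syscall: &str, args: &[u64; 6]) -> bool {
    rules
        .iter()
        .filter(|r| r.syscall == syscall)
        .any(|r| r.args.iter().all(|c| cond_holds(c, args)))
}

/// A tiny stand-in policy: name-only read, two argument-level ioctl rules.
fn sample_rules() -> Vec<Rule> {
    vec![
        Rule { syscall: "read", args: vec![] },
        Rule { syscall: "ioctl", args: vec![Cond { index: 1, op: Op::Eq, val: 44672 }] },
        Rule { syscall: "ioctl", args: vec![Cond { index: 1, op: Op::Eq, val: 44707 }] },
    ]
}

fn main() {
    let rules = sample_rules();
    println!("ioctl KVM_RUN: {}", allowed(&rules, "ioctl", &[3, 44672, 0, 0, 0, 0]));
    println!("ioctl other:   {}", allowed(&rules, "ioctl", &[3, 1, 0, 0, 0, 0]));
    println!("read any fd:   {}", allowed(&rules, "read", &[9, 0, 0, 0, 0, 0]));
    println!("socket:        {}", allowed(&rules, "socket", &[0; 6]));
}
```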

Each condition object has five fields:

| Field | Type | Values |
| --- | --- | --- |
| index | integer | 0–5, the position in args[6] |
| type | string | "dword" (32-bit) or "qword" (64-bit) |
| op | string | "eq", "ge", "gt", "lt", "le", "ne", "masked_eq" |
| val | integer | base-10 decimal |
| comment | string | annotation only, ignored by compiler |

The musl ioctl Quirk

When "type": "dword" is used with an "eq" comparison, the seccompiler does not emit a plain 64-bit equality check. It emits a SCMP_CMP_MASKED_EQ with the mask 0x00000000FFFFFFFF, which zeroes the upper 32 bits before comparing. The reason: musl's ioctl wrapper leaves garbage in the upper 32 bits of the request argument. A plain 64-bit eq on KVM_RUN (decimal 44672, 0xAE80) would reject legitimate calls whose upper bits happen to be non-zero. This is not a theoretical concern; it causes spurious SIGSYS signals in practice. seccomp_data.args[1] carries the full 64-bit register value, garbage included, while the kernel's ioctl handler interprets only the low 32 bits as the request. The masking makes the filter compare exactly the bits the kernel acts on.
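The difference between the two comparisons is easy to see with a concrete value. The sketch below is plain Rust arithmetic, not seccompiler code; qword_eq and dword_eq are illustrative names, and the "dirty" value is an invented example of an argument with non-zero upper bits.

```rust
/// Mask that keeps only the low 32 bits of a 64-bit argument.
const DWORD_MASK: u64 = 0x0000_0000_ffff_ffff;

const KVM_RUN: u64 = 44672; // 0xAE80

/// Plain 64-bit equality: upper-bit garbage causes a mismatch.
fn qword_eq(arg: u64, val: u64) -> bool {
    arg == val
}

/// Masked equality: compare only the low 32 bits of each side.
fn dword_eq(arg: u64, val: u64) -> bool {
    (arg & DWORD_MASK) == (val & DWORD_MASK)
}

fn main() {
    // Invented example: KVM_RUN in the low half, leftovers in the high half.
    let dirty: u64 = 0xdead_beef_0000_ae80;

    println!("qword eq: {}", qword_eq(dirty, KVM_RUN)); // rejects the call
    println!("dword eq: {}", dword_eq(dirty, KVM_RUN)); // accepts the call

    // A genuinely different request is still rejected by the masked form.
    println!("other:    {}", dword_eq(0xae81, KVM_RUN));
}
```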

The Compilation Steps

  1. Parse x86_64-unknown-linux-musl.json via serde_json into a BTreeMap<String, FilterConfig>. The use of BTreeMap rather than HashMap is deliberate: HashMap randomizes iteration order across builds, producing different BPF bytecode from identical inputs. PR #5298 fixed this by switching to BTreeMap, which guarantees deterministic key order and reproducible binaries.

  2. For each thread key, call seccomp_init(default_action) to create a libseccomp context, then seccomp_arch_add() with the target architecture constant: SCMP_ARCH_X86_64 = 0xc000003e for x86-64 or SCMP_ARCH_AARCH64 = 0xc00000b7 for ARM.

  3. Resolve each syscall name to a number via seccomp_syscall_resolve_name(), then add it with seccomp_rule_add() for name-only rules or seccomp_rule_add_array() for rules with argument conditions.

  4. Export the BPF bytecode via seccomp_export_bpf() to an anonymous memfd and read it back as Vec<u64>.

  5. Serialize the full BTreeMap<String, Vec<u64>> with the bitcode crate and write the result to disk.
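The determinism argument in step 1 rests on one property of the standard library: a BTreeMap iterates in sorted key order regardless of insertion order. The sketch below shows only that property. The thread_order function is a hypothetical helper, and the empty vectors stand in for compiled BPF programs.

```rust
use std::collections::BTreeMap;

/// Insert thread names in arbitrary order and return the order in which
/// a BTreeMap would hand them to a serializer.
fn thread_order(names: &[&str]) -> Vec<String> {
    let mut filters: BTreeMap<String, Vec<u64>> = BTreeMap::new();
    for name in names {
        // Placeholder for the compiled BPF program of this thread.
        filters.insert(name.to_string(), Vec::new());
    }
    filters.keys().cloned().collect()
}

fn main() {
    // Two different insertion orders produce the same iteration order.
    println!("{:?}", thread_order(&["vmm", "api", "vcpu"]));
    println!("{:?}", thread_order(&["vcpu", "vmm", "api"]));
}
```

With a HashMap in the same position, the standard library's randomly seeded hasher gives no such guarantee, so serialized output can differ between runs of the compiler.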

Firecracker v1.11.0 (PR #4926, merged 2025-01-16) migrated the compiler backend from a hand-written Rust BPF code generator to libseccomp. The migration reduced BPF program size by approximately 65%, which matters because the filter runs at every syscall entry on every vCPU thread for as long as the guest runs, so a larger program means more instructions to evaluate each time.

Runtime Installation

At startup the Firecracker process reads the embedded bitcode blob and deserializes it. The deserialization is capped at DESERIALIZATION_BYTES_LIMIT = 100,000 bytes to prevent an oversized custom filter file from triggering an unbounded allocation. The result is a map of thread name to Vec<u64> BPF programs.
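One generic way to enforce a byte cap before deserializing is to read through a length-limited adapter and reject anything that exceeds it. The sketch below uses only the standard library. The constant name and value come from the text above; read_bounded is a hypothetical helper, and how Firecracker itself applies the limit is not shown here.

```rust
use std::io::{self, Read};

const DESERIALIZATION_BYTES_LIMIT: u64 = 100_000;

/// Read at most `limit` bytes; fail if the input is larger.
fn read_bounded<R: Read>(reader: R, limit: u64) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    // Allow one extra byte so an oversized input can be detected
    // without ever buffering more than limit + 1 bytes.
    reader.take(limit + 1).read_to_end(&mut buf)?;
    if buf.len() as u64 > limit {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "filter blob exceeds size limit",
        ));
    }
    Ok(buf)
}

fn main() {
    let ok_blob = vec![0u8; 100_000];
    let big_blob = vec![0u8; 100_001];

    println!("at limit:   {:?}", read_bounded(&ok_blob[..], DESERIALIZATION_BYTES_LIMIT).map(|b| b.len()));
    println!("over limit: {:?}", read_bounded(&big_blob[..], DESERIALIZATION_BYTES_LIMIT).map(|b| b.len()));
}
```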

Each thread's filter is installed by apply_filter(bpf_filter: &[u64]) in src/vmm/src/seccomp.rs, which performs exactly four operations:

  1. If bpf_filter is empty, return Ok(()). This path is used for debug builds and the --no-seccomp flag.
  2. Call prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0).
  3. Construct a SockFprog with len = bpf_filter.len() as u16 and filter = bpf_filter.as_ptr().
  4. Call syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &sock_fprog). Flags are zero: no SECCOMP_FILTER_FLAG_TSYNC, no LOG, no NEW_LISTENER.

BPF instructions are stored as u64, not as sock_filter. A BPF instruction is 8 bytes with 4-byte alignment; u64 satisfies both constraints without requiring a custom struct at the call site. The kernel accepts the pointer regardless of which type the userspace caller uses.
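To make the u64 representation concrete, the sketch below packs the four fields of a classic BPF instruction (16-bit code, 8-bit jt, 8-bit jf, 32-bit k) into one 64-bit word, assuming a little-endian target such as x86-64. The helper names pack_insn and unpack_insn are illustrative and are not taken from Firecracker's source.

```rust
const BPF_RET_K: u16 = 0x06; // BPF_RET | BPF_K
const SECCOMP_RET_ALLOW: u32 = 0x7fff_0000;

/// Pack one cBPF instruction into a u64 with the same byte layout as
/// the kernel's `struct sock_filter` on a little-endian machine.
fn pack_insn(code: u16, jt: u8, jf: u8, k: u32) -> u64 {
    let mut bytes = [0u8; 8];
    bytes[0..2].copy_from_slice(&code.to_le_bytes());
    bytes[2] = jt;
    bytes[3] = jf;
    bytes[4..8].copy_from_slice(&k.to_le_bytes());
    u64::from_le_bytes(bytes)
}

/// Split a packed word back into (code, jt, jf, k).
fn unpack_insn(word: u64) -> (u16, u8, u8, u32) {
    let b = word.to_le_bytes();
    (
        u16::from_le_bytes([b[0], b[1]]),
        b[2],
        b[3],
        u32::from_le_bytes([b[4], b[5], b[6], b[7]]),
    )
}

fn main() {
    // "return SECCOMP_RET_ALLOW" as a single packed instruction.
    let word = pack_insn(BPF_RET_K, 0, 0, SECCOMP_RET_ALLOW);
    println!("{:#018x}", word);
    println!("{:?}", unpack_insn(word));

    // One u64 per instruction, so the slice length is the instruction
    // count that goes into the program's 16-bit `len` field.
    let program = [word];
    println!("len = {}", program.len() as u16);
}
```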

The zero flag is intentional. SECCOMP_FILTER_FLAG_TSYNC (1UL << 0) would propagate the filter to all other threads in the process simultaneously; Firecracker does not use it. Instead it installs each thread's filter from within that thread, at the moment that thread reaches its operational steady state. The three BPF programs are different, so thread-sync installation would apply the wrong program to the wrong threads.

The x86-64 Default Allowlists

vCPU Thread: The Tightest List

The vCPU thread allows approximately 24 distinct syscalls. Its ioctl entry is not a single rule but 19 separate rules, each fixing args[1] to a specific request value. Selected entries from the compiled policy:

| Syscall | Arg condition | Value | Comment |
| --- | --- | --- | --- |
| ioctl | args[1] dword eq | 44672 | KVM_RUN |
| ioctl | args[1] dword eq | 44707 | KVM_GET_TSC_KHZ |
| ioctl | args[1] dword eq | 44717 | KVM_KVMCLOCK_CTRL |
| ioctl | args[1] dword eq | 2147790488 | KVM_GET_MP_STATE |
| ioctl | args[1] dword eq | 2151722655 | KVM_GET_VCPU_EVENTS |
| ioctl | args[1] dword eq | 3221794440 | KVM_GET_MSRS |
| ioctl | args[1] dword eq | 3221794449 | KVM_GET_CPUID2 |
| mmap | name-only | | |
| futex | name-only | | |
| timerfd_settime | name-only | | |

The vCPU allowlist does not include socket, connect, accept4, bind, fork, clone, execve, mount, or any io_uring_* call. An ioctl with any request value outside the 19-entry list hits default_action: trap. The thread can run a guest, field exits, and adjust timers. It cannot open a network connection.
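The decimal request values in the policy are not arbitrary: each is the Linux ioctl encoding of a direction, an argument size, the KVM type byte 0xAE, and a command number. The sketch below recomputes the listed values using the generic encoding. The ioc function is an illustrative stand-in for the kernel's _IO, _IOR, and _IOWR macros, and the argument sizes (4, 64, and 8 bytes) are assumptions about the x86-64 struct sizes chosen to be consistent with the values in the policy.

```rust
const KVMIO: u32 = 0xAE; // KVM's ioctl type byte

const IOC_NONE: u32 = 0;
const IOC_WRITE: u32 = 1;
const IOC_READ: u32 = 2;

/// Generic Linux ioctl encoding: dir[31:30] size[29:16] type[15:8] nr[7:0].
const fn ioc(dir: u32, ty: u32, nr: u32, size: u32) -> u32 {
    (dir << 30) | (size << 16) | (ty << 8) | nr
}

const KVM_RUN: u32 = ioc(IOC_NONE, KVMIO, 0x80, 0);
const KVM_GET_TSC_KHZ: u32 = ioc(IOC_NONE, KVMIO, 0xa3, 0);
const KVM_KVMCLOCK_CTRL: u32 = ioc(IOC_NONE, KVMIO, 0xad, 0);
// Sizes below are assumed x86-64 struct sizes (see lead-in).
const KVM_GET_MP_STATE: u32 = ioc(IOC_READ, KVMIO, 0x98, 4);
const KVM_GET_VCPU_EVENTS: u32 = ioc(IOC_READ, KVMIO, 0x9f, 64);
const KVM_GET_MSRS: u32 = ioc(IOC_READ | IOC_WRITE, KVMIO, 0x88, 8);
const KVM_GET_CPUID2: u32 = ioc(IOC_READ | IOC_WRITE, KVMIO, 0x91, 8);

fn main() {
    let table = [
        ("KVM_RUN", KVM_RUN),
        ("KVM_GET_TSC_KHZ", KVM_GET_TSC_KHZ),
        ("KVM_KVMCLOCK_CTRL", KVM_KVMCLOCK_CTRL),
        ("KVM_GET_MP_STATE", KVM_GET_MP_STATE),
        ("KVM_GET_VCPU_EVENTS", KVM_GET_VCPU_EVENTS),
        ("KVM_GET_MSRS", KVM_GET_MSRS),
        ("KVM_GET_CPUID2", KVM_GET_CPUID2),
    ];
    for (name, value) in table {
        println!("{:<20} {:>10} {:#010x}", name, value, value);
    }
}
```

Every value fits in 32 bits, which is why the policy declares these conditions as "dword".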

API Thread: Networking Without KVM

The API thread allows approximately 26 syscalls. Its purpose is to receive and respond to the Firecracker management API over a Unix socket, so it needs accept4, recvfrom, recvmsg, sendto, socket, epoll_pwait, and getrandom. It does not include KVM_RUN or any KVM VM-level ioctl. A compromised API thread can read and write on the management socket; it cannot trigger guest execution.

VMM Thread: The Broadest List

The VMM thread manages virtio device I/O, KVM VM setup, and the tap network interface, so its allowlist is the most permissive of the three: approximately 44 distinct syscalls. It includes io_uring_enter, io_uring_setup, and io_uring_register for io_uring-backed block devices; connect, socket, sendmsg, and recvmsg for the tap interface; and 14 distinct KVM ioctl request values covering VM-level operations such as KVM_GET_DIRTY_LOG, KVM_SET_USER_MEMORY_REGION, and KVM_IRQFD. Even at 44 calls, this is a fraction of the full Linux syscall table. The thread that orchestrates everything cannot, for instance, create a new process or change its own UID.

Name-Only vs. Argument-Level Filtering

The distinction between a name-only rule and an argument-level rule is not just a matter of precision — it is the difference between containing a class of calls and containing a specific operation.

A name-only read rule allows read on any file descriptor for any number of bytes. That is acceptable for a thread that should be able to read its event pipe, its timer fd, and its tap device fd: the policy author has decided that any read from this thread is legitimate. A name-only ioctl rule would be catastrophic: it would allow KVM_RUN and also TIOCSTI (inject bytes into a terminal) or any other driver-specific request the kernel understands.

The vCPU's ioctl entry is therefore 19 argument-level rules. Each one allows exactly one request value. Any ioctl whose second argument does not appear in that list — regardless of how the call is framed or what file descriptor it targets — hits default_action: trap. The argument filter makes the syscall permission specific enough to be meaningful.

The BPF cost of argument-level filtering is real. Each condition compiles to multiple BPF instructions: a load from the appropriate offset in seccomp_data, a masking step (for dword conditions), and a comparison. Nineteen conditions for ioctl alone is a substantial BPF program, and the total must stay under 4,096 instructions per filter. The 65% BPF size reduction from the libseccomp migration in v1.11.0 is what makes the current depth of argument filtering practical within that budget.

The --basic flag on seccompiler-bin, which stripped all argument conditions and produced name-only allowlists, was deprecated and removed. There is no longer a "basic mode" that silently downgrades policy precision; every compiled filter uses the same engine, and every argument condition in the JSON reaches the BPF program.

How Seccomp Shrinks the Host Attack Surface

Hardware virtualization is Firecracker's first containment layer. VT-x and SVM enforce the boundary between guest ring 0 and host ring 0 in silicon; a guest running in VMX non-root mode cannot issue a host syscall directly. But that boundary has bugs. KVM has had privilege escalation CVEs, and the entire history of hypervisor security suggests that "assume the hardware boundary is unbreakable" is not a viable threat model at scale.

The Jailer (firecracker-jailer) is a separate binary that wraps the Firecracker process: it drops privileges, sets up a PID namespace, configures cgroups, and calls pivot_root before exec-ing firecracker. What the Jailer does not do is install any seccomp filter on the Firecracker process. Seccomp installation is Firecracker's own responsibility, performed per-thread from within the process.

Seccomp is the second containment layer, sitting between the hardware boundary and the Jailer's namespace isolation:

flowchart TD
    Guest["Guest (VMX non-root)"]
    KVM["KVM / VT-x boundary\n(first layer)"]
    Seccomp["seccomp-BPF per-thread allowlists\n(second layer)"]
    Jailer["Jailer: PID ns, cgroups, pivot_root\n(third layer)"]
    Host["Host kernel"]
    Guest --> KVM
    KVM --> Seccomp
    Seccomp --> Jailer
    Jailer --> Host

None of the three layers depends on the others being intact. If KVM's boundary holds, the guest never reaches seccomp. If KVM's boundary is breached, seccomp limits the thread to its allowlist. If both are breached, the Jailer's cgroup limits, restricted namespace, and dropped UID constrain what the process can reach on the host. Each layer independently limits the blast radius of a compromise at its own level.

Concretely: a vCPU thread that has been taken over by a guest escape and is now running attacker code on the host cannot open a socket, cannot fork, cannot exec a new binary, cannot load a kernel module, cannot mount a filesystem, and cannot issue a KVM VM-level ioctl that only the VMM thread is permitted to issue. What it can do is call KVM_RUN again — which puts it back in the guest.

The argument-level ioctl filter is the sharpest edge of this. Without it, a vCPU thread that could call ioctl with an arbitrary request argument could issue KVM_CREATE_VM (a VM-level ioctl), KVM_SET_USER_MEMORY_REGION (which maps host memory into the guest), or TIOCSTI. With it, the only ioctls available are the 19 vCPU-specific operations in the allowlist.

Two Distinct Seccomp Layers

Firecracker's seccomp filters are host-side: they restrict syscalls made by the Firecracker process against the host kernel. When Firecracker hosts containers via firecracker-containerd, a second independent seccomp layer operates inside the guest:

| Layer | Filters | Installed by | Evaluated by |
| --- | --- | --- | --- |
| Firecracker host seccomp | Host syscalls from the Firecracker process | Firecracker itself | Host kernel |
| OCI/runc seccomp | Guest syscalls from container workloads | runc inside the guest | Guest kernel |

Both layers use the identical kernel mechanism: seccomp(SECCOMP_SET_MODE_FILTER, flags, &sock_fprog) on x86-64. The OCI runtime spec's seccomp field maps SCMP_ACT_* action names and SCMP_CMP_* operators to the same kernel constants described in this chapter; the underlying BPF evaluation is identical. The two layers do not interact: the host kernel evaluates the Firecracker filter, and the guest kernel evaluates the container filter. A syscall made inside the guest is handled entirely by the guest kernel and never appears as a host syscall, so the host filter never sees it, whether or not the host filter would have allowed a call of the same name.

The containerd book covers the OCI/runc seccomp path in detail.

Custom Filters and the --no-seccomp Flag

Note: --no-seccomp disables all BPF filtering. Firecracker's documentation warns that it should not be used in production. A Firecracker process running without seccomp has no second containment layer and relies on KVM hardware isolation and the Jailer alone. Running with --no-seccomp on a host that serves untrusted workloads removes a significant portion of the defense-in-depth design.

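Two CLI flags control filter selection: --seccomp-filter, which takes the path to a custom compiled filter file and uses it in place of the embedded default, and --no-seccomp, which disables filtering entirely.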

seccompiler-bin also accepts --split-output (writes one <thread>.bpf file per thread, used in the test suite) and --target-arch (x86_64 | aarch64).

A Brief History

Firecracker's seccomp policy has changed only a handful of times since the initial open-source release:

| Version | Change |
| --- | --- |
| v0.13.0 | Changed the default from seccomp-level 0 (no filtering) to level 2 (full argument-level filtering) for all threads. |
| v0.15.1 | Added madvise to the VMM allowlist after musl's allocator called it under memory pressure, causing random VM terminations. |
| v1.11.0 | Migrated the compiler backend from hand-written Rust BPF code generation to libseccomp (PR #4926, merged 2025-01-16); added the empty BPF slice path for debug builds. |
| PR #5298 | Replaced HashMap with BTreeMap in the compilation pipeline, fixing a reproducibility bug where identical inputs produced different BPF bytecode across builds. |

The v0.15.1 madvise incident is a useful calibration. A tight seccomp allowlist is not a set-and-forget policy; it is a contract between the policy and the implementation. When the implementation changes — a new dependency, a new libc version, a new code path under memory pressure — the policy must change with it. The cost of getting this wrong is an unexplained VM termination, not a helpful error message. That is a steep debugging tax, and it is the price of running with a minimal allowlist.

Sources And Further Reading