Chapter 19: Seccomp In Firecracker
Imagine a guest has found a bug in KVM. The hypervisor boundary has failed, and the
guest is now executing arbitrary code inside a vCPU thread on the host. That
thread is a regular Linux thread. Without further containment it can call any
syscall the process is permitted to call: socket, execve, fork, mount.
The kernel will evaluate each call on the host, in the context of the
Firecracker process, with the firewall rules and network interfaces the host has
configured. A guest that can issue syscalls on the host is a guest that has
escaped.
This is the problem seccomp solves in Firecracker. Not by blocking the guest from breaking the KVM barrier — that is VT-x and SVM's job — but by making the prize on the other side of the barrier worthless. If a compromised vCPU thread can only call the twenty-four syscalls it actually needs, the host's attack surface from that thread shrinks to those twenty-four calls and nothing else.
Firecracker's design document makes this explicit: all vCPU threads are treated as running malicious code from the moment they start. The seccomp filter is not a defense of last resort. It is the expected containment path for a thread that has been taken over by a guest.
The Kernel Mechanism: seccomp(2) BPF
The tool Firecracker uses is seccomp(2) in filter mode
(SECCOMP_SET_MODE_FILTER), available since Linux 3.17 as syscall number 317
on x86-64. A caller passes a classic BPF (cBPF) program to the kernel via this
syscall; from that point on, every syscall entry for that thread runs the BPF
program first. The program's return value decides the call's fate before the
kernel dispatches to the syscall's handler.
The BPF program receives a single argument: a pointer to a struct seccomp_data
in the kernel's address space:
struct seccomp_data {
    int nr;                     /* syscall number */
    __u32 arch;                 /* AUDIT_ARCH_* value */
    __u64 instruction_pointer;  /* RIP at time of syscall */
    __u64 args[6];              /* up to 6 syscall arguments */
};
The arch field is not cosmetic. On x86-64, a 32-bit i386 compat call with
nr == 0 resolves to restart_syscall, whereas the same nr on the x86-64
native ABI resolves to read. A filter that checks nr without first
confirming arch == AUDIT_ARCH_X86_64 can be bypassed by routing a dangerous
call through the wrong ABI. Firecracker's compiled filters always check the
architecture first.
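To make the architecture check concrete, the sketch below hand-assembles a seven-instruction classic BPF program and runs it through a toy evaluator. It is an illustration only: insn, read_only_filter, and run are hypothetical helpers, the allow-only-read policy and the choice of KILL_PROCESS for a foreign ABI are assumptions made for the example, and the bytecode Firecracker actually ships is generated by its compiler and will differ. The u64 packing assumes a little-endian host.

```rust
// Classic BPF opcodes (values from <linux/bpf_common.h>).
const LD_W_ABS: u16 = 0x20; // BPF_LD | BPF_W | BPF_ABS
const JEQ_K: u16 = 0x15; // BPF_JMP | BPF_JEQ | BPF_K
const RET_K: u16 = 0x06; // BPF_RET | BPF_K

const AUDIT_ARCH_X86_64: u32 = 0xc000_003e;
const RET_ALLOW: u32 = 0x7fff_0000;
const RET_TRAP: u32 = 0x0003_0000;
const RET_KILL_PROCESS: u32 = 0x8000_0000;

// Byte offsets of the first two fields of struct seccomp_data.
const OFF_NR: u32 = 0;
const OFF_ARCH: u32 = 4;

/// Pack one sock_filter {code, jt, jf, k} into a u64 (little-endian layout).
fn insn(code: u16, jt: u8, jf: u8, k: u32) -> u64 {
    (code as u64) | ((jt as u64) << 16) | ((jf as u64) << 24) | ((k as u64) << 32)
}

/// Hypothetical policy: allow only read (nr 0) on the native x86-64 ABI.
fn read_only_filter() -> Vec<u64> {
    vec![
        insn(LD_W_ABS, 0, 0, OFF_ARCH),       // 0: A = seccomp_data.arch
        insn(JEQ_K, 1, 0, AUDIT_ARCH_X86_64), // 1: native ABI? skip the kill
        insn(RET_K, 0, 0, RET_KILL_PROCESS),  // 2: foreign ABI, stop here
        insn(LD_W_ABS, 0, 0, OFF_NR),         // 3: A = seccomp_data.nr
        insn(JEQ_K, 0, 1, 0),                 // 4: nr == 0 falls through
        insn(RET_K, 0, 0, RET_ALLOW),         // 5: matched rule
        insn(RET_K, 0, 0, RET_TRAP),          // 6: default action
    ]
}

/// Toy evaluator for the three opcodes above; not how the kernel runs cBPF.
fn run(prog: &[u64], nr: u32, arch: u32) -> u32 {
    let mut acc: u32 = 0;
    let mut pc: usize = 0;
    loop {
        let word = prog[pc];
        let code = word as u16;
        let jt = (word >> 16) as u8;
        let jf = (word >> 24) as u8;
        let k = (word >> 32) as u32;
        pc += 1;
        match code {
            LD_W_ABS => {
                acc = match k {
                    OFF_NR => nr,
                    OFF_ARCH => arch,
                    _ => 0,
                }
            }
            // Jump offsets are relative to the instruction after the jump.
            JEQ_K => pc += if acc == k { jt as usize } else { jf as usize },
            RET_K => return k,
            _ => return RET_KILL_PROCESS,
        }
    }
}
```

With this filter, nr 0 on the native ABI is allowed, while the same nr arriving with the i386 arch value (0x40000003) is rejected before the syscall number is ever inspected.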
The return value from the BPF program is one of eight actions, in priority order from highest to lowest:
| Action | Hex value | Effect |
|---|---|---|
| SECCOMP_RET_KILL_PROCESS | 0x80000000 | Kill entire process (Linux 4.14+) |
| SECCOMP_RET_KILL_THREAD | 0x00000000 | Kill calling thread |
| SECCOMP_RET_TRAP | 0x00030000 | Deliver SIGSYS |
| SECCOMP_RET_ERRNO | 0x00050000 | Return errno (low 16 bits) |
| SECCOMP_RET_USER_NOTIF | 0x7fc00000 | Notify userspace fd (Linux 5.0+) |
| SECCOMP_RET_TRACE | 0x7ff00000 | Notify ptrace tracer |
| SECCOMP_RET_LOG | 0x7ffc0000 | Allow and log (Linux 4.14+) |
| SECCOMP_RET_ALLOW | 0x7fff0000 | Permit syscall |
When a thread has multiple chained filters (because seccomp(2) was called
more than once) the kernel runs all of them and uses the action with the
highest priority. The kernel enforces a hard cap of 4,096 BPF instructions per
individual program (BPF_MAXINSNS in the kernel; the seccompiler code mirrors it
as BPF_MAX_LEN) and 32,768 instructions across all chained filters for a single
thread (MAX_INSNS_PER_PATH), where each attached filter counts as its own length
plus a four-instruction penalty.
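The priority order has a compact encoding: the kernel masks off the action bits, reinterprets them as a signed 32-bit value, and keeps the smallest one. The sketch below models that rule. The function names are hypothetical, and the fold is a simplification of the kernel's walk over the filter chain.

```rust
const SECCOMP_RET_ACTION_FULL: u32 = 0xffff_0000;
const SECCOMP_RET_DATA: u32 = 0x0000_ffff;
const SECCOMP_RET_ALLOW: u32 = 0x7fff_0000;

/// Action bits reinterpreted as signed: a lower value means a higher priority.
/// KILL_PROCESS (0x80000000) becomes i32::MIN, so it always wins.
fn action_rank(ret: u32) -> i32 {
    (ret & SECCOMP_RET_ACTION_FULL) as i32
}

/// Combine the return values of every filter attached to a thread.
/// Starting from ALLOW means an empty chain permits the call.
fn effective_action(results: &[u32]) -> u32 {
    results.iter().copied().fold(SECCOMP_RET_ALLOW, |best, cur| {
        if action_rank(cur) < action_rank(best) {
            cur
        } else {
            best
        }
    })
}

/// The low 16 bits carry action-specific data, such as the errno for RET_ERRNO.
fn ret_data(ret: u32) -> u32 {
    ret & SECCOMP_RET_DATA
}
```

For example, if one filter returns ALLOW, another returns ERRNO with data 13, and a third returns TRAP, the thread receives SIGSYS, because TRAP has the lowest signed action value of the three.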
Before installing a filter, the calling thread must either hold CAP_SYS_ADMIN
in its user namespace or have called prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0).
The no-new-privs flag prevents the thread and any process it exec-s from gaining
privileges, so there is no privilege-escalation path the filter needs to worry
about. Firecracker always takes this path; it never requires CAP_SYS_ADMIN to
install its filters.
Three Threads, Three Filters
Firecracker runs three categories of threads: the VMM thread, the API thread, and the vCPU threads. Each has a fundamentally different job and a fundamentally different syscall requirement.
The filter key for each thread — "vmm", "api", "vcpu" — maps to a
separate BPF program compiled from the policy file at
resources/seccomp/x86_64-unknown-linux-musl.json. None of the three programs
is a subset of another; each is independently constructed to match what that
thread actually calls.
The timing of filter installation matters. The VMM's filter goes in right before
the event loop starts. The API thread's filter is installed right before the
HTTP listener starts. The vCPU filter has the tightest constraint: the exact
call chain is start_threaded() spawning a thread that calls Vcpu::run(filter),
which calls apply_filter(filter) before StateMachine::run() and therefore
before any KVM_RUN ioctl reaches the kernel. The filter is in place before the
first guest instruction executes.
If apply_filter fails on a vCPU thread, Firecracker panics immediately. There
is no fallback and no soft failure. A vCPU that cannot install its seccomp filter
is a vCPU that must not run.
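The ordering constraint can be sketched with plain std threads. Everything here is a stand-in: apply_filter_stub only validates its input and installs nothing, start_vcpu is a hypothetical name, and the channel exists solely to record the order of events. Firecracker's real call chain is the one described above.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for apply_filter: validates the length, installs nothing.
fn apply_filter_stub(bpf_filter: &[u64]) -> Result<(), String> {
    if bpf_filter.len() > 4096 {
        return Err("filter exceeds 4,096 instructions".to_string());
    }
    Ok(())
}

/// Spawn a thread that installs its filter before doing anything else,
/// and return the events it reported, in order.
fn start_vcpu(filter: Vec<u64>) -> Vec<&'static str> {
    let (tx, rx) = mpsc::channel();
    let handle = thread::spawn(move || {
        // Install from inside the new thread; a failure panics the thread
        // rather than letting it continue unfiltered.
        apply_filter_stub(&filter).expect("failed to install vCPU seccomp filter");
        tx.send("filter installed").unwrap();
        // Only after this point would a run loop issue its first KVM_RUN.
        tx.send("run loop entered").unwrap();
    });
    handle.join().expect("vCPU thread panicked");
    // The sender was dropped when the thread finished, so this terminates.
    rx.iter().collect()
}
```

The point of the shape is that there is no code path from thread start to the run loop that bypasses the filter call.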
From JSON Policy to Embedded BPF
The compilation pipeline runs at build time, not at Firecracker startup. The
policy file is resources/seccomp/x86_64-unknown-linux-musl.json; the compiled
output is an embedded binary blob in the Firecracker executable.
The JSON root is an object keyed on thread name. Each value is a filter
configuration with three fields: default_action, which applies when no rule
matches; filter_action, which applies when a rule matches; and filter, the
array of allowed syscall rules. All three filters in the x86-64 default policy
use "default_action": "trap" and "filter_action": "allow". An unmatched
syscall delivers SIGSYS to the Firecracker process; its signal handler catches
the signal, logs it, and converts it to a controlled shutdown. That is more
useful for post-incident analysis than a kill action, which ends the thread or
process before Firecracker can record anything itself.
A rule entry is a JSON object with a "syscall" name and an optional "args"
array:
{
  "vcpu": {
    "default_action": "trap",
    "filter_action": "allow",
    "filter": [
      { "syscall": "read" },
      {
        "syscall": "ioctl",
        "args": [
          { "index": 1, "type": "dword", "op": "eq", "val": 44672,
            "comment": "KVM_RUN" }
        ]
      }
    ]
  }
}
A rule without "args" is a name-only match: any invocation of that syscall
passes. A rule with "args" is an argument-level match: every condition in
the array must be true simultaneously (AND semantics), and multiple rules for
the same syscall are ORed.
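The matching semantics fit in a few lines. The model below is a sketch, not the compiler's data structures: Rule, Cond, Op, and allowed are hypothetical names, syscalls are matched by name rather than number, and the dword/qword distinction is ignored.

```rust
#[allow(dead_code)]
enum Op {
    Eq,
    Ne,
    Ge,
    Gt,
    Le,
    Lt,
}

struct Cond {
    index: usize, // position in args[6]
    op: Op,
    val: u64,
}

/// One entry of the "filter" array; an empty args list models a name-only rule.
struct Rule {
    syscall: &'static str,
    args: Vec<Cond>,
}

fn cond_holds(c: &Cond, args: &[u64; 6]) -> bool {
    let a = args[c.index];
    match c.op {
        Op::Eq => a == c.val,
        Op::Ne => a != c.val,
        Op::Ge => a >= c.val,
        Op::Gt => a > c.val,
        Op::Le => a <= c.val,
        Op::Lt => a < c.val,
    }
}

/// A call is allowed if ANY rule for that syscall has ALL of its conditions true.
/// all() on an empty list is true, so a name-only rule matches every invocation.
fn allowed(rules: &[Rule], name: &str, args: &[u64; 6]) -> bool {
    rules
        .iter()
        .filter(|r| r.syscall == name)
        .any(|r| r.args.iter().all(|c| cond_holds(c, args)))
}
```

With one name-only read rule and two ioctl rules fixing args[1] to 44672 and 44707, any read passes, an ioctl with either request value passes, and an ioctl with 21522 (TIOCSTI) falls through to the default action.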
Each condition object has five fields:
| Field | Type | Values |
|---|---|---|
| index | integer | 0–5, the position in args[6] |
| type | string | "dword" (32-bit) or "qword" (64-bit) |
| op | string | "eq", "ge", "gt", "lt", "le", "ne", "masked_eq" |
| val | integer | base-10 decimal |
| comment | string | annotation only, ignored by compiler |
The musl ioctl Quirk
When "type": "dword" is used with an "eq" comparison, the seccompiler does
not emit a plain 64-bit equality check. It emits a SCMP_CMP_MASKED_EQ with
the mask 0x00000000FFFFFFFF, which zeroes the upper 32 bits before comparing.
The reason: musl's ioctl wrapper leaves garbage in the upper 32 bits of the
request argument. A plain 64-bit eq on KVM_RUN (decimal 44672,
0xAE80) would reject legitimate calls whose upper bits happen to be non-zero.
This is not a theoretical concern: it causes spurious SIGSYS signals in
practice. The masking makes the filter compare the same low 32 bits that the
kernel's ioctl handler uses, whatever the upper half of seccomp_data.args[1]
contains.
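The effect of the mask is easy to check by hand. In this sketch the upper-half pattern 0xDEADBEEF is an arbitrary stand-in for whatever garbage the wrapper leaves behind, and the two helper functions are hypothetical names for the two comparison styles.

```rust
const DWORD_MASK: u64 = 0x0000_0000_FFFF_FFFF;
const KVM_RUN: u64 = 0xAE80; // decimal 44672

/// A plain 64-bit comparison: every bit of the register must match.
fn plain_eq(arg: u64, val: u64) -> bool {
    arg == val
}

/// Masked comparison: clear the bits outside the mask, then compare.
fn masked_eq(arg: u64, mask: u64, val: u64) -> bool {
    (arg & mask) == val
}
```

A register holding 0xDEADBEEF0000AE80 fails plain_eq against KVM_RUN but passes masked_eq with DWORD_MASK, while 0xAE81 fails both, so the mask tolerates garbage in the upper half without widening the set of accepted requests.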
The Compilation Steps
1. Parse x86_64-unknown-linux-musl.json via serde_json into a BTreeMap<String, FilterConfig>. The use of BTreeMap rather than HashMap is deliberate: HashMap randomizes iteration order across builds, producing different BPF bytecode from identical inputs. PR #5298 fixed this by switching to BTreeMap, which guarantees deterministic key order and reproducible binaries.
2. For each thread key, call seccomp_init(default_action) to create a libseccomp context, then seccomp_arch_add() with the target architecture constant: SCMP_ARCH_X86_64 = 0xc000003e for x86-64 or SCMP_ARCH_AARCH64 = 0xc00000b7 for ARM.
3. Resolve each syscall name to a number via seccomp_syscall_resolve_name(), then add it with seccomp_rule_add() for name-only rules or seccomp_rule_add_array() for rules with argument conditions.
4. Export the BPF bytecode via seccomp_export_bpf() to an anonymous memfd and read it back as Vec<u64>.
5. Serialize the full BTreeMap<String, Vec<u64>> with the bitcode crate and write the result to disk.
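The determinism argument in the first step can be demonstrated with the standard library alone. The helpers build and thread_order are hypothetical, and the empty vectors stand in for compiled programs.

```rust
use std::collections::BTreeMap;

/// Build a thread-name-to-program map from keys in an arbitrary insertion order.
fn build(insert_order: &[&str]) -> BTreeMap<String, Vec<u64>> {
    insert_order
        .iter()
        .map(|k| (k.to_string(), Vec::new()))
        .collect()
}

/// The order in which a serializer walking the map would visit the threads.
fn thread_order(filters: &BTreeMap<String, Vec<u64>>) -> Vec<&str> {
    filters.keys().map(String::as_str).collect()
}
```

Whatever order the keys are inserted in, a BTreeMap iterates them sorted (api, vcpu, vmm), so a serializer that walks the map emits the same bytes on every build. A HashMap gives no such guarantee.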
Firecracker v1.11.0 (PR #4926, merged 2025-01-16) migrated the compiler backend from a hand-written Rust BPF code generator to libseccomp. The migration reduced BPF program size by approximately 65%, which matters because the filter runs on every syscall entry: a shorter program means fewer instructions evaluated each time a vCPU thread enters the kernel.
Runtime Installation
At startup the Firecracker process reads the embedded bitcode blob and
deserializes it. The deserialization is capped at DESERIALIZATION_BYTES_LIMIT =
100,000 bytes to prevent an oversized custom filter file from triggering an
unbounded allocation. The result is a map of thread name to Vec<u64> BPF
programs.
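One common way to enforce such a cap is to bound the reader before any decoding happens. The sketch below shows that pattern with std::io only. It is not Firecracker's implementation: read_bounded is a hypothetical helper, and treating the limit as inclusive is an assumption.

```rust
use std::io::{self, Read};

const DESERIALIZATION_BYTES_LIMIT: u64 = 100_000;

/// Read the whole input, but refuse anything longer than `limit` bytes.
fn read_bounded<R: Read>(reader: R, limit: u64) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    // Allow one byte past the limit so an oversized input is detectable
    // without ever buffering more than limit + 1 bytes.
    reader.take(limit + 1).read_to_end(&mut buf)?;
    if buf.len() as u64 > limit {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "filter blob exceeds size limit",
        ));
    }
    Ok(buf)
}
```

Because the bound is applied to the byte stream itself, a hostile length prefix inside the blob cannot drive an allocation larger than the cap.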
Each thread's filter is installed by apply_filter(bpf_filter: &[u64]) in
src/vmm/src/seccomp.rs, which performs exactly four operations:
1. If bpf_filter is empty, return Ok(()). This path is used for debug builds and the --no-seccomp flag.
2. Call prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0).
3. Construct a SockFprog with len = bpf_filter.len() as u16 and filter = bpf_filter.as_ptr().
4. Call syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &sock_fprog). Flags are zero: no SECCOMP_FILTER_FLAG_TSYNC, no LOG, no NEW_LISTENER.
BPF instructions are stored as u64, not as sock_filter. A BPF instruction is
8 bytes with 4-byte alignment; u64 satisfies both constraints without requiring
a custom struct at the call site. The kernel accepts the pointer regardless of
which type the userspace caller uses.
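The size and alignment claim can be checked directly. The two structs below are illustrative mirrors of the kernel's sock_filter and sock_fprog layouts, declared here only for the comparison; make_fprog is a hypothetical helper and no syscall is issued.

```rust
use std::mem::{align_of, size_of};

/// Mirror of the kernel's struct sock_filter, for layout comparison only.
#[allow(dead_code)]
#[repr(C)]
struct SockFilter {
    code: u16,
    jt: u8,
    jf: u8,
    k: u32,
}

/// Mirror of struct sock_fprog, with the instruction pointer typed as *const u64.
#[repr(C)]
struct SockFprog {
    len: u16,
    filter: *const u64,
}

/// Describe a slice of u64-encoded instructions to the kernel.
/// len counts instructions, not bytes; the 4,096-instruction cap keeps it within u16.
fn make_fprog(bpf_filter: &[u64]) -> SockFprog {
    SockFprog {
        len: bpf_filter.len() as u16,
        filter: bpf_filter.as_ptr(),
    }
}

/// True when a u64 can stand in for one sock_filter in memory.
fn u64_is_layout_compatible() -> bool {
    size_of::<u64>() == size_of::<SockFilter>() && align_of::<u64>() >= align_of::<SockFilter>()
}
```

The returned SockFprog borrows the slice's memory through a raw pointer, so the slice has to outlive the seccomp call that consumes it.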
The zero flag is intentional. SECCOMP_FILTER_FLAG_TSYNC (1UL << 0) would
propagate the filter to all other threads in the process simultaneously;
Firecracker does not use it. Instead it installs each thread's filter from
within that thread, at the moment that thread reaches its operational steady
state. The three BPF programs are different, so thread-sync installation would
apply the wrong program to the wrong threads.
The x86-64 Default Allowlists
vCPU Thread: The Tightest List
The vCPU thread allows approximately 24 distinct syscalls. Its ioctl entry is
not a single rule but 19 separate rules, each fixing args[1] to a specific
request value. Selected entries from the compiled policy:
| Syscall | Arg condition | Value | Comment |
|---|---|---|---|
| ioctl | args[1] dword eq | 44672 | KVM_RUN |
| ioctl | args[1] dword eq | 44707 | KVM_GET_TSC_KHZ |
| ioctl | args[1] dword eq | 44717 | KVM_KVMCLOCK_CTRL |
| ioctl | args[1] dword eq | 2147790488 | KVM_GET_MP_STATE |
| ioctl | args[1] dword eq | 2151722655 | KVM_GET_VCPU_EVENTS |
| ioctl | args[1] dword eq | 3221794440 | KVM_GET_MSRS |
| ioctl | args[1] dword eq | 3221794449 | KVM_GET_CPUID2 |
| mmap | name-only | n/a | n/a |
| futex | name-only | n/a | n/a |
| timerfd_settime | name-only | n/a | n/a |
The vCPU allowlist does not include socket, connect, accept4, bind,
fork, clone, execve, mount, or any io_uring_* call. An ioctl with
any request value outside the 19-entry list hits default_action: trap. The
thread can run a guest, field exits, and adjust timers. It cannot open a network
connection.
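The decimal request values in the vCPU table are not arbitrary: each follows from the kernel's _IOC encoding of direction, argument size, type byte, and command number. The sketch below recomputes them. The ioc helper is a hypothetical name, the bit layout is the one used on x86-64 and aarch64, and the struct sizes (4, 64, 8, 8 bytes) are assumptions about the current KVM UAPI headers.

```rust
const KVMIO: u32 = 0xAE; // KVM's ioctl type byte
const IOC_NONE: u32 = 0;
const IOC_WRITE: u32 = 1;
const IOC_READ: u32 = 2;

/// _IOC(dir, type, nr, size): dir in bits 30-31, size in bits 16-29,
/// type in bits 8-15, nr in bits 0-7.
const fn ioc(dir: u32, ty: u32, nr: u32, size: u32) -> u32 {
    (dir << 30) | (size << 16) | (ty << 8) | nr
}

// _IO requests carry no argument struct.
const KVM_RUN: u32 = ioc(IOC_NONE, KVMIO, 0x80, 0);
const KVM_GET_TSC_KHZ: u32 = ioc(IOC_NONE, KVMIO, 0xa3, 0);
const KVM_KVMCLOCK_CTRL: u32 = ioc(IOC_NONE, KVMIO, 0xad, 0);

// _IOR requests: the kernel writes a struct back to userspace.
const KVM_GET_MP_STATE: u32 = ioc(IOC_READ, KVMIO, 0x98, 4); // struct kvm_mp_state
const KVM_GET_VCPU_EVENTS: u32 = ioc(IOC_READ, KVMIO, 0x9f, 64); // struct kvm_vcpu_events

// _IOWR requests: the struct header is passed in both directions.
const KVM_GET_MSRS: u32 = ioc(IOC_READ | IOC_WRITE, KVMIO, 0x88, 8); // struct kvm_msrs
const KVM_GET_CPUID2: u32 = ioc(IOC_READ | IOC_WRITE, KVMIO, 0x91, 8); // struct kvm_cpuid2
```

Every constant carries the 0xAE type byte, so a request number from any other driver differs in at least those bits and cannot match one of these rules.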
API Thread: Networking Without KVM
The API thread allows approximately 26 syscalls. Its purpose is to receive and
respond to the Firecracker management API over a Unix socket, so it needs
accept4, recvfrom, recvmsg, sendto, socket, epoll_pwait, and
getrandom. It does not include KVM_RUN or any KVM VM-level ioctl. A
compromised API thread can read and write on the management socket; it cannot
trigger guest execution.
VMM Thread: The Broadest List
The VMM thread manages virtio device I/O, KVM VM setup, and the tap network
interface, so its allowlist is the most permissive of the three: approximately
44 distinct syscalls. It includes io_uring_enter, io_uring_setup, and
io_uring_register for io_uring-backed block devices; connect, socket,
sendmsg, and recvmsg for the tap interface; and 14 distinct KVM ioctl
request values covering VM-level operations such as KVM_GET_DIRTY_LOG,
KVM_SET_USER_MEMORY_REGION, and KVM_IRQFD. Even at 44 calls, this is a
fraction of the full Linux syscall table. The thread that orchestrates everything
cannot, for instance, create a new process or change the UID it runs under.
Name-Only vs. Argument-Level Filtering
The distinction between a name-only rule and an argument-level rule is not just a matter of precision — it is the difference between containing a class of calls and containing a specific operation.
A name-only read rule allows read on any file descriptor for any number of
bytes. That is acceptable for a thread that should be able to read its event
pipe, its timer fd, and its tap device fd — the policy author has decided that
any read from this thread is legitimate. A name-only ioctl rule would be
catastrophic: it would allow KVM_RUN and also TIOCSTI (inject bytes into a
terminal's input queue) and every other driver-specific request the kernel
understands.
The vCPU's ioctl entry is therefore 19 argument-level rules. Each one allows
exactly one request value. Any ioctl whose second argument does not appear in
that list — regardless of how the call is framed or what file descriptor it
targets — hits default_action: trap. The argument filter makes the syscall
permission specific enough to be meaningful.
The BPF cost of argument-level filtering is real. Each condition compiles to
multiple BPF instructions: a load from the appropriate offset in seccomp_data,
a masking step (for dword conditions), and a comparison. Nineteen conditions
for ioctl alone is a substantial BPF program, and the total must stay under
4,096 instructions per filter. The 65% BPF size reduction from the libseccomp
migration in v1.11.0 is what makes the current depth of argument filtering
practical within that budget.
The --basic flag on seccompiler-bin, which stripped all argument conditions
and produced name-only allowlists, was deprecated and removed. There is no longer
a "basic mode" that silently downgrades policy precision; every compiled filter
uses the same engine, and every argument condition in the JSON reaches the BPF
program.
How Seccomp Shrinks the Host Attack Surface
Hardware virtualization is Firecracker's first containment layer. VT-x and SVM enforce the boundary between guest ring 0 and host ring 0 in silicon; a guest running in VMX non-root mode cannot issue a host syscall directly. But that boundary has bugs. KVM has had privilege escalation CVEs, and the entire history of hypervisor security suggests that "assume the hardware boundary is unbreakable" is not a viable threat model at scale.
The Jailer (firecracker-jailer) is a separate binary that wraps the Firecracker
process: it drops privileges, sets up a PID namespace, configures cgroups, and
calls pivot_root before exec-ing firecracker. What the Jailer does not do is
install any seccomp filter on the Firecracker process. Seccomp installation is
Firecracker's own responsibility, performed per-thread from within the process.
Seccomp is the second containment layer, sitting between the hardware boundary and the Jailer's namespace isolation.
None of the three layers depends on the others being intact. If KVM's boundary holds, the guest never reaches seccomp. If KVM's boundary is breached, seccomp limits the thread to its allowlist. If both are breached, the Jailer's cgroup limits, restricted namespace, and dropped UID constrain what the process can reach on the host. Each layer independently limits the blast radius of a compromise at its own level.
Concretely: a vCPU thread that has been taken over by a guest escape and is now
running attacker code on the host cannot open a socket, cannot fork, cannot exec
a new binary, cannot load a kernel module, cannot mount a filesystem, and cannot
issue a KVM VM-level ioctl that only the VMM thread is permitted to issue. What
it can do is call KVM_RUN again — which puts it back in the guest.
The argument-level ioctl filter is the sharpest edge of this. Without it,
a vCPU thread that could call ioctl with an arbitrary request argument could
issue KVM_CREATE_VM (a system-level ioctl on /dev/kvm), KVM_SET_USER_MEMORY_REGION
(a VM-level ioctl that maps host memory into the guest), or TIOCSTI. With it, the
only ioctls available are the 19 vCPU-specific operations in the allowlist.
Two Distinct Seccomp Layers
Firecracker's seccomp filters are host-side: they restrict syscalls made by the
Firecracker process against the host kernel. When Firecracker hosts containers
via firecracker-containerd, a second independent seccomp layer operates inside
the guest:
| Layer | Filters | Installed by | Evaluated by |
|---|---|---|---|
| Firecracker host seccomp | Host syscalls from the Firecracker process | Firecracker itself | Host kernel |
| OCI/runc seccomp | Guest syscalls from container workloads | runc inside the guest | Guest kernel |
Both layers use the identical kernel mechanism: seccomp(SECCOMP_SET_MODE_FILTER,
flags, &sock_fprog) on x86-64. The OCI runtime spec's seccomp field maps
SCMP_ACT_* action names and SCMP_CMP_* operators to the same kernel
constants described in this chapter; the underlying BPF evaluation is identical.
The two layers do not interact: the host kernel evaluates the Firecracker filter,
and the guest kernel evaluates the container filter. A call that passes the guest
filter but would fail the host filter never reaches the host — the guest kernel
handles it without issuing a hypercall.
The containerd book covers the OCI/runc seccomp path in detail.
Custom Filters and the --no-seccomp Flag
Note: --no-seccomp disables all BPF filtering. Firecracker's documentation states explicitly: "Do not use in production." A Firecracker process running without seccomp has no second containment layer and relies on KVM hardware isolation and the Jailer alone. Running with --no-seccomp on a host that serves untrusted workloads removes a significant portion of the defense-in-depth design.
Two CLI flags control filter selection:
- --no-seccomp: passes an empty bpf_filter slice to apply_filter, which returns immediately at step 1 without installing any filter.
- --seccomp-filter <path>: loads a pre-compiled bitcode-serialized .bpf file, overriding the embedded defaults. This file is produced by seccompiler-bin --target-arch x86_64 --input-file policy.json --output-file out.bpf. Custom filters are useful for experimental GNU libc targets (which ship without embedded default filters), debug builds, and rapid production mitigation without recompiling Firecracker.
seccompiler-bin also accepts --split-output (writes one <thread>.bpf file
per thread, used in the test suite) and --target-arch (x86_64 | aarch64).
A Brief History
Firecracker's seccomp policy has changed only a handful of times since the initial open-source release:
| Version | Change |
|---|---|
| v0.13.0 | Changed the default from seccomp-level 0 (no filtering) to level 2 (full argument-level filtering) for all threads. |
| v0.15.1 | Added madvise to the VMM allowlist after musl's allocator called it under memory pressure, causing random VM terminations. |
| v1.11.0 | Migrated the compiler backend from hand-written Rust BPF code generation to libseccomp (PR #4926, merged 2025-01-16); added the empty BPF slice path for debug builds. |
| PR #5298 | Replaced HashMap with BTreeMap in the compilation pipeline, fixing a reproducibility bug where identical inputs produced different BPF bytecode across builds. |
The v0.15.1 madvise incident is a useful calibration. A tight seccomp allowlist
is not a set-and-forget policy; it is a contract between the policy and the
implementation. When the implementation changes — a new dependency, a new libc
version, a new code path under memory pressure — the policy must change with it.
The cost of getting this wrong is an unexplained VM termination, not a helpful
error message. That is a steep debugging tax, and it is the price of running with
a minimal allowlist.
Sources And Further Reading
- Firecracker seccomp documentation: https://github.com/firecracker-microvm/firecracker/blob/main/docs/seccomp.md
- Firecracker seccompiler documentation: https://github.com/firecracker-microvm/firecracker/blob/main/docs/seccompiler.md
- x86-64 default policy file: https://github.com/firecracker-microvm/firecracker/blob/main/resources/seccomp/x86_64-unknown-linux-musl.json
- Firecracker apply_filter implementation: https://github.com/firecracker-microvm/firecracker/blob/main/src/vmm/src/seccomp.rs
- Firecracker seccompiler library: https://github.com/firecracker-microvm/firecracker/blob/main/src/seccompiler/src/lib.rs
- SeccompCondition::to_scmp_type() (dword/qword masking): https://github.com/firecracker-microvm/firecracker/blob/main/src/seccompiler/src/types.rs
- Firecracker design document: https://github.com/firecracker-microvm/firecracker/blob/main/docs/design.md
- Jailer documentation: https://github.com/firecracker-microvm/firecracker/blob/main/docs/jailer.md
- PR #4926 (libseccomp backend migration): https://github.com/firecracker-microvm/firecracker/pull/4926
- PR #5298 (BTreeMap for reproducible BPF output): https://github.com/firecracker-microvm/firecracker/pull/5298
- seccomp(2) man page: https://man7.org/linux/man-pages/man2/seccomp.2.html
- Kernel seccomp filter documentation: https://www.kernel.org/doc/html/latest/userspace-api/seccomp_filter.html
- include/uapi/linux/seccomp.h (action constants): https://github.com/torvalds/linux/blob/master/include/uapi/linux/seccomp.h
- x86-64 syscall table: https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl
- OCI Runtime Spec, Linux seccomp field: https://github.com/opencontainers/runtime-spec/blob/main/config-linux.md
- firecracker-containerd: https://github.com/firecracker-microvm/firecracker-containerd
- rust-vmm/seccompiler crate (v0.5.0, released 2025-03-07): https://github.com/rust-vmm/seccompiler/blob/main/src/lib.rs