Chapter 11: virtio — The Paravirtualized Device Model

Every device the guest needs — a network card, a disk, an entropy source — has to go through the VMM. The question is how. The naive answer is emulation: the VMM impersonates a real piece of hardware, the guest drives it with an unmodified driver, and the VMM translates guest I/O port writes and MMIO accesses into host operations. QEMU's emulation of the 82093AA I/OAPIC, the Intel 82576 Gigabit Ethernet controller, and a dozen other real chips is how most full VMs run. It is also slow, not because the emulation itself is expensive, but because the interface was designed for a device that has its own DMA engine, its own FIFO, and a long latency to silicon. The guest issues dozens of register writes to queue one operation, each of which exits the CPU into the VMM and back. The overhead is not in the emulated logic; it is in the round-trip count.

Paravirtualization cuts the round-trip count by giving up the pretense of real hardware. The guest runs a driver that knows it is talking to a VMM, and the protocol between them is designed for that context: large batches, a shared memory ring for communication, and a single notification per batch rather than one per operation. The physical-device illusion disappears; in its place is an explicit contract between the driver and the device backend.

virtio is that contract, standardized. The OASIS virtio Committee Specification v1.2 CS01 (published 1 July 2022) defines the shared-memory ring format, the feature negotiation handshake, the two transports (MMIO and PCI), and the wire protocol for each device class. It is the interface that Firecracker, crosvm, Cloud Hypervisor, and QEMU all implement — which means a Linux guest compiled once can run on any of them without modification, because the driver it loads is the kernel's standard virtio driver, not a VMM-specific one.

The Virtqueue

Every virtio device exposes one or more virtqueues: shared-memory rings through which the driver (the guest's kernel driver) submits work and the device (the VMM backend) returns completions. The virtio v1.2 spec defines two queue formats. The split virtqueue (spec section 2.7) uses three separate memory regions. The packed virtqueue (spec section 2.8), introduced in v1.1 and enabled by feature bit VIRTIO_F_RING_PACKED = 34, collapses those three regions into one circular ring plus two small event-suppression structures. Firecracker implements split virtqueues only; the packed format is not supported.

Three Rings, Three Owners

A split virtqueue consists of three physically independent memory regions, each owned exclusively by one side:

Guest Memory

 ┌──────────────────────────────────────────────────────┐
 │  Descriptor Table  (16 bytes × Queue Size)           │◄── Driver writes
 │  Alignment: 16 bytes                                  │    Device reads
 ├──────────────────────────────────────────────────────┤
 │  Available Ring    (6 + 2 × Queue Size bytes)        │◄── Driver writes
 │  Alignment: 2 bytes                                   │    Device reads
 ├──────────────────────────────────────────────────────┤
 │  Used Ring         (6 + 8 × Queue Size bytes)        │◄── Device writes
 │  Alignment: 4 bytes                                   │    Driver reads
 └──────────────────────────────────────────────────────┘

The alignment constants are defined in linux/include/uapi/linux/virtio_ring.h as VRING_DESC_ALIGN_SIZE = 16, VRING_AVAIL_ALIGN_SIZE = 2, and VRING_USED_ALIGN_SIZE = 4. Queue Size must be a power of two, at least 1, and at most 32,768 (0x8000). Firecracker caps every queue at 256 entries (FIRECRACKER_MAX_QUEUE_SIZE = 256).
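
The size formulas in the diagram can be checked with a few lines of C. This is a minimal sketch; the names vring_layout, queue_size_valid, and vring_layout_for are illustrative and do not come from the kernel headers or from Firecracker.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Byte sizes of the three split-virtqueue regions for one queue. */
struct vring_layout {
    size_t desc_bytes;   /* 16 bytes per descriptor */
    size_t avail_bytes;  /* flags + idx + ring[] + used_event */
    size_t used_bytes;   /* flags + idx + ring[] + avail_event */
};

/* A queue size is valid if it is a power of two in [1, 32768]. */
bool queue_size_valid(uint32_t qsize)
{
    return qsize >= 1 && qsize <= 0x8000 && (qsize & (qsize - 1)) == 0;
}

struct vring_layout vring_layout_for(uint16_t qsize)
{
    struct vring_layout l;

    assert(queue_size_valid(qsize));
    l.desc_bytes  = (size_t)16 * qsize;
    l.avail_bytes = 6 + (size_t)2 * qsize;
    l.used_bytes  = 6 + (size_t)8 * qsize;
    return l;
}
```

For a 256-entry queue this gives 4096, 518, and 2054 bytes respectively.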

The ownership rule is strict: the driver never writes to the used ring; the device never writes to the available ring or the descriptor table. This means accesses never race between writer and reader on the same memory. There is still a concurrency hazard — the guest and the VMM run concurrently — but it is bounded to the index fields, which the spec addresses with explicit memory barrier requirements.

The Descriptor Table

Each entry in the descriptor table is a struct virtq_desc (16 bytes, all fields in little-endian):

struct virtq_desc {
    le64 addr;   /* offset  0: guest-physical buffer address */
    le32 len;    /* offset  8: buffer length in bytes */
    le16 flags;  /* offset 12: control flags */
    le16 next;   /* offset 14: index of next descriptor (if chaining) */
};

Three flag bits control how the descriptor is used. VIRTQ_DESC_F_NEXT = 0x1 means the descriptor is not the last in a chain — the next field holds the index of the next descriptor. VIRTQ_DESC_F_WRITE = 0x2 marks the buffer as device-writable; without it the buffer is device-readable. VIRTQ_DESC_F_INDIRECT = 0x4 signals that addr and len point not to data but to an in-memory table of further virtq_desc entries, enabled by feature bit VIRTIO_F_RING_INDIRECT_DESC = 28. Within an indirect table, only VIRTQ_DESC_F_WRITE and VIRTQ_DESC_F_NEXT are valid; VIRTQ_DESC_F_INDIRECT is forbidden in indirect entries, and the device must ignore VIRTQ_DESC_F_WRITE on the outer descriptor that points to the table.

Descriptors chain together to describe a single I/O request. A virtio-blk read, for example, uses a chain of at least three descriptors: a device-readable header (16 bytes: request type, reserved padding, sector number), one or more device-writable data buffers, and a device-writable one-byte status field. All device-readable descriptors precede all device-writable ones in the chain; this is a hard split-virtqueue rule, not a convention.

The driver builds these chains by filling descriptor table entries, then publishes the chain by placing the head descriptor's index into the available ring.
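
A device backend has to walk such a chain defensively, because the guest controls every field. The sketch below follows VIRTQ_DESC_F_NEXT links, enforces the readable-before-writable rule, and bounds the walk by the queue size so a circular chain cannot loop forever. It assumes a little-endian host and ignores indirect descriptors; struct desc, struct chain_info, and walk_chain are hypothetical names, not any VMM's API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VIRTQ_DESC_F_NEXT  0x1
#define VIRTQ_DESC_F_WRITE 0x2

/* Host-endian mirror of struct virtq_desc (assumes a little-endian host). */
struct desc {
    uint64_t addr;
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};
_Static_assert(sizeof(struct desc) == 16, "descriptor must be 16 bytes");

struct chain_info {
    uint32_t readable_bytes;  /* sum of device-readable buffer lengths */
    uint32_t writable_bytes;  /* sum of device-writable buffer lengths */
    uint16_t count;           /* descriptors in the chain */
};

/* Returns 0 for a well-formed chain, -1 for a loop, an out-of-range index,
 * or a readable descriptor that follows a writable one. */
int walk_chain(const struct desc *table, uint16_t qsize, uint16_t head,
               struct chain_info *out)
{
    uint16_t i = head;
    uint16_t budget = qsize;  /* a valid chain never exceeds qsize entries */
    int seen_writable = 0;

    assert(table != NULL && out != NULL);
    out->readable_bytes = 0;
    out->writable_bytes = 0;
    out->count = 0;

    for (;;) {
        if (i >= qsize || budget-- == 0)
            return -1;
        const struct desc *d = &table[i];
        if (d->flags & VIRTQ_DESC_F_WRITE) {
            seen_writable = 1;
            out->writable_bytes += d->len;
        } else {
            if (seen_writable)
                return -1;
            out->readable_bytes += d->len;
        }
        out->count++;
        if (!(d->flags & VIRTQ_DESC_F_NEXT))
            return 0;
        i = d->next;
    }
}
```

A production implementation would also validate that each addr/len pair lies inside guest memory before touching the buffer.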

The Available Ring

The available ring is the driver's outbox. Its layout (from spec section 2.7.6):

struct virtq_avail {
    le16 flags;                   /* VIRTQ_AVAIL_F_NO_INTERRUPT = 1 */
    le16 idx;                     /* where driver will write next head index */
    le16 ring[/* Queue Size */];  /* head indices of published chains */
    le16 used_event;              /* only if VIRTIO_F_EVENT_IDX negotiated */
};

The idx field wraps naturally at 2^16. To publish n chains, the driver first stores the head indices in ring[idx % QueueSize] through ring[(idx + n - 1) % QueueSize], issues a write memory barrier so the device sees those entries before the new index, and only then increases idx by n. A second barrier orders the idx update before the driver checks notification suppression and notifies the device. The device reads ring[last_seen_idx % QueueSize] through ring[(avail.idx - 1) % QueueSize] to collect new chains.

Notice that idx is never reset — it grows monotonically, modulo 2^16. A device that tracks the last idx it saw can detect new work without any locking; the index is the only synchronization signal.
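
The wraparound arithmetic is easy to get wrong, so here is a small sketch of the device-side bookkeeping. It relies only on modular 16-bit subtraction; avail_pending and avail_slot are illustrative names.

```c
#include <assert.h>
#include <stdint.h>

/* Chains published since the device last looked. Unsigned 16-bit
 * subtraction is modular, so this stays correct across the 2^16 wrap. */
uint16_t avail_pending(uint16_t avail_idx, uint16_t last_seen_idx)
{
    return (uint16_t)(avail_idx - last_seen_idx);
}

/* Ring slot holding the k-th pending head index (k counts from 0). */
uint16_t avail_slot(uint16_t last_seen_idx, uint16_t k, uint16_t qsize)
{
    assert(qsize != 0 && (qsize & (qsize - 1)) == 0);  /* power of two */
    return (uint16_t)((uint16_t)(last_seen_idx + k) & (qsize - 1));
}
```

With last_seen_idx = 0xFFFE and avail.idx = 0x0002, four chains are pending, in slots 254, 255, 0, and 1 of a 256-entry ring.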

The Used Ring

The used ring is the device's completion outbox. Its layout (section 2.7.8):

struct virtq_used {
    le16 flags;                                     /* VIRTQ_USED_F_NO_NOTIFY = 1 */
    le16 idx;                                       /* where device will write next */
    struct virtq_used_elem ring[/* Queue Size */];
    le16 avail_event;                               /* only if VIRTIO_F_EVENT_IDX */
};

struct virtq_used_elem {
    le32 id;   /* head index of the completed descriptor chain */
    le32 len;  /* total bytes written by device into writable buffers */
};

Each virtq_used_elem is 8 bytes. When the device finishes a chain, it writes the head index and byte count into the current used slot, increments idx, and — unless notification suppression says otherwise — signals the guest interrupt. The driver scans from its last-seen idx to used.idx - 1 to harvest completions.

Notification Suppression

Left to themselves, driver and device fire an interrupt or a doorbell write after every descriptor batch. For high-throughput paths, that overhead adds up. virtio provides two suppression mechanisms.

The coarse mechanism uses the binary flags: the driver sets avail.flags = VIRTQ_AVAIL_F_NO_INTERRUPT to suppress device-to-driver interrupts; the device sets used.flags = VIRTQ_USED_F_NO_NOTIFY to suppress driver-to-device kicks. Either side can assert its flag at any time. The tradeoff is crude — all notifications or none.

The fine-grained mechanism, enabled by VIRTIO_F_RING_EVENT_IDX = 29, uses threshold fields instead of binary flags. The driver places a target idx value into avail.used_event; the device fires an interrupt only when used.idx reaches that value. The device places a target into used.avail_event; the driver kicks only when avail.idx reaches it. This lets either side defer a notification precisely until the peer has enough work queued to justify waking up. Firecracker implements the EVENT_IDX path and validates the notification-suppression logic and 16-bit index wraparound with Kani formal proofs.
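
The threshold test is a single wraparound-safe comparison. The sketch below mirrors the logic of vring_need_event() in the Linux uapi header virtio_ring.h; the function name need_event is local to this example. A notification is due when the event index falls in the half-open range [old_idx, new_idx) that the sender just advanced across.

```c
#include <assert.h>
#include <stdint.h>

/* event_idx: threshold published by the peer (used_event or avail_event).
 * old_idx:   ring index before this batch was added.
 * new_idx:   ring index after this batch was added.
 * Returns nonzero when the peer asked to be notified within this batch. */
int need_event(uint16_t event_idx, uint16_t new_idx, uint16_t old_idx)
{
    return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old_idx);
}
```

All three operands are 16-bit and the subtraction is modular, so the comparison holds across the index wrap.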

The virtio-queue Crate

Firecracker's virtqueue implementation lives in the rust-vmm virtio-queue crate (published at https://crates.io/crates/virtio-queue). The crate provides two queue types: Queue for single-threaded use and QueueSync (Arc<Mutex<Queue>>) for shared access, both implementing the QueueT trait. Key methods include set_desc_table_address, set_avail_ring_address, set_used_ring_address, set_size, set_ready, set_event_idx, is_valid, add_used, needs_notification, disable_notification, and enable_notification. AvailIter is a consuming iterator over available descriptor chain heads; DescriptorChain with DescriptorChainRwIter separates readable from writable segments cleanly.

The crate uses Rust read_volatile and write_volatile with explicit acquire/release memory fences for every ring access, matching the spec's barrier requirements without relying on the compiler to infer them. A Time-To-Live counter limits chain traversal depth to prevent infinite loops from a malicious guest that crafts a circular chain. Used-ring notifications are batched via prepare_kick() rather than checked after each add_used() call — the crate documents this as a deliberate deviation from spec section 2.6.7.2. The crate targets the virtio v1.1 split virtqueue spec.

Feature Negotiation

The spec imposes a strict handshake before the device becomes usable. This is the mechanism by which a driver compiled three years ago negotiates with a device model compiled last week: each side publishes what it supports; the intersection is what they use. Neither side assumes the other is current.

The Nine-Step Sequence

Spec section 3.1.1 defines the mandatory initialization sequence. The driver must follow these steps in order:

sequenceDiagram
    participant D as Guest Driver
    participant Dev as virtio Device
    D->>Dev: Write 0 to Status (reset)
    D->>Dev: Set ACKNOWLEDGE (1) in Status
    D->>Dev: Set DRIVER (2) in Status
    D->>Dev: Read DeviceFeatures page 0 (bits 0-31)
    D->>Dev: Read DeviceFeatures page 1 (bits 32-63)
    D->>Dev: Write negotiated bits to DriverFeatures
    D->>Dev: Set FEATURES_OK (8) in Status
    D->>Dev: Read back Status
    Dev-->>D: Status returned
    Note over D: If FEATURES_OK absent → abort (-ENODEV)
    D->>Dev: Configure virtqueues (QueueSel, QueueNum, ring GPAs, QueueReady=1)
    D->>Dev: Set DRIVER_OK (4) in Status
    Note over Dev: Device is now live

Features are read and written in two 32-bit pages via DeviceFeaturesSel and DriverFeaturesSel: page 0 covers bits 0–31, page 1 covers bits 32–63. This matters in practice because VIRTIO_F_VERSION_1 = 32 sits at bit 0 of page 1. A device presenting itself as modern must advertise this bit; a driver that does not acknowledge it is treated as a legacy driver, and a v2 MMIO device must reject initialization if the driver fails to acknowledge it.
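
The paging scheme and the intersection rule reduce to a few bit operations. The following is a sketch with hypothetical helper names (feature_page, negotiate, is_modern); it models the arithmetic only, not the register accesses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_F_VERSION_1 32

/* The 32-bit word a device returns for a given DeviceFeaturesSel value. */
uint32_t feature_page(uint64_t features, uint32_t sel)
{
    return sel > 1 ? 0 : (uint32_t)(features >> (32 * sel));
}

/* Driver side: reassemble both pages, keep only bits both sides support. */
uint64_t negotiate(uint32_t dev_page0, uint32_t dev_page1,
                   uint64_t driver_supported)
{
    uint64_t device = ((uint64_t)dev_page1 << 32) | dev_page0;
    return device & driver_supported;
}

/* VIRTIO_F_VERSION_1 is bit 32 overall, i.e. bit 0 of page 1. */
bool is_modern(uint64_t negotiated)
{
    return ((negotiated >> VIRTIO_F_VERSION_1) & 1) != 0;
}
```

A driver that reads only page 0 never sees bit 32 and therefore negotiates as legacy, which is why the second page read is not optional for modern devices.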

The six device status register bits (from linux/include/uapi/linux/virtio_config.h) are the handshake signals:

Constant                      Value  Meaning
VIRTIO_CONFIG_S_ACKNOWLEDGE   1      Driver found the device
VIRTIO_CONFIG_S_DRIVER        2      Driver knows how to drive it
VIRTIO_CONFIG_S_FEATURES_OK   8      Feature negotiation complete
VIRTIO_CONFIG_S_DRIVER_OK     4      Driver is live
VIRTIO_CONFIG_S_NEEDS_RESET   64     Device needs reset (unrecoverable)
VIRTIO_CONFIG_S_FAILED        128    Fatal error

Status starts at 0. The driver must not clear individual bits; only writing 0 resets the register and the device.
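
The "set bits only, or write zero" rule can be expressed as a one-line predicate. This is a sketch of that single rule, with a hypothetical function name; it does not check the ordering of the handshake steps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_CONFIG_S_ACKNOWLEDGE 1
#define VIRTIO_CONFIG_S_DRIVER      2
#define VIRTIO_CONFIG_S_DRIVER_OK   4
#define VIRTIO_CONFIG_S_FEATURES_OK 8

/* A driver write to Status is acceptable if it is a full reset (0)
 * or if it keeps every bit that is already set. */
bool status_write_ok(uint8_t current, uint8_t next)
{
    return next == 0 || (next & current) == current;
}
```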

Kernel Implementation

virtio_dev_probe() in drivers/virtio/virtio.c implements steps 2–7: it sets DRIVER, calls virtio_get_features(), ANDs the device and driver feature tables, calls dev->config->finalize_features(), sets FEATURES_OK, and reads back status. If FEATURES_OK is absent, it returns -ENODEV. virtio_device_ready() sets DRIVER_OK after queue setup completes. virtio_features_ok() in drivers/virtio/virtio.c checks that VIRTIO_F_VERSION_1 is in the negotiated set before writing DriverFeatures to a modern device.

The Transport-Layer Feature Bits

Most of these bits live in the range VIRTIO_TRANSPORT_F_START = 28 through VIRTIO_TRANSPORT_F_END = 42 and apply to every device type. VIRTIO_F_ANY_LAYOUT is listed here for completeness — it predates the formal transport range and sits at bit 27, just outside it.

The Linux kernel uapi headers (virtio_ring.h) name the indirect-descriptor and event-index bits VIRTIO_RING_F_INDIRECT_DESC and VIRTIO_RING_F_EVENT_IDX; the OASIS spec uses VIRTIO_F_RING_INDIRECT_DESC and VIRTIO_F_RING_EVENT_IDX for the same bits (28 and 29). This chapter follows the spec naming.

Constant (OASIS spec)         Bit  Meaning
VIRTIO_F_ANY_LAYOUT           27   Device handles any descriptor ordering (predates transport range)
VIRTIO_F_RING_INDIRECT_DESC   28   Indirect descriptor tables
VIRTIO_F_RING_EVENT_IDX       29   Descriptor-granularity notification suppression
VIRTIO_F_VERSION_1            32   Modern device (mandatory for modern devices)
VIRTIO_F_ACCESS_PLATFORM      33   IOMMU DMA required
VIRTIO_F_RING_PACKED          34   Packed virtqueue format
VIRTIO_F_IN_ORDER             35   Buffers used in availability order
VIRTIO_F_RING_RESET           40   Per-queue reset

Firecracker advertises VIRTIO_F_VERSION_1 on all its devices and VIRTIO_F_RING_EVENT_IDX on its net, block, and vsock devices. VIRTIO_F_RING_PACKED is never advertised because Firecracker does not implement packed virtqueues.

Config Space Atomicity

Device-specific configuration fields (capacity, MAC address, queue pair count, and so on) live in a config space region that can be updated at any time — for example, a network link-state change arriving mid-probe. Spec section 2.5 requires the driver to re-read config fields in a compare-and-retry loop using the config_generation field (MMIO offset 0x0fc) whenever a concurrent change is suspected. The device increments config_generation before and after each config update; if the driver reads a different value at the end of a read sequence than at the beginning, it retries.
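
The retry loop looks like this for a 64-bit field that must be read as two 32-bit halves. The struct mock_device below is a hypothetical stand-in for the device's register window, used so the sketch can run without real hardware; only the loop shape is the point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant device registers. */
struct mock_device {
    volatile uint32_t config_generation;
    volatile uint32_t capacity_lo;
    volatile uint32_t capacity_hi;
};

/* Read a 64-bit config field consistently: if the generation counter
 * changed while the two halves were being read, read them again. */
uint64_t read_capacity(const struct mock_device *dev)
{
    uint32_t before, after, lo, hi;

    assert(dev != NULL);
    do {
        before = dev->config_generation;
        lo = dev->capacity_lo;
        hi = dev->capacity_hi;
        after = dev->config_generation;
    } while (before != after);

    return ((uint64_t)hi << 32) | lo;
}
```

Without the loop, a capacity update landing between the two reads could yield a value that mixes the old low half with the new high half.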

The MMIO Transport

The MMIO transport (spec section 4.2) exposes the device as a flat register window mapped into the guest's physical address space. There is no bus, no enumeration protocol, no capability list — just a base address and an IRQ number that the VMM communicates to the guest out-of-band.

Register Map

All registers are 4 bytes wide, 4-byte-aligned, at fixed offsets from the base address (from linux/include/uapi/linux/virtio_mmio.h):

Register           Offset  Dir  Purpose
MagicValue         0x000   RO   Must read 0x74726976 ("virt" in LE ASCII)
Version            0x004   RO   2 = modern; 1 = legacy
DeviceID           0x008   RO   virtio device type
VendorID           0x00c   RO   Vendor identifier
DeviceFeatures     0x010   RO   32-bit feature page
DeviceFeaturesSel  0x014   WO   Feature page selector (0 or 1)
DriverFeatures     0x020   WO   Accepted feature bits
DriverFeaturesSel  0x024   WO   Driver feature page selector
QueueSel           0x030   WO   Select active queue (0-indexed)
QueueNumMax        0x034   RO   Maximum queue size
QueueNum           0x038   WO   Actual queue size (driver chooses)
QueueReady         0x044   RW   Write 1 to activate queue
QueueNotify        0x050   WO   Write queue index to kick device
InterruptStatus    0x060   RO   Bit 0 = used-buffer; bit 1 = config change
InterruptACK       0x064   WO   Acknowledge interrupt bits
Status             0x070   RW   Device status register
QueueDescLow       0x080   WO   Descriptor Table GPA bits 31:0
QueueDescHigh      0x084   WO   Descriptor Table GPA bits 63:32
QueueAvailLow      0x090   WO   Available Ring GPA bits 31:0
QueueAvailHigh     0x094   WO   Available Ring GPA bits 63:32
QueueUsedLow       0x0a0   WO   Used Ring GPA bits 31:0
QueueUsedHigh      0x0a4   WO   Used Ring GPA bits 63:32
ConfigGeneration   0x0fc   RO   Config space atomicity counter
Config             0x100+  RW   Device-specific config (up to 0xfff)

The legacy (Version 1) layout adds GuestPageSize at 0x028, QueueAlign at 0x03c, and QueuePFN at 0x040, and collapses the split 64-bit address pairs into a single page-frame number. Modern drivers do not touch these.
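
A driver's first contact with the window is a three-register probe. The sketch below runs against a plain array standing in for the mapped registers and assumes a little-endian host; mmio_probe and reg_read are illustrative names, not the Linux driver's functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    MMIO_MAGIC_VALUE = 0x000,
    MMIO_VERSION     = 0x004,
    MMIO_DEVICE_ID   = 0x008,
};

#define VIRTIO_MMIO_MAGIC 0x74726976u  /* bytes 'v','i','r','t' in memory order */

/* Registers are 4 bytes wide and 4-byte aligned. */
static uint32_t reg_read(const volatile uint32_t *regs, uint32_t offset)
{
    assert(regs != NULL && offset % 4 == 0);
    return regs[offset / 4];
}

/* Returns the virtio device ID of a modern device, or -1 if the window
 * does not hold one. DeviceID 0 marks a placeholder with no function. */
int mmio_probe(const volatile uint32_t *regs)
{
    uint32_t id;

    if (reg_read(regs, MMIO_MAGIC_VALUE) != VIRTIO_MMIO_MAGIC)
        return -1;
    if (reg_read(regs, MMIO_VERSION) != 2)
        return -1;
    id = reg_read(regs, MMIO_DEVICE_ID);
    return id == 0 ? -1 : (int)id;
}
```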

Device Discovery

MMIO has no self-describing discovery mechanism (spec section 4.2.1). The guest must learn each device's base address and IRQ from the VMM. Linux's drivers/virtio/virtio_mmio.c driver supports three paths: a device tree node with compatible = "virtio,mmio", a kernel command-line parameter virtio_mmio.device=<size>@<baseaddr>:<irq>[:<id>] (requires CONFIG_VIRTIO_MMIO_CMDLINE_DEVICES), and static platform device registration in board code.
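
Generating the command-line form is a single formatted write. In this sketch the function name and the example values (a 4 KiB window at 0xd0000000 on IRQ 5) are illustrative assumptions, and the size is expressed with the K suffix.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Writes "virtio_mmio.device=<size>K@<base>:<irq>" into buf.
 * Returns the string length, or -1 if buf is too small. */
int format_mmio_cmdline(char *buf, size_t cap, uint32_t size_kib,
                        uint64_t base, uint32_t irq)
{
    int n;

    assert(buf != NULL && cap > 0);
    n = snprintf(buf, cap, "virtio_mmio.device=%uK@0x%llx:%u",
                 (unsigned)size_kib, (unsigned long long)base, (unsigned)irq);
    return (n < 0 || (size_t)n >= cap) ? -1 : n;
}
```

Called with (4, 0xd0000000, 5), it produces virtio_mmio.device=4K@0xd0000000:5.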

Firecracker uses the command-line path: it appends one virtio_mmio.device=... entry per device to the kernel command line at boot, advertising each device's MMIO window size, base address, and IRQ. An open issue (#2519) proposes replacing this with a device tree blob passed via the setup_data boot protocol field, but as of this writing the issue remains open and the change has not landed.

Firecracker's MMIO Backend

Firecracker's MMIO transport is implemented in src/vmm/src/devices/virtio/transport/mmio.rs as MmioTransport, which implements the BusDevice trait. Guest writes to MMIO space cause VM exits; the VMM dispatches them through BusDevice::read and BusDevice::write.

A few Firecracker-specific constants are worth naming. MMIO_VERSION = 2 is hardcoded — the device always presents as modern. VENDOR_ID = 0 deviates from the spec's recommended value of 0x1AF4 (Red Hat, Inc.) and mirrors the crosvm convention. Reading DeviceFeatures with DeviceFeaturesSel = 1 ORs in 0x1 unconditionally, so VIRTIO_F_VERSION_1 (bit 32) is always visible to the driver regardless of what the inner device model advertises. set_device_status() enforces the spec state machine with a VALID_TRANSITIONS table; any status write that is not a legal transition logs a warning. Transition to DRIVER_OK calls locked_device().activate(), which hands control to the device backend — at that point, the virtqueues are live and the device can begin processing descriptors.

The guest kernel requires CONFIG_VIRTIO_MMIO=y and CONFIG_VIRTIO_MMIO_CMDLINE_DEVICES=y to use Firecracker MMIO devices.

The PCI Transport

The PCI transport (spec section 4.1) is self-describing. The guest scans the PCI bus, finds devices with Vendor ID 0x1AF4 (Red Hat, Inc.), and walks each device's PCI capability list to find the five vendor-specific capability structures that virtio-PCI defines. No out-of-band communication is needed: the bus itself tells the driver where everything is.

Device IDs

PCI device IDs split into two ranges. Legacy (transitional) devices use IDs 0x1000–0x103F; modern devices use 0x1040 + the virtio device ID, so virtio-net is 0x1041, virtio-blk is 0x1042, virtio-rng is 0x1044, and virtio-vsock is 0x1053.
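
The mapping is simple enough to state as code. The helper names below are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Modern virtio-PCI device ID for a given virtio device type. */
uint16_t modern_pci_device_id(uint16_t virtio_id)
{
    assert(virtio_id <= 0xFFFF - 0x1040);
    return (uint16_t)(0x1040 + virtio_id);
}

/* True if a PCI device ID falls in the transitional range. */
bool is_transitional_id(uint16_t pci_id)
{
    return pci_id >= 0x1000 && pci_id <= 0x103F;
}
```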

Five Capability Structures

Each capability uses cap_vndr = PCI_CAP_ID_VNDR (identifying it as a vendor-specific capability) and a cfg_type field that says which of the five roles it plays (from linux/include/uapi/linux/virtio_pci.h):

cfg_type                    Value  Purpose
VIRTIO_PCI_CAP_COMMON_CFG   1      Common configuration struct (virtio_pci_common_cfg)
VIRTIO_PCI_CAP_NOTIFY_CFG   2      Queue doorbell addresses
VIRTIO_PCI_CAP_ISR_CFG      3      Interrupt status byte
VIRTIO_PCI_CAP_DEVICE_CFG   4      Device-specific configuration
VIRTIO_PCI_CAP_PCI_CFG      5      Alternative PCI config-space access window

struct virtio_pci_cap records which BAR holds the region (bar, 0–5), the byte offset within that BAR (offset), and the region's length (length). The notification capability also carries notify_off_multiplier; the doorbell address for queue N is cap.offset + queue_notify_off × notify_off_multiplier.
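
As a worked example of the doorbell formula, the sketch below computes the offset within the BAR for one queue. The function name is hypothetical; the arithmetic is the formula stated above.

```c
#include <assert.h>
#include <stdint.h>

/* Offset, within the BAR named by the notify capability, of the
 * doorbell for the queue whose queue_notify_off was read from
 * the common configuration struct. */
uint64_t notify_bar_offset(uint32_t cap_offset, uint16_t queue_notify_off,
                           uint32_t notify_off_multiplier)
{
    return (uint64_t)cap_offset +
           (uint64_t)queue_notify_off * notify_off_multiplier;
}
```

With cap.offset = 0x3000, queue_notify_off = 2, and a multiplier of 4, the doorbell sits at BAR offset 0x3008. A multiplier of 0 makes every queue share one doorbell address.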

struct virtio_pci_common_cfg exposes the feature selectors and data fields, the queue count, the device status register, config_generation, the queue selector and size, queue_enable, and the split 64-bit GPA fields queue_desc_lo/hi, queue_avail_lo/hi, and queue_used_lo/hi — a superset of the MMIO register map, accessed through a memory-mapped struct rather than individual register offsets.

Why Firecracker Originally Chose MMIO

PCI bus enumeration, ACPI table parsing, and MSI/MSI-X interrupt wiring each add work to the guest boot path. For Firecracker's original target — the serverless VM that must start in under 150 ms — those milliseconds matter. MMIO requires none of that infrastructure, the device model is simpler, and the command-line discovery mechanism is a handful of string appends.

PCI transport was later added to Firecracker behind --enable-pci. Benchmarks from the Firecracker team (discussion #4845) show the tradeoff concretely: block synchronous reads improve by about 50%, block synchronous writes by 46% (on a 1-vCPU VM), network transmit throughput by 2–11%, and network receive throughput by 9–17%. Latency drops roughly 27%. The cost is an approximately 8% slower boot on VMs under 4 GiB. The fundamental reason is interrupt delivery: MMIO uses level-triggered interrupts that require a VM exit per notification; MSI-X can deliver interrupts without a VMM-side exit. PCI is the right answer when throughput dominates; MMIO is the right answer when boot time does.

Enabling PCI mode requires additional guest kernel configuration: CONFIG_PCI, CONFIG_PCI_MMCONFIG, CONFIG_PCI_MSI, CONFIG_PCIEPORTBUS, CONFIG_VIRTIO_PCI, CONFIG_BLK_MQ_PCI, CONFIG_PCI_HOST_COMMON, and CONFIG_PCI_HOST_GENERIC. The guest must not pass pci=off on its command line.

The Devices That Matter

The five device types Firecracker exposes cover everything a modern serverless workload needs: a network path, a block device, a host-guest socket channel, memory pressure signaling, and entropy. Each is a separate protocol layered on top of the virtqueue machinery.

virtio-net (Device ID 1)

The network device presents the guest with an Ethernet interface backed by a TAP device on the host.

TAP setup. Firecracker opens /dev/net/tun with O_RDWR | O_NONBLOCK | O_CLOEXEC and calls TUNSETIFF (_IOW('T', 202, int)) with three flags: IFF_TAP = 0x0002 (Ethernet frames, not raw IP), IFF_NO_PI = 0x1000 (do not prepend the four-byte struct tun_pi packet info header), and IFF_VNET_HDR = 0x4000 (prepend or strip a virtio_net_hdr on each frame). Two more ioctls complete the setup: TUNSETOFFLOAD (_IOW('T', 208, unsigned int)) advertises which checksum offloads the tap device can handle, and TUNSETVNETHDRSZ (_IOW('T', 216, int)) tells the kernel to use the 12-byte virtio_net_hdr_v1 format rather than the legacy 10-byte form. Interface name is a 16-byte array matching IFNAMSIZ. These constants and structures are defined in linux/include/uapi/linux/if_tun.h.

Before running commands that open /dev/net/tun or manage TAP devices, the process needs either CAP_NET_ADMIN or a pre-created TAP interface. On a production host Firecracker relies on the jailer to set up the TAP before dropping privileges; on a development machine, sudo ip tuntap add dev tap0 mode tap creates one manually.
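
The ioctl sequence described above can be sketched in C as follows. This is Linux-only, needs the privileges just mentioned to succeed, and is not Firecracker's code (which is Rust); the function name tap_open and the choice of TUN_F_CSUM as the only offload flag are assumptions made for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/if.h>
#include <linux/if_tun.h>

/* Ethernet frames, no tun_pi prefix, virtio_net_hdr on every frame. */
#define TAP_FLAGS (IFF_TAP | IFF_NO_PI | IFF_VNET_HDR)

/* Attach to (or create) the named TAP interface.
 * Returns an open file descriptor, or -1 on any failure. */
int tap_open(const char *name)
{
    struct ifreq ifr;
    int hdr_size = 12;  /* size of virtio_net_hdr_v1 */
    int fd;

    assert(name != NULL && strlen(name) < IFNAMSIZ);

    fd = open("/dev/net/tun", O_RDWR | O_NONBLOCK | O_CLOEXEC);
    if (fd < 0)
        return -1;

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);
    ifr.ifr_flags = TAP_FLAGS;

    if (ioctl(fd, TUNSETIFF, &ifr) < 0 ||
        ioctl(fd, TUNSETOFFLOAD, (unsigned long)TUN_F_CSUM) < 0 ||
        ioctl(fd, TUNSETVNETHDRSZ, &hdr_size) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Note that TUNSETOFFLOAD takes its flags by value, while TUNSETIFF and TUNSETVNETHDRSZ take pointers.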

The virtio-net header. Every frame crossing the TAP/virtqueue boundary carries a virtio_net_hdr_v1 (12 bytes) that describes the offload state of the packet (defined in linux/include/uapi/linux/virtio_net.h):

Offset  Field        Notes
0       flags        VIRTIO_NET_HDR_F_NEEDS_CSUM = 1
1       gso_type     NONE=0, TCPV4=1, UDP=3, TCPV6=4, ECN flag=0x80
2–3     hdr_len      Total L2+L3+L4 header length
4–5     gso_size     Desired MSS for segmentation
6–7     csum_start   Byte offset where checksum computation begins
8–9     csum_offset  Offset from csum_start to place the checksum
10–11   num_buffers  Merged receive buffer count (if VIRTIO_NET_F_MRG_RXBUF)
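
The table maps directly onto a C struct whose layout the compiler can verify. The sketch assumes a little-endian host (the wire format is little-endian); struct net_hdr_v1 and rx_hdr_single_buffer are local names, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct net_hdr_v1 {
    uint8_t  flags;
    uint8_t  gso_type;
    uint16_t hdr_len;
    uint16_t gso_size;
    uint16_t csum_start;
    uint16_t csum_offset;
    uint16_t num_buffers;
};

_Static_assert(sizeof(struct net_hdr_v1) == 12, "header must be 12 bytes");
_Static_assert(offsetof(struct net_hdr_v1, csum_start) == 6, "csum_start at 6");
_Static_assert(offsetof(struct net_hdr_v1, num_buffers) == 10, "num_buffers at 10");

/* Header a device could prepend to a received frame that needs no
 * offload handling and fits in a single receive buffer. */
struct net_hdr_v1 rx_hdr_single_buffer(void)
{
    struct net_hdr_v1 h;

    memset(&h, 0, sizeof(h));  /* flags = 0, gso_type = NONE */
    h.num_buffers = 1;
    return h;
}
```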

Queues. Firecracker implements exactly two virtqueues: RX_INDEX = 0 and TX_INDEX = 1, each capped at NET_QUEUE_MAX_SIZE = 256 descriptors. The spec allows a multi-queue extension (VIRTIO_NET_F_MQ = 22) with one transmit and one receive queue per CPU, but Firecracker does not implement it; each virtio-net device has a single queue pair. MAX_BUFFER_SIZE = 65562 bytes (64 KiB plus the virtio-net header overhead) is the largest receive buffer the device will accept.

Feature bits Firecracker advertises. From linux/include/uapi/linux/virtio_net.h, Firecracker sets: VIRTIO_NET_F_CSUM (0), VIRTIO_NET_F_GUEST_CSUM (1), VIRTIO_NET_F_GUEST_TSO4 (7), VIRTIO_NET_F_GUEST_TSO6 (8), VIRTIO_NET_F_GUEST_UFO (10), VIRTIO_NET_F_HOST_TSO4 (11), VIRTIO_NET_F_HOST_TSO6 (12), VIRTIO_NET_F_HOST_UFO (14), VIRTIO_NET_F_MRG_RXBUF (15), plus VIRTIO_F_RING_EVENT_IDX (29) and VIRTIO_F_VERSION_1 (32). VIRTIO_NET_F_MAC (5) is added when a MAC address is configured; VIRTIO_NET_F_MTU (3) when an MTU override is set.

The config space struct (virtio_net_config) carries MAC (6 bytes), status (2 bytes), max_virtqueue_pairs (2 bytes), MTU (2 bytes), speed in Mbps (4 bytes), and duplex (1 byte).

virtio-blk (Device ID 2)

The block device exposes a single virtqueue to the guest. Firecracker implements BLOCK_NUM_QUEUES = 1, sized to 256 descriptors. IO_URING_NUM_ENTRIES = 128 is half the queue depth: every block request needs at least two descriptors (typically two to three), so a 256-descriptor queue can never hold more than 128 requests in flight, and a 128-entry io_uring ring cannot be overrun.

The sector model. SECTOR_SIZE = 512 bytes (1 << 9). The capacity field in the config struct is a u64 reporting the total sector count. A guest request for sector N reads or writes at byte offset N × 512 from the start of the backing file or device.

Request layout. Each read or write request is a descriptor chain with three parts:

  1. A 16-byte device-readable header: type (u32), reserved (u32), sector (u64).
  2. One or more data buffers — device-readable for writes, device-writable for reads.
  3. A one-byte device-writable status field: VIRTIO_BLK_S_OK = 0, VIRTIO_BLK_S_IOERR = 1, or VIRTIO_BLK_S_UNSUPP = 2.

The type field in the header selects the operation: VIRTIO_BLK_T_IN = 0 (read), VIRTIO_BLK_T_OUT = 1 (write), VIRTIO_BLK_T_FLUSH = 4 (cache flush), VIRTIO_BLK_T_GET_ID = 8 (identify device: returns a 20-byte ASCII string). All defined in linux/include/uapi/linux/virtio_blk.h.
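
A backend's first job on each request is to decode the header and turn the sector number into a bounds-checked byte offset. The sketch below assumes a little-endian host and requires the data length to be a whole number of sectors; struct blk_req_hdr and blk_byte_range are illustrative names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VIRTIO_BLK_T_IN  0
#define VIRTIO_BLK_T_OUT 1
#define SECTOR_SHIFT     9   /* 512-byte sectors */

/* The 16-byte device-readable request header. */
struct blk_req_hdr {
    uint32_t type;
    uint32_t reserved;
    uint64_t sector;
};
_Static_assert(sizeof(struct blk_req_hdr) == 16, "header must be 16 bytes");

/* Compute the backing-file byte offset for a request, checking that
 * the whole transfer lies within the device capacity.
 * Returns 0 on success, -1 if the request is out of range. */
int blk_byte_range(const struct blk_req_hdr *hdr, uint64_t data_len,
                   uint64_t capacity_sectors, uint64_t *offset)
{
    uint64_t nsectors;

    assert(hdr != NULL && offset != NULL);
    if (data_len % (1u << SECTOR_SHIFT) != 0)
        return -1;
    nsectors = data_len >> SECTOR_SHIFT;
    if (hdr->sector > capacity_sectors ||
        nsectors > capacity_sectors - hdr->sector)
        return -1;
    *offset = hdr->sector << SECTOR_SHIFT;
    return 0;
}
```

The range check is written as a subtraction so that a hostile sector value near 2^64 cannot overflow the comparison.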

Feature bits Firecracker advertises. VIRTIO_F_VERSION_1 (32) and VIRTIO_F_RING_EVENT_IDX (29) always. VIRTIO_BLK_F_FLUSH (9) when the backing disk is in writeback-cache mode. VIRTIO_BLK_F_RO (5) when the disk is read-only.

virtio-vsock (Device ID 19)

vsock gives the guest and host a socket channel without a network interface. The guest opens a socket with socket(AF_VSOCK, SOCK_STREAM, 0) and addresses the host by its well-known CID. This is the channel Firecracker uses for its API proxy feature and for guest agent communication in richer microVM platforms.

The address family AF_VSOCK was introduced in Linux 3.9; the virtio transport that carries it between a KVM guest and its host arrived in Linux 4.8. Each endpoint is addressed by a (CID, port) pair. Reserved CIDs: VMADDR_CID_HYPERVISOR = 0, VMADDR_CID_LOCAL = 1, VMADDR_CID_HOST = 2, VMADDR_CID_ANY = 0xFFFFFFFF. Firecracker sets VSOCK_HOST_CID = 2 for the host-side endpoint.
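
From the guest's side, addressing the host means filling in a struct sockaddr_vm. The sketch below only builds the address; the helper name host_addr is hypothetical, and any port number passed to it is the caller's choice.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/vm_sockets.h>

/* Address of a service on the host, as seen from inside the guest. */
struct sockaddr_vm host_addr(unsigned int port)
{
    struct sockaddr_vm addr;

    memset(&addr, 0, sizeof(addr));
    addr.svm_family = AF_VSOCK;
    addr.svm_cid = VMADDR_CID_HOST;  /* CID 2 */
    addr.svm_port = port;
    return addr;
}
```

A guest program would then create a socket with socket(AF_VSOCK, SOCK_STREAM, 0) and pass this address to connect(), cast to struct sockaddr * with sizeof(struct sockaddr_vm) as the length.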

Queues. Firecracker implements three queues (VSOCK_NUM_QUEUES = 3): RXQ (index 0) for data from host to guest, TXQ (index 1) for data from guest to host, and EVQ (index 2) for event messages. All three are sized to 256 descriptors. Each descriptor chain encodes exactly one vsock packet: a 44-byte header followed by an optional payload up to MAX_PKT_BUF_SIZE = 65536 bytes.

The header. virtio_vsock_hdr is 44 bytes, packed:

Offset  Field      Type  Notes
0–7     src_cid    le64  Source context ID
8–15    dst_cid    le64  Destination context ID
16–19   src_port   le32  Source port
20–23   dst_port   le32  Destination port
24–27   len        le32  Payload byte count
28–29   type       le16  Socket type (1=STREAM, 2=SEQPACKET)
30–31   op         le16  Operation code
32–35   flags      le32  Operation-specific flags
36–39   buf_alloc  le32  Receiver buffer allocation (flow control)
40–43   fwd_cnt    le32  Bytes consumed by receiver (flow control)

The op field drives the connection state machine. VIRTIO_VSOCK_OP_REQUEST = 1 initiates a connection; VIRTIO_VSOCK_OP_RESPONSE = 2 accepts it; VIRTIO_VSOCK_OP_RST = 3 rejects or aborts; VIRTIO_VSOCK_OP_SHUTDOWN = 4 begins a graceful close; VIRTIO_VSOCK_OP_RW = 5 carries data; VIRTIO_VSOCK_OP_CREDIT_UPDATE = 6 and VIRTIO_VSOCK_OP_CREDIT_REQUEST = 7 implement receive-window flow control through buf_alloc and fwd_cnt in the header — the receiver advertises available buffer space, and the sender tracks how much of it it has consumed.
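
The header layout and the credit arithmetic can both be pinned down in a few lines. The sketch assumes a little-endian host and the GCC/Clang packed attribute; struct vsock_hdr and peer_free_space are local names. The credit rule used is that the sender may have at most buf_alloc bytes outstanding, where outstanding means bytes sent minus the peer's last reported fwd_cnt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct vsock_hdr {
    uint64_t src_cid;
    uint64_t dst_cid;
    uint32_t src_port;
    uint32_t dst_port;
    uint32_t len;
    uint16_t type;
    uint16_t op;
    uint32_t flags;
    uint32_t buf_alloc;
    uint32_t fwd_cnt;
} __attribute__((packed));

_Static_assert(sizeof(struct vsock_hdr) == 44, "header must be 44 bytes");
_Static_assert(offsetof(struct vsock_hdr, op) == 30, "op at offset 30");

/* Bytes the sender may still transmit before it must wait for a
 * credit update. tx_cnt is the sender's running total of payload
 * bytes sent; both counters are free-running and wrap modulo 2^32. */
uint32_t peer_free_space(uint32_t peer_buf_alloc, uint32_t peer_fwd_cnt,
                         uint32_t tx_cnt)
{
    uint32_t in_flight = tx_cnt - peer_fwd_cnt;

    return in_flight >= peer_buf_alloc ? 0 : peer_buf_alloc - in_flight;
}
```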

Feature bits. Firecracker advertises VIRTIO_F_VERSION_1 (32), VIRTIO_F_IN_ORDER (35), and VIRTIO_F_RING_EVENT_IDX (29), combined as AVAIL_FEATURES = (1 << 32) | (1 << 35) | (1 << 29). The device-specific feature VIRTIO_VSOCK_F_SEQPACKET = 1 (SOCK_SEQPACKET support) is not advertised.

virtio-balloon (Device ID 5)

The balloon device lets the host reclaim memory from a running guest without stopping it. The host writes a target page count into the num_pages field of the config struct; the guest driver inflates the balloon — pins that many 4 KiB pages, reports them to the host, and the host calls madvise(MADV_DONTNEED) on those guest-physical ranges. The guest kernel can no longer access them efficiently; the host OS reclaims the physical pages. When the host writes a smaller target, the guest deflates: it releases the pinned pages back to its own allocator and the host stops MADV_DONTNEED-ing them. Re-access by the guest returns zero-filled pages on demand.

This is the mechanism by which Firecracker supports memory overcommit: a fleet of microVMs can collectively commit more memory than the host has, and the balloon keeps actual physical usage within bounds.

Config struct. virtio_balloon_config (from linux/include/uapi/linux/virtio_balloon.h):

Field                  Type  Meaning
num_pages              le32  Host-requested balloon size in 4 KiB pages
actual                 le32  Current balloon size in 4 KiB pages
free_page_hint_cmd_id  le32  Command ID for free page hinting
poison_val             le32  Page poison value

All balloon accounting is in 4 KiB pages; 256 pages equals 1 MiB. The host sets num_pages; the guest updates actual as it completes inflation or deflation.
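
Since operators usually think in MiB while the config struct counts pages, the conversion is worth writing down once. The helper names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define BALLOON_PAGES_PER_MIB 256u  /* 1 MiB / 4 KiB */

/* Balloon target in 4 KiB pages for a size given in MiB. */
uint32_t balloon_mib_to_pages(uint32_t mib)
{
    assert(mib <= UINT32_MAX / BALLOON_PAGES_PER_MIB);  /* fits in le32 */
    return mib * BALLOON_PAGES_PER_MIB;
}

/* Whole MiB represented by a page count, rounding down. */
uint32_t balloon_pages_to_mib(uint32_t pages)
{
    return pages / BALLOON_PAGES_PER_MIB;
}
```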

Queues. The inflate queue (index 0) and deflate queue (index 1) are always present, sized to 128 descriptors in Firecracker. Additional queues appear conditionally: the stats queue (index 2) when VIRTIO_BALLOON_F_STATS_VQ is negotiated, a free-page-hinting queue (index 3) when VIRTIO_BALLOON_F_FREE_PAGE_HINT is set, and a page-reporting queue (index 4) when VIRTIO_BALLOON_F_REPORTING is set.

Feature bits Firecracker advertises. VIRTIO_F_VERSION_1 (32), VIRTIO_BALLOON_F_DEFLATE_ON_OOM (2) (the guest deflates the balloon on its own when the guest itself runs out of memory), VIRTIO_BALLOON_F_STATS_VQ (1), VIRTIO_BALLOON_F_FREE_PAGE_HINT (3), and VIRTIO_BALLOON_F_REPORTING (5).

Statistics. The stats queue carries tagged 10-byte entries, each a packed pair of a u16 tag and a u64 value. The defined tags include swap-in and swap-out counts (VIRTIO_BALLOON_S_SWAP_IN = 0, VIRTIO_BALLOON_S_SWAP_OUT = 1), page fault counts (VIRTIO_BALLOON_S_MAJFLT = 2, VIRTIO_BALLOON_S_MINFLT = 3), free and total memory (VIRTIO_BALLOON_S_MEMFREE = 4, VIRTIO_BALLOON_S_MEMTOT = 5), available memory (VIRTIO_BALLOON_S_AVAIL = 6), page cache size (VIRTIO_BALLOON_S_CACHES = 7), and OOM kill count (VIRTIO_BALLOON_S_OOM_KILL = 10). The statistics polling interval is configurable in Firecracker; setting it to 0 disables polling. The guest kernel requires CONFIG_MEMORY_BALLOON=y and CONFIG_VIRTIO_BALLOON=y.

virtio-rng (Device ID 4)

The entropy device is the simplest virtio device in the spec: linux/include/uapi/linux/virtio_rng.h contains no device-specific feature bits — it includes only virtio_ids.h and virtio_config.h. There is no device-specific config space. The entire protocol fits in a paragraph.

Firecracker exposes one queue (RNG_NUM_QUEUES = 1) and advertises only VIRTIO_F_VERSION_1 (32). The queue direction is device-writable only: the guest posts write-only descriptors pointing to buffers it wants filled with entropy, and the device fills them. The guest never sends data to the device.

MAX_ENTROPY_BYTES = 65536 (64 KiB) is the cap on bytes served per request, preventing host memory exhaustion from a malicious guest that crafts overlapping descriptor chains pointing to enormous ranges. Firecracker draws entropy from aws_lc_rs::rand (the AWS LibCrypto Rust bindings), not from /dev/random or getrandom() directly. Rate limiting is available via the Firecracker API, with independent controls for bytes-per-second and operations-per-second.

The simplicity is the point. A random number device has no protocol state, no connection setup, no flow control, and no error conditions beyond buffer exhaustion. It is what virtio looks like when nothing is left to remove.

Wiring It Together

flowchart TB
    gk["Guest Kernel Driver"]
    vq["Virtqueue<br/>(desc / avail / used rings)"]
    mmio["virtio-MMIO or PCI Transport"]
    vmm["VMM Device Backend<br/>(Firecracker)"]
    tap["TAP /dev/net/tun"]
    blkfile["Block backing file"]
    vsockunix["Unix socket<br/>(vsock muxer)"]
    awslc["aws_lc_rs::rand"]

    gk -->|"write head idx to avail.ring,<br/>kick QueueNotify"| vq
    vq -->|"VM exit on register write"| mmio
    mmio -->|"dispatch to activate()d device"| vmm
    vmm -->|"net: read/write virtio_net_hdr + frame"| tap
    vmm -->|"blk: read/write 512-byte sectors"| blkfile
    vmm -->|"vsock: virtio_vsock_hdr + payload"| vsockunix
    vmm -->|"rng: fill entropy bytes"| awslc
    vmm -->|"write id+len to used.ring,<br/>interrupt guest"| vq
    vq -->|"driver polls used.idx"| gk

Firecracker's net, block, and vsock devices share the same queue depth (256) and the same notification suppression path (VIRTIO_F_RING_EVENT_IDX); the balloon uses 128-entry queues, and neither it nor the entropy device advertises EVENT_IDX. All five rely on the same memory fence discipline (read_volatile/write_volatile with acquire/release barriers). Beyond that, the per-device protocol differences are entirely in the descriptor chain layout and the config space struct; the queue machinery is shared.

Sources And Further Reading