Chapter 4: CPU Virtualization Extensions
Before hardware added explicit virtualization support, a VMM had to catch every
privileged guest instruction by running the guest in user mode and trapping each
fault. The approach was called "trap and emulate," and it worked well enough for
most instructions — but x86 had a category of instructions that were sensitive
without being privileged. Instructions like SGDT, SIDT, and PUSHF read or
modify state that differs between the host and guest, yet they execute silently
at ring 3 instead of faulting. A guest OS issuing SGDT to read the GDT
register would get the host's descriptor-table base, not its own. There was no
trap, so there was nothing to catch. VMware's early x86 hypervisors solved the
problem with binary translation — a JIT compiler that rewrote guest code on the
fly before execution, patching out the troublesome instructions. It worked, but
it required a scanning pass over every basic block — and VMware's own published
benchmarks showed 5–20% overhead on kernel-intensive workloads.
Intel's answer, published in 2005 as VT-x (Virtualization Technology for IA-32 and Intel 64; the Itanium counterpart is VT-i), and AMD's concurrent answer, AMD-V (also called SVM, Secure Virtual Machine), both solve the problem the same way: add a hardware-managed execution mode in which the CPU runs guest ring-0 code natively at full speed, intercepts exactly the operations the hypervisor cares about, and saves and restores the world in a processor-managed data structure. ARM took a structurally different route, adding a dedicated privilege level, EL2, that sits above the OS's EL1 and controls what the guest can see. All three mechanisms share the same goal: make the guest kernel's privilege level architecturally distinct from the hypervisor's, enforced by hardware rather than software convention.
Intel VT-x
Detecting and Entering VMX Mode
The first thing any VMM must do is confirm the CPU supports VMX. That check is
CPUID leaf 1, ECX bit 5: CPUID.01H:ECX.VMX[bit 5] = 1. A zero there
means no VT-x, full stop.
Three setup steps must happen before VMXON will succeed. First, the VMM sets
CR4.VMXE (bit 13). Executing VMXON with that bit clear raises #UD, an
invalid-opcode fault — the instruction does not even exist to the CPU in that
state. Second, the VMM reads IA32_FEATURE_CONTROL (MSR address 0x3A) and
verifies bit 0 (the lock bit) and bit 2 are both set. Bit 0, once written to 1,
is latched until power-on reset; BIOS programs it during POST. Bit 2 authorizes
VMXON outside SMX operation, which is the normal case. If the lock bit is
clear, firmware has not committed the machine to VMX operation and VMXON will
fault with #GP(0). Third, the VMM allocates a 4 KiB-aligned VMXON region,
writes the 31-bit VMCS revision identifier (from IA32_VMX_BASIC MSR 0x480,
bits 30:0) into its first four bytes with bit 31 cleared, and passes its
physical address as the operand to VMXON.
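The first two preconditions can be sketched as a pure function over the register values. This is a minimal model, not driver code: the function name `vmxon_blocker` and the constant names are illustrative, and the values would in practice come from reading CR4 and MSR 0x3A at ring 0.

```python
# Bit positions as described in the text; all names here are illustrative.
CR4_VMXE = 1 << 13
FEATURE_CONTROL_LOCK = 1 << 0
FEATURE_CONTROL_VMXON_OUTSIDE_SMX = 1 << 2


def vmxon_blocker(cr4, feature_control):
    """Return the fault VMXON would raise, or None if both checks pass."""
    if not (cr4 & CR4_VMXE):
        return "#UD"      # CR4.VMXE clear: VMXON is an invalid opcode
    needed = FEATURE_CONTROL_LOCK | FEATURE_CONTROL_VMXON_OUTSIDE_SMX
    if (feature_control & needed) != needed:
        return "#GP(0)"   # firmware has not locked VMX on
    return None
```

A real VMM would run the equivalent checks before touching VMXON so that it can report a clear error instead of taking a fault.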
IA32_VMX_BASIC carries a few other fields worth naming. Bits 44:32 give the
allocation size for VMXON and VMCS regions (1–4096 bytes). Bits 53:50 give the
memory type the processor expects for those regions; value 6 means write-back
(WB), the value reported by all processors since Nehalem.
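As a rough illustration, the IA32_VMX_BASIC fields named above can be pulled apart with shifts and masks, and the revision identifier stamped into a zeroed region. The helper names `decode_vmx_basic` and `build_vmxon_region` are hypothetical, and the region here is an ordinary byte buffer standing in for a physically contiguous, 4 KiB-aligned page.

```python
import struct


def decode_vmx_basic(value):
    """Split a raw IA32_VMX_BASIC value into the fields discussed in the text."""
    return {
        "revision_id": value & 0x7FFFFFFF,      # bits 30:0
        "region_size": (value >> 32) & 0x1FFF,  # bits 44:32
        "memory_type": (value >> 50) & 0xF,     # bits 53:50, 6 = write-back
    }


def build_vmxon_region(vmx_basic):
    """Return a zeroed 4 KiB buffer with the revision identifier in bytes 0-3."""
    info = decode_vmx_basic(vmx_basic)
    region = bytearray(4096)
    # The mask above already guarantees bit 31 of the first dword is clear.
    struct.pack_into("<I", region, 0, info["revision_id"])
    return region
```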
On success, VMXON transitions the logical processor into VMX root
operation. The current-VMCS pointer is set to FFFFFFFF_FFFFFFFFH (no VMCS
active), INIT signals are blocked, and A20M is disabled. The processor will stay
in this mode — handling guest entry and exit — until VMXOFF returns it to
plain IA-32e operation.
VMX Root and Non-Root: A Mode Orthogonal to Rings
The central insight of VT-x is the distinction between VMX root operation and VMX non-root operation. These two modes exist at a level below the familiar ring hierarchy.
In VMX root operation the VMM executes. The full instruction set is available, including every VMX instruction. In VMX non-root operation the guest executes. Certain operations that would proceed normally in root operation instead cause VM exits — a hardware-managed transfer of control back to the VMM. The crucial detail: there is no software-visible register bit that indicates which mode the CPU is in. A guest OS running at CPL 0 in VMX non-root operation cannot read a flag and discover it is being virtualized. The mode exists only in the processor's internal state machine, which is exactly what makes it sound as an isolation boundary.
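A guest's attempt to execute VMXOFF does not switch the processor back to
non-VMX mode; it causes a VM exit. The same applies to VMXON, VMLAUNCH,
VMRESUME, VMPTRLD, and VMCLEAR. VMREAD and VMWRITE also exit by default; the
one exception is VMCS shadowing, a secondary control that lets the VMM permit
guest VMREAD and VMWRITE against a shadow VMCS for selected fields. Apart from
that VMM-granted case, no VMCS control bit lets a guest use the VMX instruction
set directly.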
The VMCS
Every virtual CPU (vCPU) in a VT-x system is associated with a VMCS —
Virtual Machine Control Structure. The VMCS is a processor-managed data
structure up to 4096 bytes in size (the exact size is read from IA32_VMX_BASIC
bits 44:32). Its internal layout is implementation-specific and never documented
by Intel; the only portable way to read or write a VMCS field is through
VMREAD and VMWRITE, which take a 32-bit field encoding as their operand.
The first eight bytes have a fixed layout. Bytes 0–3 hold the revision
identifier (bits 30:0) and a shadow-VMCS indicator (bit 31): if bit 31 is set,
this is a shadow VMCS used for VMCS shadowing, and VMPTRLD will reject it
unless VMCS shadowing is enabled. Bytes 4–7 are the VMX-abort indicator, written
nonzero by the processor if a VMX abort occurs during a VM exit.
Every VMCS has a launch state — either "clear" (immediately after
VMCLEAR) or "launched" (after a successful VMLAUNCH). This state is internal
to the processor and cannot be read by software; it determines which entry
instruction is legal.
VMCS Field Encodings
Every field is addressed by a 32-bit encoding:
| Bits | Meaning |
|---|---|
| 0 | Access type: 0 = full field, 1 = high 32 bits of a 64-bit field |
| 9:1 | Index within type/width category |
| 11:10 | Type: 0 = control, 1 = VM-exit info, 2 = guest state, 3 = host state |
| 14:13 | Width: 0 = 16-bit, 1 = 64-bit, 2 = 32-bit, 3 = natural-width |
A few encodings from arch/x86/include/asm/vmx.h illustrate the scheme.
PIN_BASED_VM_EXEC_CONTROL is 0x4000: type 0 (control), 32-bit, index 0.
EPT_POINTER is 0x201A: type 0 (control), 64-bit. VM_EXIT_REASON is
0x4402: type 1 (VM-exit info, read-only), 32-bit. GUEST_CR0 is 0x6800:
type 2 (guest state), natural-width. HOST_CR0 is 0x6C00: type 3 (host
state), natural-width.
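The encoding scheme is easy to check mechanically. The sketch below decodes an encoding into its four components; the function name `decode_vmcs_encoding` and the label strings are illustrative choices, not kernel identifiers.

```python
FIELD_TYPES = ("control", "vm-exit info", "guest state", "host state")
FIELD_WIDTHS = ("16-bit", "64-bit", "32-bit", "natural-width")


def decode_vmcs_encoding(enc):
    """Break a VMCS field encoding into access type, index, type, and width."""
    return {
        "high_access": bool(enc & 1),               # bit 0
        "index": (enc >> 1) & 0x1FF,                # bits 9:1
        "type": FIELD_TYPES[(enc >> 10) & 0x3],     # bits 11:10
        "width": FIELD_WIDTHS[(enc >> 13) & 0x3],   # bits 14:13
    }
```

Running it over the five encodings quoted above reproduces the type and width stated for each, which is a quick sanity check when adding a new field constant.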
VMCS Logical Groups
The VMCS groups its fields into six logical areas:
| Group | Content |
|---|---|
| Guest-state area | CR0, CR3, CR4, segment selectors, bases, limits, AR bytes, GDTR, IDTR, RIP, RSP, RFLAGS, DR7, IA32_EFER, IA32_PAT, IA32_DEBUGCTL, activity state, interruptibility state, VMCS link pointer, preemption timer value |
| Host-state area | CR0, CR3, CR4, segment selectors (CS/SS/DS/ES/FS/GS/TR), FS/GS/TR/GDTR/IDTR bases, RIP, RSP, IA32_EFER, IA32_PAT, IA32_SYSENTER_CS/ESP/EIP |
| VM-execution control fields | Pin-based controls, primary/secondary processor-based controls, exception bitmap, I/O bitmaps, MSR bitmaps, CR3-target controls, APIC-access address, EPT pointer (0x201A), VPID (0x0000), preemption timer |
| VM-exit control fields | VM-exit controls, MSR-store/load areas and counts |
| VM-entry control fields | VM-entry controls, MSR-load area, event injection field |
| VM-exit information fields (read-only) | Exit reason (0x4402), exit qualification, guest-linear address, guest-physical address, IDT-vectoring info, instruction info, instruction length |
The guest-state area is what the processor saves on every VM exit and restores
on every VM entry. The host-state area is what the processor loads on every VM
exit to hand control back to the VMM. The VMM is responsible for writing
host-state fields correctly before the first VMLAUNCH; if the processor ever
needs to exit and finds garbage in the host-state area, it will jump to a
garbage instruction pointer.
VMCS Management Instructions
VMCLEAR writes the VMCS to memory, sets its launch state to "clear," and
dissociates it from the logical processor. VMPTRLD makes a VMCS the current
VMCS for the logical processor without changing its launch state. VMPTRST
stores the current-VMCS pointer to a memory location. VMREAD and VMWRITE
read and write individual fields of the current VMCS by encoding. These
instructions belong to VMX root operation; a guest executing any of them causes
a VM exit, except for VMREAD and VMWRITE accesses that the VMM has explicitly
allowed through VMCS shadowing.
KVM's Use of the VMCS
KVM's VMX backend lives in arch/x86/kvm/vmx/vmx.c. The per-vCPU struct is
struct vcpu_vmx, which embeds struct kvm_vcpu. VMCS bookkeeping is tracked
by struct loaded_vmcs in arch/x86/kvm/vmx/vmcs.h. Its bool launched field
is the flag KVM consults to decide between VMLAUNCH and VMRESUME for the
next guest entry. Its struct vmcs_host_state host_state caches the host CR3,
CR4, GS and FS bases, RSP, and segment selectors most recently written to the
VMCS, so KVM can skip the VMWRITE when a value has not changed. VMWRITE is not
architecturally serializing, but it costs noticeably more than an ordinary
store, so the savings add up on a hot entry path.
Nested virtualization (KVM hosting a guest that itself runs VMs) introduces
three VMCS instances per nested vCPU. vmcs01 is what KVM builds for the L1
guest hypervisor during normal non-nested operation. vmcs12 is the VMCS the L1
hypervisor constructs for its L2 nested guest, represented in KVM as struct
vmcs12. vmcs02 is the VMCS KVM actually executes L2 with — it merges the
policies from vmcs01 and vmcs12 so that neither L1 nor KVM can bypass the
other's intercept controls.
VM Entry: VMLAUNCH and VMRESUME
VMLAUNCH (opcode 0F 01 C2) performs the first entry into a VMCS. The
current VMCS must be in "clear" launch state; on success the processor
transitions it to "launched." VMRESUME (opcode 0F 01 C3) performs every
subsequent entry and requires the VMCS to already be in "launched" state. Using
the wrong instruction — VMLAUNCH on a launched VMCS or VMRESUME on a clear
one — produces a VMfailValid with the appropriate error code in the VM-instruction
error field.
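The launch-state rule amounts to a two-state machine. The toy model below is only a sketch of that rule: the class and function names are invented for illustration, the real state lives inside the processor, and the error numbers 4 and 5 are the VM-instruction error codes the Intel SDM assigns to these two mistakes.

```python
class VmcsLaunchModel:
    """Toy model of the VMCS launch state; not how hardware stores it."""

    def __init__(self):
        self.launched = False              # state right after VMCLEAR

    def vmclear(self):
        self.launched = False

    def enter(self, instruction):
        if instruction == "VMLAUNCH":
            if self.launched:
                return "VMfailValid(4)"    # VMLAUNCH with non-clear VMCS
            self.launched = True
            return "ok"
        if instruction == "VMRESUME":
            if not self.launched:
                return "VMfailValid(5)"    # VMRESUME with non-launched VMCS
            return "ok"
        raise ValueError(instruction)


def pick_entry_instruction(launched):
    """Mirror of the decision a VMM makes from its own 'launched' flag."""
    return "VMRESUME" if launched else "VMLAUNCH"
```

Because software cannot read the launch state back, a VMM has to track it in a shadow flag of its own, which is the role `pick_entry_instruction` plays here.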
Both instructions require VMX root operation at CPL 0, CR0.PE = 1, and
RFLAGS.VM = 0. They also require no MOV-SS or POP-SS blocking to be active.
VM entry proceeds through three check phases. Phase 1 validates VMX controls and
the host-state area; failure causes VMfailValid and leaves guest state
unchanged. Phase 2 validates the guest-state area and PDPTRs; failure causes the
processor to load the host state and transfer to the host RIP, with bit 31 of
the exit-reason field set to indicate a VM-entry failure rather than a true exit.
Phase 3 validates the MSR-load area; failure also loads host state. Only after
all three phases pass does the processor commit to VMX non-root operation and
begin executing guest code.
VM Exits
A VM exit occurs when the guest executes an operation the VMCS execution-control
fields have marked for interception, or when the processor encounters a condition
that mandates host intervention regardless of control settings, such as a triple
fault or an INIT signal. External interrupts and NMIs cause exits only when the
external-interrupt exiting and NMI exiting pin-based controls are set, which is
the usual configuration. The processor saves guest state into the VMCS
guest-state area, loads host state from the VMCS host-state area, and jumps to
the address in the VMCS host RIP field.
The VM-exit reason field (0x4402, 32-bit, read-only) describes what
happened. Bits 15:0 carry the basic exit reason. Bit 31 distinguishes a true VM
exit (0) from a VM-entry failure that loaded host state (1). Selected basic
reasons:
| Code | Reason |
|---|---|
| 0 | Exception or NMI |
| 1 | External interrupt |
| 2 | Triple fault |
| 10 | CPUID |
| 12 | HLT |
| 18 | VMCALL (hypercall) |
| 28 | CR access |
| 30 | I/O instruction |
| 31 | MSR read |
| 32 | MSR write |
| 48 | EPT violation |
| 49 | EPT misconfiguration |
| 52 | VMX-preemption timer expired |
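A handler typically starts by splitting the raw field into the basic reason and the entry-failure flag. The following sketch shows that split; `decode_exit_reason` and the dictionary are illustrative, and the dictionary holds only the reasons listed in the table.

```python
BASIC_EXIT_REASONS = {
    0: "exception or NMI", 1: "external interrupt", 2: "triple fault",
    10: "CPUID", 12: "HLT", 18: "VMCALL", 28: "CR access",
    30: "I/O instruction", 31: "MSR read", 32: "MSR write",
    48: "EPT violation", 49: "EPT misconfiguration",
    52: "VMX-preemption timer expired",
}


def decode_exit_reason(raw):
    """Return (basic reason, name, entry_failure) for a raw exit-reason value."""
    basic = raw & 0xFFFF                      # bits 15:0
    entry_failure = bool(raw & (1 << 31))     # bit 31
    return basic, BASIC_EXIT_REASONS.get(basic, "unlisted"), entry_failure
```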
Exit reasons 48 and 49 deserve a note. An EPT violation fires when the Extended
Page Tables have no mapping for a guest-physical address or the mapping does not
permit the attempted access. An EPT misconfiguration fires when the walk
encounters an entry whose bits are reserved or form an unsupported combination.
KVM handles both in its memory-fault path, which either populates the mapping or
reflects the access to user space as a KVM_EXIT_MMIO exit from KVM_RUN.
Chapter 5 covers EPT in detail.
Execution Controls and the MSR Bitmap
The most important tool for tuning VM-exit overhead is the MSR bitmap. When
the primary processor-based VM-execution control "Use MSR bitmaps" (field
0x4002, bit 28) is set, the processor checks the MSR bitmap before deciding
whether an RDMSR or WRMSR causes an exit. The bitmap is a 4 KiB page: four
1 KiB regions cover MSR reads in 0x00000000–0x00001FFF, MSR reads in
0xC0000000–0xC0001FFF, and the corresponding write halves. A set bit means
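intercept; a clear bit means pass through to the guest. KVM clears the bitmap
bits for performance-critical MSRs, for example reads of IA32_TSC when TSC
offsetting (field 0x4002, bit 3) is active, so that those accesses run without
an exit. RDTSC itself is not an MSR access; it is governed by the separate RDTSC
exiting control, so an exit-free clock_gettime depends on that control being
clear.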
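Locating the bit for a given MSR is simple arithmetic over the four 1 KiB regions. The sketch below assumes the region order read-low, read-high, write-low, write-high at offsets 0x000, 0x400, 0x800, and 0xC00; the helper names are illustrative.

```python
def msr_bitmap_position(msr, write):
    """Return (byte_offset, bit) inside the 4 KiB bitmap, or None if uncovered."""
    if 0x00000000 <= msr <= 0x00001FFF:
        base, index = 0x000, msr
    elif 0xC0000000 <= msr <= 0xC0001FFF:
        base, index = 0x400, msr - 0xC0000000
    else:
        return None            # outside both ranges: the access always exits
    if write:
        base += 0x800          # write halves follow the two read regions
    return base + index // 8, index % 8


def pass_through(bitmap, msr, write):
    """Clear the intercept bit so the guest access no longer exits."""
    pos = msr_bitmap_position(msr, write)
    if pos is None:
        raise ValueError("MSR outside the ranges the bitmap covers")
    byte, bit = pos
    bitmap[byte] &= ~(1 << bit) & 0xFF
```

Starting from an all-ones page (intercept everything) and clearing bits selectively is the conservative default: an MSR the VMM forgot about still exits.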
The VMX-preemption timer (pin-based control 0x4000 bit 6) provides a
deadline mechanism: a 32-bit counter in the VMCS guest-state area decrements
proportionally to the TSC (at a rate of one decrement per TSC bit-X transition,
where X is read from IA32_VMX_MISC). When the counter reaches zero, a VM exit
fires with reason 52. KVM uses the timer mainly to emulate the guest's local
APIC timer deadline without arming a separate host timer.
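Converting a TSC interval into a timer value is a right shift by X. In this sketch X is taken from bits 4:0 of IA32_VMX_MISC, and the result is clamped to the 32-bit counter width; the function name is an illustrative choice.

```python
def preemption_timer_value(tsc_delta, vmx_misc):
    """Convert a TSC interval into a 32-bit VMX-preemption timer count."""
    shift = vmx_misc & 0x1F                  # IA32_VMX_MISC bits 4:0
    return min(tsc_delta >> shift, 0xFFFFFFFF)
```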
Event Injection
To deliver an interrupt or exception to a guest on the next VM entry, the VMM
writes the VM-entry interruption-information field in the VMCS. The 32-bit
layout: bits 7:0 are the vector; bits 10:8 are the type (0 = external interrupt,
2 = NMI, 3 = hardware exception, 4 = software interrupt, 5 = privileged software
exception, 6 = software exception); bit 11 enables error-code delivery; bit 31
marks the field valid. Setting bit 31 to 0 suppresses injection. On the next VM
entry, the processor delivers the event through the guest's IDT as if it had
arrived naturally, before the guest executes its next instruction.
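Composing the field is a matter of placing four values at fixed positions. The helper below is a minimal sketch with an illustrative name; it builds only the 32-bit interruption-information word and leaves the separate error-code field to the caller.

```python
def vm_entry_intr_info(vector, event_type, deliver_error_code=False):
    """Build a VM-entry interruption-information value with the valid bit set."""
    if not 0 <= vector <= 0xFF:
        raise ValueError("vector must fit in 8 bits")
    if not 0 <= event_type <= 7:
        raise ValueError("type must fit in 3 bits")
    value = vector | (event_type << 8) | (1 << 31)   # bits 7:0, 10:8, 31
    if deliver_error_code:
        value |= 1 << 11                             # bit 11
    return value
```

For example, injecting #GP (vector 13, a hardware exception that carries an error code) yields 0x80000B0D, while an external interrupt on vector 0x20 yields 0x80000020.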
AMD-V (SVM)
Intel shipped the first VT-x–capable processors in November 2005; AMD followed in May 2006 with the Athlon 64 Orleans and Windsor desktop processors, and added Nested Page Tables in the third-generation Opteron "Barcelona" (Family 0x10) in 2007. The architecture is called SVM — Secure Virtual Machine — and while it achieves the same isolation goals as VT-x, it makes different tradeoffs that show up clearly in KVM's two backends.
Detecting and Enabling SVM
SVM availability is signaled by CPUID leaf 0x80000001, ECX bit 2 = 1. The
feature capability leaf 0x8000000A gives further detail: EAX returns the SVM
revision number, EBX returns the NASID (number of available ASIDs), and EDX
carries individual feature bits. The features that matter most in practice are
bit 0 (Nested Page Tables), bit 3 (NRIP Save — next sequential RIP recorded in
the VMCB on exit, sparing the hypervisor from decoding the faulting instruction),
bit 5 (VMCB Clean Bits — allows the CPU to cache VMCB fields across consecutive
VMRUN calls), and bit 13 (AVIC — Advanced Virtual Interrupt Controller,
hardware-accelerated interrupt delivery).
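Given the EDX value from leaf 0x8000000A, listing the supported features is a bit test per entry. The sketch below covers only the four bits named above; the table and function names are illustrative, and obtaining EDX itself requires executing CPUID, which is outside this snippet.

```python
SVM_FEATURE_BITS = {
    0: "NPT",
    3: "NRIP save",
    5: "VMCB clean bits",
    13: "AVIC",
}


def decode_svm_features(edx):
    """Return the names of the features from the text that EDX advertises."""
    return [name for bit, name in SVM_FEATURE_BITS.items() if edx & (1 << bit)]
```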
Enabling SVM requires setting EFER.SVME (bit 12 of MSR 0xC0000080). Before
doing so, the VMM reads the VM_CR MSR (MSR_VM_CR, 0xC0010114) and checks bit 4
(SVMDIS). If SVMDIS is 1, EFER.SVME cannot be set. If the LOCK bit (bit 3) is
also set, firmware has locked SVM off and software cannot clear SVMDIS until
the next reset, short of supplying an SVM-lock key. Finally, the VMM
allocates a 4 KiB-aligned host save area and writes its physical address to
MSR_VM_HSAVE_PA at 0xC0010117. This page is where the CPU will save host
state on every VMRUN.
The VMCB
AMD's counterpart to the VMCS is the VMCB — Virtual Machine Control Block —
a 4 KiB page whose physical address the hypervisor passes in RAX to the
VMRUN instruction. Unlike the VMCS, the VMCB is a plain memory-mapped
structure with documented field offsets. The hypervisor reads and writes it with
ordinary load and store instructions after mapping it into the host virtual
address space. That accessibility makes VMCB manipulation cheaper than VMCS
manipulation (no VMREAD or VMWRITE instruction per field), but it means KVM must
track which fields it has modified and keep the VMCB clean bits accurate.
The VMCB splits into two halves. The control area (bytes 0x000–0x3FF,
1024 bytes) holds everything the CPU consults before guest entry: intercept
vectors, TSC offset, IOPM and MSRPM pointers, ASID, TLB controls, interrupt
controls, event injection, the NPT root pointer, and VMCB clean bits. The
state save area (starting at byte 0x400) holds the full architectural
state of the vCPU. The layout, from arch/x86/include/asm/svm.h, places segment
registers and descriptors at 0x400, EFER at 0x4D0, CR4 at 0x548, CR3 at
0x550, CR0 at 0x558, RFLAGS at 0x570, RIP at 0x578, RSP at 0x5D8,
RAX at 0x5F8, the SYSCALL MSRs (STAR=0x600, LSTAR=0x608, CSTAR=0x610,
SFMASK=0x618, KERNEL_GS_BASE=0x620), g_PAT at 0x668, and SPEC_CTRL at
0x6E0. The struct is 744 bytes in total. CET state fields (s_cet, ssp,
isst_addr) occupy the 24 bytes between RSP and RAX, a gap that earlier versions
of the header declared as reserved padding.
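Because the VMCB is plain memory, reading a saved register is an ordinary little-endian load at a known offset. The sketch below models the page as a byte buffer and uses a subset of the offsets listed above; the dictionary and helper names are illustrative, and every field shown is treated as 64 bits wide.

```python
import struct

# Byte offsets from the start of the VMCB page, as listed in the text.
VMCB_SAVE_OFFSETS = {
    "efer": 0x4D0, "cr4": 0x548, "cr3": 0x550, "cr0": 0x558,
    "rflags": 0x570, "rip": 0x578, "rsp": 0x5D8, "rax": 0x5F8,
}


def read_save_field(vmcb, name):
    """Load one 64-bit state-save field from a VMCB image."""
    return struct.unpack_from("<Q", vmcb, VMCB_SAVE_OFFSETS[name])[0]


def write_save_field(vmcb, name, value):
    """Store one 64-bit state-save field into a VMCB image."""
    struct.pack_into("<Q", vmcb, VMCB_SAVE_OFFSETS[name], value)
```

The contrast with VT-x is that nothing here needs a special instruction: any code with a mapping of the page can inspect guest state directly.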
Intercept Controls
At VMCB offset 0x000, six consecutive 32-bit words (192 bits total) form the
intercept bitmap. Each bit controls whether a specific guest operation triggers
#VMEXIT. KVM manipulates these through vmcb_set_intercept() and
vmcb_clr_intercept() from arch/x86/kvm/svm/svm.h.
Selected flat bit indices: 96 (external interrupt), 97 (NMI), 114 (CPUID), 120
(HLT), 123 (IOIO: I/O port access), 124 (MSR_PROT: MSR access controlled by
the MSRPM), 127 (SHUTDOWN: triple fault), 128 (VMRUN, which hardware requires
to be intercepted; VMRUN fails its consistency checks if the bit is clear), 129
(VMMCALL: hypercall), and 141 (XSETBV).
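A flat bit index maps onto the six words by integer division. The helpers below are a sketch of that mapping over a Python list standing in for the six 32-bit words; they are not the kernel's functions, and their names are illustrative.

```python
def intercept_position(bit):
    """Return (word index, bit within word) for a flat intercept index."""
    return divmod(bit, 32)


def set_intercept(words, bit):
    word, offset = intercept_position(bit)
    words[word] |= 1 << offset


def clr_intercept(words, bit):
    word, offset = intercept_position(bit)
    words[word] &= ~(1 << offset) & 0xFFFFFFFF
```

Index 120 (HLT) therefore lands in word 3 at bit 24, and index 141 (XSETBV) in word 4 at bit 13.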
The I/O and MSR permission maps work analogously to the VT-x bitmaps. The
MSRPM is 8 KiB at 4 KiB alignment. Its first three 2 KiB regions cover MSR
ranges 0x00000000–0x00001FFF, 0xC0000000–0xC0001FFF, and
0xC0010000–0xC0011FFF, with 2 bits per MSR (read intercept and write
intercept); the fourth 2 KiB region is reserved. The IOPM is 12 KiB with one
bit per I/O port. KVM programs both during vCPU creation and updates them as
the virtual device model grows.
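With two bits per MSR, the position of an intercept bit follows from the range base and the MSR's index within the range. This sketch assumes the low bit of each pair is the read intercept and the high bit the write intercept; the table and function names are illustrative.

```python
# (first MSR in range, byte offset of that range's 2 KiB region)
MSRPM_RANGES = (
    (0x00000000, 0x0000),
    (0xC0000000, 0x0800),
    (0xC0010000, 0x1000),
)


def msrpm_position(msr, write):
    """Return (byte_offset, bit) in the MSRPM, or None if the MSR is uncovered."""
    for first, region in MSRPM_RANGES:
        if first <= msr <= first + 0x1FFF:
            bit = (msr - first) * 2 + (1 if write else 0)
            return region + bit // 8, bit % 8
    return None
```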
VMCB Clean Bits
On CPUs where CPUID 0x8000000A EDX bit 5 is set, the processor can cache VMCB
field groups across consecutive VMRUN calls. The clean bits field at VMCB
offset 0x0C0 acts as a validity bitmap: when a bit is set, the CPU is permitted
to use its cached copy instead of re-reading from memory. The hypervisor must
clear any bit whose corresponding fields it has modified since the last VMRUN.
Selected bits from arch/x86/kvm/svm/svm.h:
| Bit | Covers |
|---|---|
| VMCB_INTERCEPTS (0) | Intercept vectors, TSC offset, pause filter |
| VMCB_PERM_MAP (1) | IOPM and MSRPM base addresses |
| VMCB_ASID (2) | ASID |
| VMCB_INTR (3) | Interrupt control fields |
| VMCB_NPT (4) | NPT enable, nested_cr3, g_PAT |
| VMCB_CR (5) | CR0, CR3, CR4, EFER |
| VMCB_SEG (8) | CS, DS, SS, ES, CPL |
KVM clears the appropriate bits whenever it modifies a field group, and sets
them all at the end of a successful #VMEXIT handler so the next VMRUN can
take maximum advantage of caching. On a busy system where most exits are IOIO or
MSR faults with no CR changes, bits 0, 1, 3, 4, 5, and 8 can survive across
hundreds of consecutive entries, materially reducing the cost of each VMRUN.
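The bookkeeping discipline can be sketched as a small class: start with everything dirty, mark all groups clean after an exit, and clear a bit whenever the matching fields are touched. The class name and `ALL_CLEAN` constant are illustrative, and only the bits listed in the table are modelled.

```python
VMCB_INTERCEPTS, VMCB_PERM_MAP, VMCB_ASID, VMCB_INTR, VMCB_NPT, VMCB_CR = range(6)
VMCB_SEG = 8

ALL_CLEAN = sum(1 << bit for bit in (
    VMCB_INTERCEPTS, VMCB_PERM_MAP, VMCB_ASID, VMCB_INTR,
    VMCB_NPT, VMCB_CR, VMCB_SEG,
))


class CleanBits:
    """Tracks which VMCB field groups the CPU may take from its cache."""

    def __init__(self):
        self.value = 0               # nothing is cached before the first VMRUN

    def mark_all_clean(self):
        self.value = ALL_CLEAN       # after an exit, memory and cache agree

    def mark_dirty(self, bit):
        self.value &= ~(1 << bit)    # fields changed: force a re-read
```

The failure mode to avoid is a modified field with its clean bit still set, since the CPU would then be free to run the guest with the stale cached value.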
VMRUN and #VMEXIT
VMRUN rAX (opcode 0F 01 D8) is the SVM instruction that corresponds to both
VMLAUNCH and VMRESUME combined. There is no separate "first entry"
instruction: the CPU loads guest state from the VMCB, applies control fields,
and begins executing guest code. Host state is saved to the MSR_VM_HSAVE_PA
page automatically; the saved fields include the CS and SS selectors, RIP, RSP,
RFLAGS, RAX, CR0, CR3, CR4, EFER, IDTR, and GDTR.
On #VMEXIT, the processor writes guest state back into the VMCB state save
area, records the exit reason at VMCB offset 0x070 (exit_code), stores
additional qualification at 0x078 (exit_info_1) and 0x080 (exit_info_2),
restores host state from MSR_VM_HSAVE_PA, and resumes the host at the
instruction following VMRUN. There is no separate handler address; the
hypervisor's exit handling is simply the code that comes after VMRUN.
Not everything goes through VMRUN's automatic save mechanism. VMSAVE rAX
(opcode 0F 01 DB) and VMLOAD rAX (opcode 0F 01 DA) handle extended state:
FS/GS base, LDTR, TR, STAR, LSTAR, CSTAR, SFMASK, KernelGsBase, and the
SYSENTER MSRs. KVM uses VMSAVE to capture the host's copy of this state before
entry and VMLOAD to restore it after exit, with the mirror-image pair applied to
the guest's VMCB. Skipping this step would leave guest values such as the FS
base live in the host after #VMEXIT (or host values live in the guest), a state
leak that no software check would catch.
If the NRIP feature is present (CPUID 0x8000000A EDX bit 3), the processor
records the address of the next sequential instruction at VMCB offset 0x0C8
(next_rip) on exits caused by instruction intercepts. KVM uses this to skip the
intercepted instruction when emulating I/O port accesses and similar traps,
avoiding a full instruction decode.
Selected #VMEXIT exit codes from arch/x86/include/uapi/asm/svm.h:
| Constant | Value | Meaning |
|---|---|---|
| SVM_EXIT_EXCP_BASE | 0x040 | Exception base (vectors 0–31 at 0x040–0x05F) |
| SVM_EXIT_INTR | 0x060 | External interrupt |
| SVM_EXIT_NMI | 0x061 | NMI |
| SVM_EXIT_VINTR | 0x064 | Virtual interrupt window open |
| SVM_EXIT_CPUID | 0x072 | CPUID |
| SVM_EXIT_HLT | 0x078 | HLT |
| SVM_EXIT_IOIO | 0x07B | I/O port access |
| SVM_EXIT_MSR | 0x07C | MSR access |
| SVM_EXIT_SHUTDOWN | 0x07F | Triple fault |
| SVM_EXIT_VMRUN | 0x080 | Guest executed VMRUN |
| SVM_EXIT_VMMCALL | 0x081 | Hypercall |
| SVM_EXIT_NPF | 0x400 | Nested page fault |
| SVM_EXIT_VMGEXIT | 0x403 | SEV-ES VMGEXIT |
SVM_EXIT_NPF at 0x400 is AMD's nested-page-fault exit — the equivalent of
Intel's EPT violation (exit reason 48). exit_info_1 carries the fault-error
bits and exit_info_2 carries the guest-physical address that faulted.
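A first-level dispatcher over exit_code can be sketched in a few lines. The function below is illustrative rather than KVM's handler table: it recovers the exception vector from the 0x040 base, returns the faulting guest-physical address for a nested page fault, and names a handful of other codes from the table.

```python
SVM_EXIT_NAMES = {
    0x060: "INTR", 0x061: "NMI", 0x064: "VINTR", 0x072: "CPUID",
    0x078: "HLT", 0x07B: "IOIO", 0x07C: "MSR", 0x07F: "SHUTDOWN",
    0x080: "VMRUN", 0x081: "VMMCALL", 0x403: "VMGEXIT",
}


def classify_svm_exit(exit_code, exit_info_1=0, exit_info_2=0):
    """Return (kind, detail) for a #VMEXIT; detail depends on the kind."""
    if 0x040 <= exit_code <= 0x05F:
        return "exception", exit_code - 0x040      # detail: vector number
    if exit_code == 0x400:
        return "nested page fault", exit_info_2    # detail: guest-physical address
    return SVM_EXIT_NAMES.get(exit_code, "unlisted"), None
```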
ASIDs and TLB Management
SVM uses ASIDs (Address Space Identifiers) to tag TLB entries per guest,
preventing cross-VM TLB pollution and avoiding full flushes on every VMRUN.
ASID 0 is reserved for the host. The maximum ASID is CPUID(0x8000000A).EBX - 1
(NASID minus one). The ASID is set in the VMCB control area at offset 0x058
and the TLB flush mode is set at 0x05C (tlb_ctl):
| Value | Meaning |
|---|---|
| 0 | TLB_CONTROL_DO_NOTHING: reuse existing TLB entries |
| 1 | TLB_CONTROL_FLUSH_ALL_ASID: flush the entire TLB, all ASIDs included |
| 3 | TLB_CONTROL_FLUSH_ASID: flush all entries tagged with this guest's ASID |
| 7 | TLB_CONTROL_FLUSH_ASID_LOCAL: flush only the non-global entries tagged with this guest's ASID |
When the per-CPU ASID counter exhausts the pool (next_asid > max_asid), KVM
increments its generation counter and writes TLB_CONTROL_FLUSH_ALL_ASID to
force a clean slate on the next VMRUN. Intel's analogue is the VPID (16-bit,
VMCS field 0x0000) and the INVVPID instruction.
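The generation scheme can be modelled as a counter that wraps. The class below is a simplified sketch, not KVM's code: the name `AsidPool` is invented, the pool always restarts at ASID 1 (a real hypervisor may reserve a lower range for other uses), and the per-vCPU generation check is left to the caller.

```python
TLB_CONTROL_DO_NOTHING = 0
TLB_CONTROL_FLUSH_ALL_ASID = 1


class AsidPool:
    """Per-CPU ASID allocator with a generation counter."""

    def __init__(self, nasid):
        self.max_asid = nasid - 1     # NASID minus one
        self.next_asid = 1            # ASID 0 is reserved for the host
        self.generation = 1

    def allocate(self):
        """Return (asid, generation, tlb_ctl) for the next VMRUN."""
        tlb_ctl = TLB_CONTROL_DO_NOTHING
        if self.next_asid > self.max_asid:
            # Pool exhausted: start a new generation and flush everything,
            # since recycled ASIDs may still have stale TLB entries.
            self.generation += 1
            self.next_asid = 1
            tlb_ctl = TLB_CONTROL_FLUSH_ALL_ASID
        asid = self.next_asid
        self.next_asid += 1
        return asid, self.generation, tlb_ctl
```

A vCPU whose stored generation no longer matches the pool's must request a fresh ASID before its next entry, which is what makes recycling safe.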
Event Injection
Event injection on AMD uses event_inj (VMCB offset 0x0A8, 32-bit). The
layout is structurally identical to VT-x: bits 7:0 are the vector, bits 10:8
are the type (0 = hardware interrupt, 2 = NMI, 3 = exception, 4 = software
interrupt), bit 11 is the error-code-valid flag, bit 31 marks the field valid.
When bit 31 is set on entry into VMRUN, the processor delivers the event to
the guest before executing its first instruction.
The interrupt-window mechanism differs between the two architectures. VT-x uses
a dedicated primary processor-based control (field 0x4002, bit 2,
"interrupt-window exiting") that causes an immediate VM exit when the guest
reaches an interruptible state (RFLAGS.IF = 1, no blocking). AMD uses the
V_IRQ bit in the int_ctl field (VMCB offset 0x060), which signals that a
virtual interrupt is pending; with the VINTR intercept set, the processor fires
SVM_EXIT_VINTR (0x064) once the guest is able to take that interrupt. The end
result is the same: the VMM gets a callback at the first moment it is safe to
inject.
Nested Page Tables
AMD's second-level address translation is called NPT (Nested Page Tables),
also marketed as RVI (Rapid Virtualization Indexing). It was introduced with the
"Barcelona" Family 0x10 Opteron. Enabling NPT requires setting the SVM_MISC_ENABLE_NP
bit in the VMCB misc_ctl field at offset 0x090, and writing the host-physical
address of the nested page-table root into nested_cr3 at offset 0x0B0. KVM
additionally clears the INTERCEPT_INVLPG bit and removes PF_VECTOR from the
exception intercept bitmap when NPT is active — with NPT, page faults inside the
guest no longer need to exit, because the hardware resolves guest-physical to
host-physical without software involvement.
One documented asymmetry between AMD NPT and Intel EPT: AMD NPT does not support
execute-only mappings. NPT entries use the ordinary x86 page-table format, in
which every present page is readable, so an executable page is always readable
too. Intel EPT permits execute-only pages (XWR bits 0b100) on processors that
advertise the capability in IA32_VMX_EPT_VPID_CAP. This asymmetry surfaces in
KVM's NPT entry construction and in any hypervisor that tries to use
execute-only guard pages for shadow-stack hardening.
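The asymmetry can be stated as a predicate over a requested permission set. The two helpers below are a simplified sketch with invented names: the EPT side just packs the XWR bits (R in bit 0, W in bit 1, X in bit 2), and the NPT side encodes the rule that any present mapping is readable.

```python
def ept_permission_bits(read, write, execute):
    """Pack a permission set into EPT's low three bits (X W R from bit 2 down)."""
    return (1 if read else 0) | (2 if write else 0) | (4 if execute else 0)


def npt_can_express(read, write, execute):
    """True if an x86-format NPT entry can grant exactly this permission set."""
    present = read or write or execute
    # A present NPT entry is always readable, so any mapping that grants
    # something must grant read as well.
    return (not present) or read
```

A hypervisor that wants one policy layer for both vendors has to fall back to read-plus-execute on AMD wherever it would use execute-only on Intel.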
VT-x and SVM Side by Side
The two architectures solve the same problem with the same primitives but make opposite tradeoffs on structure access:
| Aspect | Intel VT-x | AMD SVM |
|---|---|---|
| Control block | VMCS (opaque, VMREAD/VMWRITE) | VMCB (4 KiB memory-mapped struct) |
| Entry instruction | VMLAUNCH / VMRESUME | VMRUN rAX |
| Exit reason location | VMCS VM_EXIT_REASON (0x4402) | VMCB exit_code offset 0x070 |
| Host state save | VMCS host-state area | MSR_VM_HSAVE_PA physical page |
| TLB tagging | VPID (16-bit, VMCS 0x0000) | ASID (32-bit, VMCB 0x058) |
| SLAT | EPT (execute-only pages supported) | NPT (execute-only not supported) |
| Interrupt window | Primary control bit 2 | V_IRQ in int_ctl → SVM_EXIT_VINTR |
| Hypercall | VMCALL | VMMCALL |
| Next-RIP on exit | VM-exit instruction-length VMCS field | next_rip at VMCB 0x0C8 (NRIP feature) |
The structural difference matters to KVM's two backends (arch/x86/kvm/vmx/vmx.c
and arch/x86/kvm/svm/svm.c), which share a common struct kvm_vcpu core but
diverge entirely on how they program and read control state. From the guest's
perspective — the one running at CPL 0 in non-root operation — the difference is
invisible.
ARM Virtualization Extensions
Exception Levels
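AArch64 organizes privilege into four exception levels:
| Level | Typical occupant |
|---|---|
| EL0 | User-space applications |
| EL1 | OS kernel (host or guest) |
| EL2 | Hypervisor |
| EL3 | Secure monitor firmware |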
EL0 and EL1 are mandatory on every AArch64 implementation. EL2 and EL3 are
optional; hardware that omits EL2 provides no virtualization extensions and
cannot run KVM. Code at EL0 can access only the handful of system registers
architected for EL0 use, such as TPIDR_EL0 and the counter registers that EL1
chooses to expose. Code at EL1 can access EL1 system registers but not EL2 or
EL3 registers. EL2 can access EL1 registers (to save and restore guest context)
and has its own set of hypervisor registers, including HCR_EL2, VTCR_EL2,
and VTTBR_EL2, that EL1 cannot read or write.
This is not the x86 root/non-root distinction. There is no separate mode bit layered below the ring hierarchy. The EL hierarchy is the isolation boundary. A guest OS at EL1 has no usable encoding for EL2 registers: an attempted access is UNDEFINED at EL1 and raises an exception instead of executing (with the nested-virtualization extension, FEAT_NV, the hypervisor can have such accesses trap to EL2).
Exceptions in AArch64 can only move to the same level or a higher level on
entry. ERET can only return to the same or a lower level. A guest kernel at
EL1 issuing an ERET cannot route execution to EL2; it can only land in EL1 or
EL0. The hardware exception model makes EL2 a strict parent of EL1, not a peer
reachable from below.
HCR_EL2
HCR_EL2, the Hypervisor Configuration Register, is a 64-bit system
register at EL2 that governs what EL1 and EL0 can do. It is the primary control
surface for a KVM vCPU, roughly analogous to the VMCS execution-control fields
for VT-x.
The most important bit for basic virtualization is bit 0, VM: when set, the CPU enables stage-2 address translation for the EL1&0 regime, translating Intermediate Physical Addresses (what the guest calls physical memory) to Host Physical Addresses (the real DRAM locations). Clearing bit 0 makes the guest's physical addresses resolve directly to host-physical — appropriate only during early bring-up when the hypervisor is not yet protecting guest memory.
Bits 3 (FMO), 4 (IMO), and 5 (AMO) route physical FIQ, IRQ, and SError exceptions to EL2, preventing the guest from seeing raw hardware interrupts and allowing the hypervisor to inject virtual IRQs through the GIC instead.
Bit 13 (TWI) traps WFI (Wait For Interrupt) from EL0 and EL1 to EL2.
This is the ARM equivalent of VT-x's HLT exiting (exit reason 12). When a guest
vCPU executes WFI to idle, KVM gets control and can schedule another vCPU or
yield the physical core.
Bit 18 (TID3) traps EL1 reads of the group-3 ID registers — the registers
advertising CPU features, implementation options, and ISA revisions — to EL2.
KVM intercepts these reads and returns synthesized values, which is the
mechanism behind Firecracker's V1N1 static CPU template on ARM: a host
running on an AWS Graviton3 (Neoverse V1 microarchitecture) presents itself to
the guest as Neoverse N1, improving migration portability across instance types.
Firecracker's V1N1 template requires host KVM capabilities
KVM_CAP_ARM_PTRAUTH_ADDRESS (171) and KVM_CAP_ARM_PTRAUTH_GENERIC (172) so
it can safely expose or suppress pointer-authentication features.
Bit 31 (RW) sets the execution state for EL1: 1 means AArch64, 0 means AArch32. Every 64-bit Linux guest needs this set to 1.
Bit 34 (E2H) enables Virtualization Host Extensions. Bit 46 (FWB, added in ARMv8.4) allows stage-2 attributes to directly override stage-1 cacheability, giving the hypervisor direct control over guest memory type without the guest being able to influence it.
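Putting the bits together, a baseline HCR_EL2 value for a 64-bit guest can be composed and checked as follows. This is a sketch: the constant names echo the field names but are defined locally, and the particular combination returned by `guest_hcr` is an illustrative choice, not KVM's exact guest configuration.

```python
HCR_VM = 1 << 0      # stage-2 translation enable
HCR_FMO = 1 << 3     # route physical FIQ to EL2
HCR_IMO = 1 << 4     # route physical IRQ to EL2
HCR_AMO = 1 << 5     # route SError to EL2
HCR_TWI = 1 << 13    # trap WFI
HCR_TID3 = 1 << 18   # trap group-3 ID register reads
HCR_RW = 1 << 31     # EL1 is AArch64
HCR_E2H = 1 << 34    # Virtualization Host Extensions
HCR_FWB = 1 << 46    # stage-2 forced write-back

HCR_NAMES = {
    HCR_VM: "VM", HCR_FMO: "FMO", HCR_IMO: "IMO", HCR_AMO: "AMO",
    HCR_TWI: "TWI", HCR_TID3: "TID3", HCR_RW: "RW",
    HCR_E2H: "E2H", HCR_FWB: "FWB",
}


def guest_hcr():
    """One plausible baseline for a 64-bit guest (illustrative only)."""
    return HCR_VM | HCR_FMO | HCR_IMO | HCR_AMO | HCR_TWI | HCR_TID3 | HCR_RW


def describe_hcr(value):
    """List the names of the known bits set in an HCR_EL2 value."""
    return [name for bit, name in HCR_NAMES.items() if value & bit]
```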
Stage-2 Address Translation
When HCR_EL2.VM = 1, every memory access from EL1 or EL0 goes through two
independent MMU walks. The guest OS programs its own page tables as always,
translating guest-virtual addresses to what it believes are physical addresses —
ARM calls these Intermediate Physical Addresses (IPA). The hardware then
performs a second walk, controlled by the hypervisor, translating each IPA to a
Host Physical Address (HPA).
VTCR_EL2 (Virtualization Translation Control Register) configures the
stage-2 walk: T0SZ (bits 5:0) defines the IPA input range as 2^(64-T0SZ)
bytes; SL0 (bits 7:6) sets the starting lookup level; TG0 (bits 15:14)
selects the translation granule (4 KB = 0b00, 64 KB = 0b01, 16 KB =
0b10); PS (bits 18:16) sets the output address size.
VTTBR_EL2 carries two things: the VMID in bits [63:48] (for 16-bit VMIDs,
ARMv8.1+) or [55:48] (for 8-bit VMIDs, ARMv8.0), and the stage-2 page-table
base address in bits [47:1]. The VMID tags TLB entries per VM exactly as ASID
does for AMD or VPID does for Intel: world-switching between two VMs does not
require flushing the TLB so long as each VM has a distinct VMID.
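The field layouts translate directly into shift-and-mask helpers. The sketch below composes only the fields named above and ignores every other bit of both registers (including reserved bits that real code must set correctly); the function names are illustrative.

```python
def ipa_bits(t0sz):
    """Width of the IPA space selected by T0SZ: the input range is 2^(64 - T0SZ)."""
    return 64 - t0sz


def make_vtcr(t0sz, sl0, tg0, ps):
    """Pack T0SZ, SL0, TG0, and PS into their VTCR_EL2 positions."""
    return ((t0sz & 0x3F)            # bits 5:0
            | ((sl0 & 0x3) << 6)     # bits 7:6
            | ((tg0 & 0x3) << 14)    # bits 15:14
            | ((ps & 0x7) << 16))    # bits 18:16


def make_vttbr(vmid, table_pa, vmid_bits=16):
    """Combine a VMID and a stage-2 table base address into a VTTBR_EL2 value."""
    if not 0 <= vmid < (1 << vmid_bits):
        raise ValueError("VMID does not fit in the configured width")
    return (vmid << 48) | (table_pa & 0x0000FFFFFFFFFFFE)   # BADDR in bits 47:1
```

With T0SZ = 24 the guest sees a 40-bit IPA space, which is a common choice because it keeps the stage-2 tables shallow with a 4 KB granule.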
The stage-2 walk is independent of stage-1. The hypervisor can disable or re-enable stage-2 independently, which is useful during early boot when the hypervisor is initializing a guest's memory map before enabling the full translation regime.
VHE: Running the Host Kernel at EL2
Without Virtualization Host Extensions, the standard AArch64 arrangement places
the KVM hypervisor stub at EL2 and the host Linux kernel at EL1. Every guest
entry and exit requires a full world switch: save all EL1 host registers, load
all EL1 guest registers, ERET to guest EL1 on entry; reverse the process on
exit, returning to EL2, then dropping back to EL1 host context. The host and
guest EL1 contexts are completely symmetric — both are "just an EL1" — but they
cannot coexist on the CPU simultaneously.
VHE (Virtualization Host Extensions), introduced in ARMv8.1-A, removes the
need for that split. When HCR_EL2.E2H = 1 (bit 34), the CPU enters a mode where
EL2 becomes a superset of EL1: the host kernel can run directly at EL2 with full
OS semantics. Most EL1 system registers accessed from EL2 redirect to their EL2
equivalents, so an access to SCTLR_EL1 at EL2 reaches SCTLR_EL2, and so on. New
_EL12 aliases (SCTLR_EL12, TCR_EL12, TTBR0_EL12, VBAR_EL12) give the
hypervisor access to the actual EL1 register contents for guest context
save/restore, without confusion from the E2H redirect.
A companion bit, HCR_EL2.TGE (bit 27), switches user-space semantics. When
TGE = 1 alongside E2H = 1, all physical exceptions from EL0 route to EL2 —
the EL0 threads are treated as host-OS user processes. When TGE = 0, EL0
threads are guest user space, with exceptions routing to EL1 as normal.
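The two bits combine into a small decision table. The sketch below encodes it in Python; the function name and the returned labels are hypothetical, and the E2H = 0 case is reported as non-VHE without modeling what TGE does there.

```python
# Illustrative decision table for HCR_EL2.E2H and HCR_EL2.TGE.
# The function name and return labels are hypothetical.

HCR_EL2_TGE = 1 << 27
HCR_EL2_E2H = 1 << 34


def el0_owner(hcr_el2: int) -> str:
    """Say whose user space EL0 belongs to for a given HCR_EL2 value."""
    if not hcr_el2 & HCR_EL2_E2H:
        return "non-VHE"   # E2H clear: not covered by this sketch
    if hcr_el2 & HCR_EL2_TGE:
        return "host"      # EL0 exceptions route to EL2
    return "guest"         # EL0 exceptions route to EL1
```

Under these rules, a VHE hypervisor would run with both bits set while host processes execute and clear TGE for the duration of a guest run.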
Linux KVM on ARMv8.1+ runs the host kernel at EL2 using VHE. The same kernel
binary supports both VHE and non-VHE (nVHE) via runtime alternative instruction
patching decided at boot based on CPU feature detection. Starting with ARMv9.5,
implementations may make HCR_EL2.E2H a RES1 field — permanently 1, making VHE
the only implemented behavior and removing the non-VHE code path from relevance
on new silicon.
flowchart LR
subgraph nvhe["nVHE (ARMv8.0)"]
el2s["KVM stub at EL2"]
el1h2["Host kernel at EL1"]
el1g2["Guest at EL1"]
el2s -->|"world switch"| el1g2
el2s --> el1h2
end
subgraph vhe["VHE (ARMv8.1+, E2H=1)"]
el2v["Host kernel + KVM at EL2"]
el1gv["Guest at EL1"]
el2v -->|"ERET / exception"| el1gv
end
Exit Handling on ARM
When the guest triggers an exception that routes to EL2, the syndrome register
ESR_EL2 records what happened. Bits 31:26 (EC field) carry the exception
class. KVM's arm_exit_handlers[] array in arch/arm64/kvm/handle_exit.c maps
EC values to handler functions via kvm_get_exit_handler().
Selected EC codes:
| Symbol | Cause |
|---|---|
| ESR_ELx_EC_WFx | WFI/WFE trap (guest idle) |
| ESR_ELx_EC_HVC64 | HVC (hypervisor call from guest) |
| ESR_ELx_EC_SMC64 | SMC (secure monitor call) |
| ESR_ELx_EC_SYS64 | System register access trap |
| ESR_ELx_EC_DABT_LOW | Data abort / MMIO / stage-2 fault |
| ESR_ELx_EC_IABT_LOW | Instruction abort |
ESR_ELx_EC_DABT_LOW is the ARM equivalent of an EPT violation: a stage-2
fault on a data access. The handler walks the ESR fields to determine whether the
fault is MMIO (no mapping exists and the address is in a device region) or a
genuine page fault (a mapping needs to be installed), then either emulates the
device access or populates the stage-2 table.
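The dispatch pattern can be sketched in a few lines of Python. The EC numeric values below are the ones defined in the Linux kernel's arch/arm64/include/asm/esr.h; the handler functions and their return strings are hypothetical stand-ins and do not reflect what KVM's handlers actually do.

```python
# Illustrative EC-based exit dispatch. EC values follow Linux's esr.h;
# the handlers are hypothetical placeholders, not KVM's implementations.

ESR_ELx_EC_SHIFT = 26
ESR_ELx_EC_MASK = 0x3F

ESR_ELx_EC_WFx = 0x01
ESR_ELx_EC_HVC64 = 0x16
ESR_ELx_EC_SMC64 = 0x17
ESR_ELx_EC_SYS64 = 0x18
ESR_ELx_EC_IABT_LOW = 0x20
ESR_ELx_EC_DABT_LOW = 0x24


def esr_ec(esr: int) -> int:
    """Extract the exception class from bits 31:26 of a syndrome value."""
    return (esr >> ESR_ELx_EC_SHIFT) & ESR_ELx_EC_MASK


def handle_unknown(esr: int) -> str:
    return "unknown exception class"


EXIT_HANDLERS = {
    ESR_ELx_EC_WFx: lambda esr: "guest idle",
    ESR_ELx_EC_HVC64: lambda esr: "hypervisor call",
    ESR_ELx_EC_SMC64: lambda esr: "secure monitor call",
    ESR_ELx_EC_SYS64: lambda esr: "system register trap",
    ESR_ELx_EC_IABT_LOW: lambda esr: "instruction abort",
    ESR_ELx_EC_DABT_LOW: lambda esr: "stage-2 data abort",
}


def dispatch(esr: int) -> str:
    """Look up the handler for this syndrome, falling back to a default."""
    return EXIT_HANDLERS.get(esr_ec(esr), handle_unknown)(esr)
```

A table keyed by EC keeps the exit path to one shift, one mask, and one lookup, with the fallback catching any class the hypervisor has not chosen to handle.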
GIC Virtualization
ARM guests need to receive interrupts. The Generic Interrupt Controller (GIC)
has had virtualization support since GICv2, adding a two-register-bank split:
the virtual interface control block (GICH_* registers) that the hypervisor
programs, and the virtual CPU interface (GICV_* registers) that the guest
reads as if they were the physical GIC CPU interface registers.
Virtual interrupt delivery uses GICH_LRn — the List Registers. Each LR entry
specifies a virtual interrupt ID, an optional physical interrupt ID (when HW = 1
for hardware-backed IRQs), a priority, a state (pending, active, or
pending+active), and EOI behavior. The number of list registers is reported by
GICH_VTR, typically between 4 and 16 on real hardware. GICH_HCR bit 0 (En)
must be set before guest entry or no virtual interrupts will be delivered.
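To make the list-register fields concrete, here is a minimal Python sketch that packs a GICv2 GICH_LRn value. The bit positions follow the GICv2 architecture specification (IHI0048B); the function names are hypothetical, the priority argument is the 5-bit field value, and the Grp1 bit and the SGI source-CPU field are left at zero.

```python
# Illustrative packing of a GICv2 list register (GICH_LRn).
# Function names are hypothetical; Grp1 and the SGI CPUID field are omitted.

LR_STATE_INVALID = 0b00
LR_STATE_PENDING = 0b01
LR_STATE_ACTIVE = 0b10
LR_STATE_PENDING_ACTIVE = 0b11


def make_gich_lr(virt_id, priority, state, phys_id=None, eoi_maintenance=False):
    """Build a list register value for one virtual interrupt.

    Passing phys_id selects HW = 1 (hardware-backed). Otherwise HW = 0 and
    eoi_maintenance asks for a maintenance interrupt when the guest EOIs.
    """
    if not 0 <= virt_id < 1024:
        raise ValueError("VirtualID is a 10-bit field")
    if not 0 <= priority < 32:
        raise ValueError("Priority is a 5-bit field")
    if state not in (0, 1, 2, 3):
        raise ValueError("State is a 2-bit field")
    lr = virt_id | (priority << 23) | (state << 28)
    if phys_id is not None:
        if not 0 <= phys_id < 1024:
            raise ValueError("PhysicalID is a 10-bit field")
        lr |= (1 << 31) | (phys_id << 10)   # HW bit plus PhysicalID
    elif eoi_maintenance:
        lr |= 1 << 19                        # EOI bit, valid only when HW = 0
    return lr


def gich_vtr_list_regs(vtr: int) -> int:
    """GICH_VTR.ListRegs (bits 5:0) holds the number of list registers minus 1."""
    return (vtr & 0x3F) + 1
```

Because only a handful of list registers exist, a hypervisor with more pending virtual interrupts than registers has to queue the rest in software and refill the registers on a later exit.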
When HW = 1 in a list register, hardware deactivates the physical interrupt
automatically when the guest writes EOI to GICV_EOIR. When HW = 0, the
hypervisor must manually deactivate the physical IRQ, usually by writing the
physical INTID to GICC_DIR (the CPU interface's Deactivate Interrupt Register),
after the guest's interrupt handler runs.
GICv3 and GICv4 move the interface to system registers: ICH_HCR_EL2 replaces
GICH_HCR, ICH_LR<n>_EL2 (up to 16) replace GICH_LRn, and the virtual CPU
interface becomes ICV_* registers at EL1. This system-register interface is
defined as part of the AArch64 architecture from ARMv8 onward; GICv3 support is
mandatory for any arm64 server platform.
Firecracker on aarch64 requires KVM_CAP_IRQCHIP — in-kernel GICv2 or GICv3
emulation — in addition to /dev/kvm. Hardware with CONFIG_KVM=y but no IRQ
chip capability at runtime cannot host Firecracker guests; this was confirmed by
Firecracker issue #1186, where a Raspberry Pi 3 built with KVM enabled failed
the capability check at startup.
PSCI
Guest power management on ARM does not use ACPI PM registers. Instead, the guest issues PSCI (Power State Coordination Interface, ARM DEN0022) calls using the SMCCC (DEN0028) calling convention. The conduit is SMC when EL3 is present, or HVC when only EL2 is available.
KVM emulates PSCI via kvm_smccc_call_handler(). Selected function IDs: the
SMC64 forms of CPU_ON and CPU_SUSPEND are 0xC4000003 and 0xC4000001, while
SYSTEM_OFF and SYSTEM_RESET take no 64-bit arguments and exist only as SMC32
IDs, 0x84000008 and 0x84000009. KVM exposes PSCI to guests via
the KVM_ARM_VCPU_PSCI_0_2 feature flag set on the vCPU. For events it cannot
handle internally — SYSTEM_OFF and SYSTEM_RESET — KVM returns
KVM_EXIT_SYSTEM_EVENT to user space. In Firecracker, a poweroff command
inside the guest triggers a PSCI SYSTEM_OFF call, KVM emits
KVM_EXIT_SYSTEM_EVENT, and Firecracker performs a clean shutdown, unmounting
storage and releasing resources before the VMM process exits.
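The function IDs quoted above are not arbitrary: SMCCC packs the call type, calling convention, owning service, and function number into one 32-bit value. The Python sketch below decodes them under the SMCCC layout (bit 31 fast call, bit 30 SMC64, bits 29:24 owner, bits 15:0 function number); the helper names are hypothetical, and the user-space check simply restates the two calls named in the text.

```python
# Illustrative decoding of SMCCC function IDs used by PSCI.
# Helper names are hypothetical; the IDs are the ones quoted in the text.

PSCI_FUNCTION_IDS = {
    "CPU_SUSPEND": 0xC4000001,
    "CPU_ON": 0xC4000003,
    "SYSTEM_OFF": 0x84000008,
    "SYSTEM_RESET": 0x84000009,
}

# Calls the text says KVM forwards to user space as KVM_EXIT_SYSTEM_EVENT.
SYSTEM_EVENT_CALLS = {"SYSTEM_OFF", "SYSTEM_RESET"}

OWNER_STANDARD_SECURE_SERVICE = 4  # PSCI lives in this SMCCC owner range


def decode_smccc_id(fid: int) -> dict:
    """Split a function ID into its SMCCC fields."""
    return {
        "fast_call": bool((fid >> 31) & 1),
        "smc64": bool((fid >> 30) & 1),
        "owner": (fid >> 24) & 0x3F,
        "function": fid & 0xFFFF,
    }


def psci_name(fid: int):
    """Reverse lookup; returns None for IDs not in the table above."""
    for name, value in PSCI_FUNCTION_IDS.items():
        if value == fid:
            return name
    return None


def forwarded_to_user_space(fid: int) -> bool:
    return psci_name(fid) in SYSTEM_EVENT_CALLS
```

Decoding 0xC4000003 this way yields a fast SMC64 call to owner 4, function 3, while 0x84000008 decodes as a fast SMC32 call to the same owner, function 8.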
How Silicon Enforces the Boundary
The same question runs through all three architectures: what actually prevents guest ring 0 from becoming host ring 0? The answer is not a software invariant or an operating system convention — it is hardware decode logic that the guest cannot observe or bypass.
On x86 with VT-x, the CPU maintains an internal mode bit distinguishing VMX root
from VMX non-root operation. Guest code executing at CPL 0 in VMX non-root
cannot use the VMX management instructions: they cause VM exits instead of
executing. It cannot read or write the VMCS: VMREAD and VMWRITE exit to the VMM
unless the VMM has deliberately enabled VMCS shadowing for a nested hypervisor.
It cannot execute VMXOFF or VMXON, which always cause a VM exit. The
guest's view of physical memory is confined to what EPT maps, and EPT is
controlled entirely by the VMM running in root mode. The guest kernel therefore
operates at full CPL 0 privilege within its architectural world, but that world
is bounded by hardware: nothing the guest can do in non-root mode reaches host
state or host memory.
On AMD-V, the VMRUN intercept bit (index 128 in the VMCB intercept bitmap) must
be set in every VMCB: VMRUN's consistency checks reject a VMCB in which it is
clear. Every VMRUN the guest attempts therefore fires a #VMEXIT back to the real
hypervisor. The VMCB resides in host-physical memory that NPT does not expose to
the guest once the nested page tables are active. Host state is saved to and
restored from the MSR_VM_HSAVE_PA page, which is likewise outside the guest's
NPT-visible address space.
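An intercept index such as 128 maps to a word and a bit within the bitmap. The sketch below assumes the bitmap is stored as consecutive 32-bit words, which is how the Linux kernel's VMCB control area models it; the helper names and the six-word list used in the example are assumptions for illustration.

```python
# Illustrative intercept-bitmap arithmetic, assuming 32-bit words.
# Helper names are hypothetical.

INTERCEPT_VMRUN = 128
BITS_PER_WORD = 32


def intercept_position(index: int):
    """Return (word, bit) for an intercept index."""
    return divmod(index, BITS_PER_WORD)


def set_intercept(words, index: int) -> None:
    """Set one intercept bit in a list of 32-bit words."""
    word, bit = intercept_position(index)
    words[word] |= 1 << bit


def is_intercepted(words, index: int) -> bool:
    word, bit = intercept_position(index)
    return bool(words[word] & (1 << bit))


# Example: a zeroed bitmap with only the VMRUN intercept enabled.
example_words = [0] * 6
set_intercept(example_words, INTERCEPT_VMRUN)
```

Index 128 lands on bit 0 of word 4, which corresponds to byte offset 0x10 from the start of the bitmap.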
On ARM, the mechanism is the exception level hierarchy itself. The EL2 system
registers are not accessible from EL1. Writing HCR_EL2 requires EL2, so the
guest cannot disable stage-2 translation by clearing HCR_EL2.VM. Writing
VTTBR_EL2 to install new stage-2 page tables requires EL2, so the guest cannot
remap the hypervisor's memory into its own address space. An EL1 instruction
that attempts to access an EL2-only register never takes effect: it is
UNDEFINED at EL1, or it traps to EL2 when the hypervisor has enabled
nested-virtualization traps. The hardware exception model also guarantees that
exceptions from one VM's EL1 cannot route to another VM's EL1; switching between
VMs always goes through EL2.
Chapter 5 builds on this foundation — EPT and NPT give the hypervisor the same kind of structural control over the guest's view of physical memory that the mode bit and exception-level hierarchy provide over instruction execution.
Sources And Further Reading
- Intel SDM Vol 3C (virtualization chapter, primary reference for VT-x, VMCS structure, VMLAUNCH/VMRESUME, VM-exit reason table): https://cdrdv2-public.intel.com/789585/326019-sdm-vol-3c.pdf
- Intel SDM Vol 3 combined: https://cdrdv2-public.intel.com/671447/325384-sdm-vol-3abcd.pdf
- felixcloutier x86 instruction reference — VMLAUNCH/VMRESUME: https://www.felixcloutier.com/x86/vmlaunch:vmresume
- felixcloutier — VMXON: https://www.felixcloutier.com/x86/vmxon
- felixcloutier — VMREAD: https://www.felixcloutier.com/x86/vmread
- felixcloutier — CPUID: https://www.felixcloutier.com/x86/cpuid
- Intel SDM rendered — VMCS region format: https://xem.github.io/minix86/manual/intel-x86-and-64-manual-vol3/o_fe12b1e2a880e0ce-1049.html
- Intel SDM rendered — VMCS field encoding table: https://xem.github.io/minix86/manual/intel-x86-and-64-manual-vol3/o_fe12b1e2a880e0ce-1071.html
- Intel SDM rendered — VM-exit reason field: https://xem.github.io/minix86/manual/intel-x86-and-64-manual-vol3/o_fe12b1e2a880e0ce-1067.html
- Intel SDM rendered — exit reason table (Appendix C): https://xem.github.io/minix86/manual/intel-x86-and-64-manual-vol3/o_fe12b1e2a880e0ce-1961.html
- Intel SDM rendered — IA32_VMX_BASIC: https://xem.github.io/minix86/manual/intel-x86-and-64-manual-vol3/o_fe12b1e2a880e0ce-1943.html
- Linux kernel arch/x86/include/asm/vmx.h (VMCS field encodings): https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/vmx.h
- Linux kernel arch/x86/kvm/vmx/vmcs.h (struct loaded_vmcs, vmcs01/12/02): https://github.com/torvalds/linux/blob/master/arch/x86/kvm/vmx/vmcs.h
- Linux kernel arch/x86/kvm/vmx/vmx.c (KVM VT-x backend): https://github.com/torvalds/linux/blob/master/arch/x86/kvm/vmx/vmx.c
- KVM nested VMX documentation: https://docs.kernel.org/virt/kvm/x86/nested-vmx.html
- axvisor VMX definitions and exit-reason types: https://arceos-hypervisor.github.io/axvisor-crates-book/x86_vcpu/VMX_Definitions_and_Types.html
- Linux kernel arch/x86/include/asm/svm.h (VMCB struct, intercept enums, clean bits, EVENTINJ): https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/svm.h
- Linux kernel arch/x86/include/uapi/asm/svm.h (SVM_EXIT_* codes): https://github.com/torvalds/linux/blob/master/arch/x86/include/uapi/asm/svm.h
- Linux kernel arch/x86/include/asm/msr-index.h (MSR addresses including MSR_VM_CR, MSR_VM_HSAVE_PA, EFER): https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/msr-index.h
- Linux kernel arch/x86/kvm/svm/svm.c (ASID allocation, NPT init, VMRUN/VMSAVE/VMLOAD path): https://github.com/torvalds/linux/blob/master/arch/x86/kvm/svm/svm.c
- Linux kernel arch/x86/kvm/svm/svm.h (VMCB clean bits enum): https://github.com/torvalds/linux/blob/master/arch/x86/kvm/svm/svm.h
- raw-cpuid crate SvmFeatures (CPUID 0x8000000A EDX bits): https://docs.rs/raw-cpuid/latest/raw_cpuid/struct.SvmFeatures.html
- AMD APM Vol 2 (doc 24593, authoritative VMCB reference): https://www.amd.com/system/files/TechDocs/24593.pdf
- NoirVisor SVM core — NPT execute-only limitation: https://github.com/Zero-Tang/NoirVisor/blob/master/src/svm_core/readme.md
- ARM AArch64 Exception Model v1.3: https://documentation-service.arm.com/static/63a065c41d698c4dc521cb1c
- ARM Developer — Exception levels: https://developer.arm.com/documentation/dui0801/l/Overview-of-AArch64-state/Exception-levels
- ARM HCR_EL2 register reference (DDI0601): https://developer.arm.com/documentation/ddi0601/latest/AArch64-Registers/HCR-EL2--Hypervisor-Configuration-Register
- Jon Palmisc ARM register reference — HCR_EL2: https://arm.jonpalmisc.com/latest_sysreg/AArch64-hcr_el2
- ARM Learn the Architecture — Stage-2 translation: https://developer.arm.com/documentation/102142/0100/Stage-2-translation
- ARM Learn the Architecture — VHE: https://developer.arm.com/documentation/102142/0100/Virtualization-host-extensions
- LWN — arm64 VHE patch series: https://lwn.net/Articles/650524/
- LWN — FEAT_E2H0, ARMv9.5 E2H=RES1: https://lwn.net/Articles/959153/
- Linux kernel arch/arm64/kvm/handle_exit.c (EC code handlers, arm_exit_handlers[]): https://github.com/torvalds/linux/blob/master/arch/arm64/kvm/handle_exit.c
- KVM ARM PSCI documentation: https://www.kernel.org/doc/html/v5.6/virt/kvm/arm/psci.html
- arm-trusted-firmware psci.h (PSCI function IDs): https://github.com/96boards-poplar/arm-trusted-firmware/blob/master/include/lib/psci/psci.h
- ARM GICv2 List Registers IHI0048B: https://developer.arm.com/documentation/ihi0048/b/GIC-Support-for-Virtualization/GIC-virtual-interface-control-registers/List-Registers--GICH-LRn
- OSDev — GICv3/v4: https://wiki.osdev.org/Generic_Interrupt_Controller_versions_3_and_4
- Firecracker FAQ: https://github.com/firecracker-microvm/firecracker/blob/main/FAQ.md
- Firecracker CPU templates documentation (V1N1 ARM template, TID3 mechanism): https://github.com/firecracker-microvm/firecracker/blob/main/docs/cpu_templates/cpu-templates.md
- Firecracker issue #1186 (aarch64 KVM IRQ chip capability requirement): https://github.com/firecracker-microvm/firecracker/issues/1186