In 1974, Gerald Popek and Robert Goldberg published a paper, “Formal Requirements for Virtualizable Third Generation Architectures”, giving a set of characteristics for correct full-machine virtualisation.
Today, these characteristics remain very useful. Computer architects will informally cite this paper when debating Instruction Set Architecture (ISA) developments, with arguments like “but that’s not Popek & Goldberg-compliant!”
In this post I’m looking at one aspect of computer architecture evolution since 1974, and observing how RISC-style atomic operations provide some potential virtualisation gotchas for both programmers and architects.
Principles of virtualisation
First, some virtualisation context, because it’s fun!
A key P&G requirement is that of equivalence: it’s reasonable to expect software running under virtualisation to have the same behaviour as running it bare-metal! This property is otherwise known as correctness. :-)
P&G classify instructions as being sensitive if they behave differently when running at a lower privilege level (i.e. the program can detect that it is being run in a different manner). An ISA is said to be classically virtualisable if:
- Sensitive instructions are privileged, and
- Privileged instructions executed at a lower privilege level can be trapped to a higher level of privilege.
For a classically-virtualisable system, perfect equivalence can then be achieved by running software at a lower than usual level of privilege, trapping all privileged/sensitive instructions, and emulating their behaviour in a VMM. That is, if the design of the ISA ensures that all “sensitive” instructions can be trapped, it’s possible to ensure the logical execution of the software cannot be different to running bare-metal.
This virtualisation technique is called “privilege compression”.
Note: This applies recursively, running OS-level software with user privilege, or hypervisor-level software at OS/user privilege. Popek & Goldberg formalise this too, giving properties required for correct nested virtualisation.
System/360 and PowerPC are both classically virtualisable, almost as though IBM thought about this. ;-) Equivalent virtualisation can be achieved by:
- Running an OS in user mode (privilege compression, for CPU virtualisation),
- Catching traps (to supervisor mode/HV) when the guest OS performs a privileged operation,
- In the hypervisor, operating on a software-maintained “shadow” of what would have been the guest OS’s privileged CPU state were it running bare-metal.
- Constructing shadow address translations (for memory virtualisation).
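The trap-and-emulate core of the scheme above can be sketched in a few lines of C. This is a minimal illustration of the idea only; all the names here (`shadow_state_t`, `emulate_trap`, the opcode set) are hypothetical, not any real hypervisor's API:

```c
#include <stdint.h>

/* Hypothetical shadow of the guest's privileged CPU state, maintained by
 * the hypervisor in software (illustrative sketch, not a real VMM). */
typedef struct {
    uint64_t msr;      /* shadow machine-state register */
    uint64_t sprg[4];  /* shadow special-purpose registers */
} shadow_state_t;

enum { OP_MFSPRG, OP_MTSPRG, OP_UNKNOWN };

typedef struct { int op; int spr; uint64_t *gpr; } decoded_insn_t;

/* Called when a privileged instruction traps from the deprivileged guest:
 * emulate it against the shadow state instead of the real hardware. */
static int emulate_trap(shadow_state_t *s, decoded_insn_t *i)
{
    switch (i->op) {
    case OP_MFSPRG: *i->gpr = s->sprg[i->spr]; return 0; /* read shadow */
    case OP_MTSPRG: s->sprg[i->spr] = *i->gpr; return 0; /* write shadow */
    default:        return -1; /* unhandled: reflect to guest or abort */
    }
}
```

The guest's `mfsprg` never touches the host's real SPRGs; it sees only the shadow, which is exactly what "the hypervisor can observe and control all of the guest's state" requires.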
Linux’s KVM support on PowerPC includes a “PR” feature, which does just this: for CPUs without hardware virtualisation, guests are run in user mode (or “PRoblem state” in IBM lingo).
Note: It is key that the hypervisor can observe and control all of the guest’s state.
Today, most systems address the performance impact of all of this trap-and-emulate by providing hardware CPU and memory virtualisation (e.g. user, OS and hypervisor execution privilege levels, with nested page tables). But, classically virtualisable ISA design remains important for clear reasoning about isolation between privilege levels and composability of behaviours.
Computers in 1974 were ~all CISC
All computers in 1974 were available in corduroy with a selection of Liberty-print input devices. All consoles had ashtrays (not even joking tbh).
Architecture-wise, IBM was working on early RISC concepts leading to the 801, but most of the industry was on a full-steam trajectory to peak CISC (VAX) in the late 1970s. It’s fair to say that “CISC” wasn’t even a thing yet; instruction sets were just complex. P&G’s paper considered three contemporary computers:
- IBM System/360
- Honeywell 6000
- DEC PDP-10
CISC atomic operations and synchronisation primitives
These machines had composite/”read-modify-write” atomic operations, similar to the locked read-operate-write instructions in today’s x86. System/360 provided test-and-set (with compare-and-swap arriving in System/370), and the PDP-10 had EXCHange/swap.
These kinds of instructions are not sensitive so, unless the addressed memory is privileged, atomic operations can be performed inside virtual machines without the hypervisor needing to know.
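As a concrete illustration, a CISC-style compare-and-swap is a single indivisible read-modify-write with no hidden state: it either succeeds or fails in one instruction, so a trap can never land in the middle of it. A sketch using C11 atomics (my own illustrative function, not from any particular codebase):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* A compare-and-swap in the CISC style: one indivisible read-modify-write,
 * with no reservation or other hidden state. Returns true if we took the
 * lock (i.e. *lock transitioned 0 -> 1). Illustrative sketch only. */
static bool cas_acquire_lock(atomic_int *lock)
{
    int expected = 0;
    /* Succeeds, and stores 1, only if *lock still holds 0. */
    return atomic_compare_exchange_strong(lock, &expected, 1);
}
```

On x86 this compiles to a single `lock cmpxchg`, which is why such operations virtualise transparently.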
Atomic operations in RISC machines
Many RISC machines support multi-instruction synchronisation sequences built up around two instruction primitives:
MIPS called these load-linked (LL) and store-conditional (SC), and I’ll use these terms. ARMv8 has LDXR/STXR. PowerPC has LWARX/STWCX. RISC-V has LR/SC. Many machines (such as ARMv8-LSE) also add composite operations such as CAS or atomic addition but still retain the base LL/SC mechanism, and sizes/acquire/release variants are often provided.
The concept is that the LL simultaneously loads a value and sets a “reservation” covering the address in question, and a subsequent SC succeeds only if the reservation is still present. A conflicting write to the location (e.g. a store on another CPU) clears the reservation and the SC returns a failure value without modifying memory; LL/SC are performed in a loop to retry until the update succeeds.
An LL/SC sequence can typically be arbitrarily complex – a lock routine might test a location is cleared and store a non-zero value if so, whereas an update might increment a counter or calculate a “next” value, and so on. Typically an ISA does not restrict what lies between LL and SC.
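The retry-loop shape described above can be sketched with C11 atomics. On RISC targets a compiler typically lowers `atomic_compare_exchange_weak` loops to an LL/SC pair, and the “weak” form is permitted to fail spuriously, which mirrors reservation loss; this is a hedged sketch of the pattern, not any particular codebase’s code:

```c
#include <stdatomic.h>

/* The shape of an LL/SC update loop, via C11 atomics: the load plays the
 * role of the LL, and compare_exchange_weak plays the SC, retried until
 * it lands. "Weak" may fail spuriously -- much like losing a reservation
 * -- so correct code must always loop. Illustrative sketch. */
static int fetch_increment(atomic_int *ctr)
{
    int old = atomic_load_explicit(ctr, memory_order_relaxed); /* the "LL" */
    /* Retry while the "SC" fails (real conflict or spurious loss);
     * on failure, 'old' is refreshed with the current value. */
    while (!atomic_compare_exchange_weak(ctr, &old, old + 1))
        ;
    return old;
}
```

The arbitrary work permitted between LL and SC corresponds to whatever computes the new value before the exchange.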
Coming back to virtualisation requirements, the definition of a reservation is interesting because it’s effectively “hidden state” that the hypervisor cannot manage. Typically, a hypervisor cannot easily read whether a reservation exists, and it can’t be saved/restored1. CISC-like RmW atomic operations do not exhibit this property.
Problem seen, problem felt
Shall I get to the point? I saw an odd but legal guest code sequence that can be difficult to virtualise.
I’ve been trying to run MacOS 9.2 in KVM-PR on a PowerPC G4, and observed that the NanoKernel’s acquire-lock routine happens to use a sensitive instruction (mfsprg) between the lwarx and stwcx. instructions. This is strange, and guarantees a trap to the host between the LL and SC operations. Though the guest should not be doing weird stuff when acquiring a lock, it’s still an architecturally-correct program.
This means that if the reservation isn’t preserved across the trap, the lock is never taken. Forward progress is never achieved and virtualisation equivalence is not maintained (because the guest livelocks).
Specifically, if the reservation is always cleared on the trap, we have a problem. If it is sometimes kept, the guest program can progress.
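The failure mode can be shown with a toy model: a fake LL/SC where a “trap” between the LL and SC may clear the reservation. If the trap always clears it, the SC never succeeds and the loop spins forever, so the model bounds the retries to make the livelock observable. This is purely an illustrative simulation, not real hardware behaviour:

```c
#include <stdbool.h>

/* Toy single-threaded model of an LL/SC reservation (illustration only). */
static bool reservation;

static int  ll(const int *addr) { reservation = true; return *addr; }

static bool sc(int *addr, int v)
{
    if (!reservation) return false;  /* reservation lost: SC fails */
    *addr = v;
    reservation = false;
    return true;
}

/* Models a sensitive instruction trapping to the hypervisor between
 * LL and SC; 'clears' models whether the trap destroys the reservation. */
static void trap(bool clears) { if (clears) reservation = false; }

/* Returns true if the lock was taken within max_tries attempts. */
static bool try_lock(int *lock, bool trap_clears, int max_tries)
{
    for (int i = 0; i < max_tries; i++) {
        if (ll(lock) != 0) continue;   /* lock already held: retry */
        trap(trap_clears);             /* the guest's mfsprg-style trap */
        if (sc(lock, 1)) return true;  /* SC succeeded: lock taken */
    }
    return false;                      /* the real guest would spin forever */
}
```

With `trap_clears` false the lock is taken on the first attempt; with it true, no number of retries ever succeeds.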
Since the state is hidden (the hypervisor can’t save/restore/re-create), correctness depends on two things:
- The hypervisor’s exception-emulation-return path not itself clearing the reservation every time for any possible trap
- The ISA and hardware implementation guaranteeing the reservation is not always cleared by hardware
This potential issue isn’t limited to PPC or the MacOS guest.
The hypervisor must guarantee two things:
- It must not intentionally clear reservations on all traps.
- It must not accidentally do so as a side-effect of a chosen activity:
- For example, using its own synchronisation primitives elsewhere, or by writing memory that would conflict with the guest’s reservation.
This can be challenging: context switching must be avoided in the T&E handler (no sleep or pre-emption), and it can’t take locks.
In my MacOS guest experiment, KVM-PR happens not to use any synchronisation primitives on its emulation path (delicate, but it works) – but I had tracing enabled, which does. The guest locked up.
But does your CPU guarantee that reservations aren’t always cleared?2
That seems to depend. This morning’s light reading gives:
PowerISA is comparatively clear on the behaviour (which isn’t surprising, as PowerISA is generally very clearly specified). PowerISA v3.1’s description of reservations lists specific reasons for reservation loss. Some are the expected “lose the reservation if someone else hits the memory” reasons, but previous PowerISAs (e.g. 2.06) permitted embedded implementations to clear the reservation on all exceptions. This permission was removed in 3.1; in my opinion a good move. (I did just this, for reasons, in my homebrew PowerPC CPU, oops!)
PowerISA does permit spontaneous reservation loss due to speculative behaviour, but is careful to require that forward progress is guaranteed (i.e. that an implementation doesn’t happen to clear the reservation every time for a given piece of code).
Finally, it includes a virtualisation-related programming note stating that a reservation may be lost if software executes a privileged instruction or utilizes a privileged facility (i.e. sensitive instructions). This expresses intent, but isn’t normative specification: it can’t retroactively outlaw the behaviour of an existing guest unless the rule was there from the dawn of time.
At any rate, this post is going to be old news to the PowerISA authors. Nice doc, 8/10, good jokes, would read again.
RISC-V, having no guest legacy to accommodate, can solve the problem from the other direction. Interestingly, the ISA explicitly constrains the instruction sequences between LR and SC:
“The dynamic code executed between the LR and SC instructions can only contain instructions from the base “I” instruction set, excluding loads, stores, backward jumps, taken backward branches, JALR, FENCE, FENCE.I, and SYSTEM instructions.”
This is a good move. Tacitly, this bans sensitive instructions in the critical region, and permits an absence of progress if the guest breaks the rules. Ruling out memory accesses is interesting too, because it can be useful for a hypervisor to be able to T&E any given page in the guest address space without repercussions.
Reservation granule size
An LL operation is usually architecturally permitted to set an address-based reservation with a size larger than the original access, called the “reservation granule”. A larger granule reduces tracking requirements but increases the risk of a kind of false sharing between locks where an unrelated CPU taking an unrelated lock could clear your CPU’s reservation.
This is important to our hypervisor, because of guarantee #2 above: when emulating a sensitive instruction it must not access anything that always causes the reservation to clear. You would hope the guest doesn’t soil itself by executing an instruction against its interests, so we can assume the guest won’t intentionally direct the hypervisor to hit on shared addresses, but if hypervisor and guest memory could ever coexist within a reservation granule there is scope for conflict.
PowerPC defines the largest granule as, effectively, the (small) page size. ARM defines it as 4KB (effectively, the same). It’s a reasonable architectural assumption that guest and host memory is disjoint at page size granularity.
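The granule test is just address masking. A small sketch (the function name and power-of-two assumption are mine; 4KB matches the page-size granule discussed above):

```c
#include <stdint.h>
#include <stdbool.h>

/* Do two addresses fall within the same reservation granule? If a
 * hypervisor store lands in the guest's granule, it may clear the
 * guest's reservation. Assumes granule is a power of two (e.g. 4KB,
 * per the PowerPC/ARM discussion above). Illustrative sketch. */
static bool same_granule(uintptr_t a, uintptr_t b, uintptr_t granule)
{
    return (a & ~(granule - 1)) == (b & ~(granule - 1));
}
```

With a 4KB granule and guest/host memory disjoint at page granularity, hypervisor stores can never alias a guest reservation.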
RISC-V permits the reservation granule to be unlimited, which isn’t great3 – but later notes that “a platform specification may constrain the size and shape of the reservation set. For example, the Unix platform is expected to require of main memory that the reservation set be of fixed size, contiguous, naturally aligned, and no greater than the virtual memory page size.”
An ISA cannot be classically virtualised if it permits some aspect of trapping or emulation (such as the exception itself) to always cause a reservation to be cleared, unless sensitive instructions are prohibited from any region dependent on a reservation.
In computer-science terms, it would be quite unsatisfying if a sequence of RISC instructions could not be classically virtualised because of hidden state.
In practical terms, trap-and-emulate is alive and well in systems supporting nested virtualisation. Although some ISAs provide a level of hardware support for NV, it tends to be assists to speed up use of privilege compression rather than more exception levels and more translation stages (which, to be fair, would be awful). Consequently there is always something hypervisor-privileged being trapped to the real hypervisor, i.e. T&E is used in anger.
So, there are some hardware behaviours which must (continue to be) guaranteed and, unfortunately, some constraints on already-complex software which must be observed.
I thought this small computer architecture safari might be interesting to others, and hope you enjoyed the read!
In theory an ISA could provide the hypervisor with a previous reservation’s address, but re-creating it with a later LL raises ordering model questions! ↩
Sorry for the double-negative, but this alludes to the possibility of architecture permissions (for example, statements like “X is permitted to spontaneously happen at any time”) leading to implementations taking convenient liberties such as “always do X when any cache line is fetched”. If these decisions were to exist, they would be impossible to avoid stepping on, even with a carefully-written hypervisor. ↩
It would be terrible to permit an implementation to allow all hypervisor memory accesses to clear the reservation! ↩