You probably know intuitively that applications have limited powers in Intel x86 computers and that only operating system code can perform certain tasks, but do you know how this really works? This post takes a look at x86 privilege levels, the mechanism whereby the OS and CPU conspire to restrict what user-mode programs can do.
Gustavo Duarte has written a well-researched article explaining user-mode and kernel-mode operation. To fully understand that article, you should first read his previous post on memory translation and segmentation.
I posted a couple of questions in his blog comments and he answered them in great detail. Here are the questions and the answers:
Q: You had mentioned that modern x86 kernels use the "flat model" without any segmentation. But won't that restrict the size of the addressable memory to ~4GB?
A: The segments only affect the translation of "logical" addresses into "linear" addresses. Flat model means that these addresses coincide, so we can basically ignore segmentation. But all of the linear addresses are still fed to the paging unit (assuming paging is turned on), so there's more magic that happens there to transform a linear address into a physical address.
Check the first diagram in the previous post; it should make this clearer. The flat model eliminates the first step (logical to linear), but the last step remains, and that is what enables addressing more than 4GB.
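To picture the two steps, here is a purely conceptual sketch (not real hardware code; the function names are made up for illustration). In the flat model the segmentation step is the identity function, but the paging step still runs afterwards:

    #include <stdint.h>

    /* Flat model: every segment has base 0, so the "logical to linear"
     * step changes nothing. */
    static uint32_t segmentation(uint32_t logical)
    {
        uint32_t segment_base = 0;          /* flat model */
        return segment_base + logical;      /* linear address */
    }

    /* The paging unit still translates every linear address afterwards.
     * This stub just identity-maps; on real hardware it is a page-table
     * walk, and with PAE the result can be wider than 32 bits. */
    static uint64_t paging(uint32_t linear)
    {
        return (uint64_t)linear;
    }

    /* Logical -> linear -> physical, as in the diagram mentioned above. */
    uint64_t translate(uint32_t logical)
    {
        return paging(segmentation(logical));
    }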
Now, regarding maximum memory, there are three issues: the size of the linear address space, the conversion of a linear address into a physical address, and the physical address pins on the processor that let it talk to the bus.
When the CPU is in 32-bit mode, the linear address space is _always_ 32 bits and is therefore limited to 4 GB. However, the physical pins coming out of the CPU can address up to 64GB of RAM on the bus, ever since the Pentium Pro.
So the trouble is all in the translation of linear addresses into physical addresses. When the CPU is running in "traditional" mode, the page tables that transform a linear address into a physical one only work with 32-bit physical addresses, so the computer is confined to 4 GB total RAM.
But then Intel introduced the Physical Address Extension (PAE) to enable servers to use more physical memory. This changes the MECHANISM of translation of a linear address into a physical address. It works by changing the format of the page tables, allowing more bits for the physical address. So at that point, more than 4 GB of physical memory CAN be used.
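To make the "allowing more bits" part concrete, here is a rough sketch of where the physical frame number lives in the two page-table-entry formats (bit layouts simplified; the masks assume the 36-bit physical addressing of early PAE parts):

    #include <stdint.h>

    /* Classic 32-bit paging: a page-table entry is 32 bits and the frame
     * number sits in bits 12-31, so physical addresses top out at 4 GB. */
    static uint32_t classic_pte_to_phys(uint32_t pte, uint32_t offset)
    {
        return (pte & 0xFFFFF000u) | (offset & 0xFFFu);
    }

    /* PAE paging: entries grow to 64 bits, leaving room for extra physical
     * address bits (36 total on the early PAE implementations). */
    static uint64_t pae_pte_to_phys(uint64_t pte, uint32_t offset)
    {
        return (pte & 0x0000000FFFFFF000ull) | (offset & 0xFFFull);
    }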
The problem is that processes are still confined to a 32-bit linear address space. So if you have a database server that wants to address 12 gigs, say, it will have to map different regions of physical memory at a time. It only has a 4 gig linear window into the 12 gigs of physical ram.
Q: If we want to use PAE, we need changes in the kernel code, right? Is that why we have server versions of Windows?
A: That’s right, the kernel needs to do most of the work, since it’s the one responsible for building the page tables for the processes.
Also, if a single process wants to use more than 4GB, then the process _also_ must be aware of this stuff, because it needs to make system calls into the kernel saying "hey, I want to map physical range 5GB to 6GB into my 2GB-3GB linear range", or "map 10GB-11GB now", and so on. (Of course, there would be security checks. Also, these are nice round numbers; in practice it would probably be done in chunks of some number of KB, depending on the application.)
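A rough userland analogy of that windowing idea (just a sketch; the file path and sizes below are invented): keep one fixed-size linear window and re-map it over different offsets of a backing store that is larger than the address space.

    #define _FILE_OFFSET_BITS 64        /* 64-bit offsets on 32-bit builds */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define WINDOW_SIZE (1UL << 30)     /* a 1 GB linear window */

    static void *window;

    /* Slide the same linear window over a different part of the store. */
    static void *slide_window(int fd, off_t offset)
    {
        if (window)
            munmap(window, WINDOW_SIZE);
        window = mmap(NULL, WINDOW_SIZE, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, offset);
        if (window == MAP_FAILED)
            window = NULL;
        return window;
    }

    int main(void)
    {
        int fd = open("/tmp/big-backing-store", O_RDWR);  /* hypothetical */
        if (fd < 0) { perror("open"); return 1; }

        slide_window(fd, 5LL * (1LL << 30));   /* view bytes 5 GB..6 GB  */
        slide_window(fd, 10LL * (1LL << 30));  /* now view 10 GB..11 GB  */

        close(fd);
        return 0;
    }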
Regarding Windows, that's an interesting point. It's too strong to say that PAE _is_ why we have server versions of Windows, but it's definitely something Microsoft has used extensively for price discrimination. It's not only the Windows kernel: apps like SQL Server also have pricier editions that support PAE. The kernels in the server editions of Windows have other tweaks as well, though, in the process-scheduling and memory-allocation algorithms. But PAE has definitely been one carrot (or stick?) to get some more money.
Linux has had PAE support since the start of 2.6. To use it, it must be enabled at kernel compile time. I'm not sure whether it's enabled in the kernels that ship with the various distros. I've never looked much into the kernel's PAE code, to be honest, so I'm ignorant here. My understanding, though, is that if it's enabled, it's enabled once and for all on the machine, for all CPUs and processes.
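For what it's worth, on 32-bit x86 the compile-time switches look roughly like this (the option names have shifted between kernel versions, so treat them as indicative rather than exact):

    # 32-bit x86 kernel configuration (2.6-era naming)
    CONFIG_HIGHMEM64G=y    # high-memory support for machines with more than 4 GB
    CONFIG_X86_PAE=y       # enables the wider PAE page-table format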
What I've always wondered is: why are rings 1 and 2 never used?
Because other platforms only had two modes, so OSes intended to be cross-platform preferred to use only two of the four rings. Also, some parts of the IA-32 architecture don't distinguish four rings, only two modes: system and user, where system means rings 0, 1 and 2, and user means ring 3. Page protection works this way, for example. And finally, adding extra rings just adds extra complexity that can be dealt with by more comprehensive security mechanisms in software, which is what happens today. It's much the same reason it's better to avoid using x86 segmentation or TSSes for task switching in favor of a software solution that is portable and can be fine-tuned to the needs of the OS, in general or even at a particular point in time (such as under heavy load).
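For instance, a classic 32-bit page-table entry has only a single user/supervisor bit, so the paging hardware can tell "ring 3" from "not ring 3" and nothing finer. A small sketch of that check (illustrative only, not real kernel code):

    #include <stdint.h>

    #define PTE_PRESENT  (1u << 0)   /* page is mapped */
    #define PTE_WRITABLE (1u << 1)   /* writable if set, read-only if clear */
    #define PTE_USER     (1u << 2)   /* U/S bit: set = ring 3 may access,
                                        clear = supervisor only (rings 0-2) */

    /* The only privilege question paging asks: is this access from ring 3?
     * Rings 0, 1 and 2 all count as "supervisor" here.  (Supervisor write
     * protection depends on CR0.WP and is omitted from this sketch.) */
    static int page_allows_access(uint32_t pte, unsigned cpl, int is_write)
    {
        if (!(pte & PTE_PRESENT))
            return 0;
        if (cpl == 3 && !(pte & PTE_USER))
            return 0;                          /* user code, supervisor page */
        if (cpl == 3 && is_write && !(pte & PTE_WRITABLE))
            return 0;                          /* user write to read-only page */
        return 1;
    }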
Rings 1 and 2 were invented to run code that is "less trusted than the kernel, but more trusted than user processes", such as system software and device drivers. Consequently, some of the privileged operations from ring 0 are allowed in ring 2, and a few more in ring 1, but still not all.
It was an early attempt to make the OS kernel resistant against faulty drivers. However, in the end it was too inflexible to support all the different security schemes from various OSes, and so developers started to ignore them and implement everything in ring 0.
Another reason is of course that Unix was designed with only two privilege levels in mind and was not able to make use of four without significant re-design during porting.
There are more parts of the x86 design like this, parts that were designed to support OS implementation but in the end turned out to be too inflexible. The article already mentioned call gates (for hardware-supported system-call entry, and possibly inter-process communication) and task gates (hardware-supported context switching and also IPC). It did not mention that the x86 actually has a hardware-implemented process table. It was an attempt to implement half the OS in hardware, and it was quickly forgotten once OSes were designed that behaved differently from the hardware implementation. Nowadays it's only there for backwards compatibility.
I believe they are rarely (if ever) used merely because of the overhead involved in switching modes.
One dread of programmers (well, certain ones) is dropping into kernel mode because of the overhead it incurs.
IIRC, rings 1/2 are programmable with varying degrees of privilege. I know that Intel's and AMD's virtualization technologies use a specially configured ring 1 so that a guest OS can load normally without interfering with the host OS's kernel.
I'd imagine ring 1 is programmed to match ring 0, but is merely segmented within the processor. I'd almost bet a nickel that ring 2 could (or would) thus be programmed to match ring 3, and that rings 1/2 would be used for the guest OS, and 0/3 for the host. Either that, or some manner of tracking is used to separate the data and instructions of the two concurrent systems.
I don't know much (yet) about what accesses can be controlled with the various rings, though I have high hopes of using a ring to create a 'service mode/space' which would act as an intermediary between kernel and user spaces/modes.
In my theoretical model, this service-mode would be able to READ data directly from certain shared areas of the kernel and devices, and READ/WRITE certain user-mode memory within applications which have created a shared memory area for this usage.
My idea is based on improving BeOS's model, which has known issues due to the lack of certain accesses and its kernel-mode fallbacks (certainly security flaws).
Not to mention memory duplication in BeOS of the highest order: BBitmap, which has a copy in the application's memory space AND a copy in the app_server's memory space. The app_server cannot read or write the app's memory directly (it can fake it, though, using ports, which is slower), so it keeps a copy of the bitmap locally, while the app keeps a copy it can work on.
Granted, there are some major locking concerns when it comes to using shared memory (I'll keep using bitmaps as an example). Currently, the app_server doesn't give a rat's arse whether the app has made changes to the bitmap, because it has its own local copy which is synced when the app-side bitmap is unlocked and dirty.
If you were to show a bitmap live, during its changes, you would likely see a lot of artifacts while image processing is being performed. And if you lock the bitmap against read access, the drawing code will not be able to refresh that bitmap to the screen, causing flicker/delay.
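As a toy model of that sync-on-unlock behaviour (the names are invented, and in reality the two copies live in separate address spaces and are shuttled over ports rather than memcpy'd):

    #include <pthread.h>
    #include <stdbool.h>
    #include <string.h>

    #define BITMAP_BYTES (640 * 480 * 4)

    struct shared_bitmap {
        pthread_mutex_t lock;
        bool dirty;                              /* app changed pixels since last sync */
        unsigned char pixels[BITMAP_BYTES];      /* app-side working copy              */
        unsigned char server_copy[BITMAP_BYTES]; /* stand-in for app_server's copy     */
    };

    /* The server's view only catches up at unlock time, which is why a
     * "live" view of the bitmap shows artifacts mid-edit. */
    static void app_unlock(struct shared_bitmap *bmp)
    {
        if (bmp->dirty) {
            memcpy(bmp->server_copy, bmp->pixels, BITMAP_BYTES);
            bmp->dirty = false;
        }
        pthread_mutex_unlock(&bmp->lock);
    }

    static void app_edit(struct shared_bitmap *bmp)
    {
        pthread_mutex_lock(&bmp->lock);
        /* ... image processing on bmp->pixels ... */
        bmp->dirty = true;
        app_unlock(bmp);
    }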
Oh well.. before I run too much off topic…
Hope the ring info answers your questions.. or inspires you to do more research 🙂
–The loon
Virtualization technologies like VMware that have to run on hardware without VT-x (or Vanderpool, or whatever they are calling it these days) have to use ring 1 to load the guest OS, while the host still runs in ring 0 as normal and all user-mode code, guest or host, runs in ring 3. It is a hack of a solution. With VT-x and friends, there's a sort of ring -1 that the hypervisor/virtualization software can run in, and then all the guests, as well as the host, run as normal in ring 0, with user-mode code in ring 3.
Ring 2 is not meant to be almost like ring 3. As far as paging is concerned, it's basically another ring 0: page-level protections apply to ring 2 in the same way that they apply to rings 1 and 0. In fact, it's only ring 3 that's truly different from the other rings.
I see, thanks for the clarification!
I guess I was hoping for a more useful configuration ;-)… or I've been reading too many tech docs and confusing their intended audiences.
However, if rings 1 and 2 are basically identical, it seems fairly useless for ring 2 to exist at all. I'd expect more usage if the permissions in ring 2 were somewhat user-programmable, basically setting it up as a go-between, where certain instructions could be used to write data into specially flagged ring-3 memory, and others could provide read-only access to flagged ring-0/1 memory, with write access as a possible grant right as well.
Ring 2 would not be able to create or modify interrupts or communicate directly with hardware; however, it should be able to hook onto an active interrupt to register an 'observer'. In this way, a ring-2 application could host lesser-privileged networking, graphics, and sound subsystems, among others.
The kernel-mode driver would merely expose read, write, and control buffers for each device, based on what should be accessible from/to ring 2, while otherwise providing a basic driver interface.
The ring-2 service would register an observer on an interrupt; the driver would be the first to handle the interrupt, and afterwards the ring-2 observer would be executed, look at what the driver did, and react (e.g. copy directly from the kernel's buffer to a user application's data buffer, or merely provide a notification of a keyboard state change).
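To make that concrete, here's a toy sketch of the observer idea (all names are invented; nothing here is a real driver API): the ring-0 driver appends records to a shared ring, and the less-privileged service drains it after the real handler has run.

    #include <stdint.h>

    #define RING_SLOTS 64

    struct event_record {
        uint32_t device_id;
        uint32_t event_type;        /* e.g. "key state changed", "packet in" */
        uint64_t payload_offset;    /* where in the exposed buffer the data sits */
    };

    struct event_ring {
        volatile uint32_t head;     /* advanced by the ring-0 driver (producer)   */
        uint32_t tail;              /* advanced by the ring-2 observer (consumer) */
        struct event_record slot[RING_SLOTS];
    };

    /* Ring-2 "observer": runs after the real interrupt handler, looks at
     * what the driver recorded, and reacts without touching the device. */
    static void drain_events(struct event_ring *ring,
                             void (*react)(const struct event_record *))
    {
        while (ring->tail != ring->head) {
            react(&ring->slot[ring->tail % RING_SLOTS]);
            ring->tail++;
        }
    }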
Naturally, I'm thinking nano/microkernel here, à la BeOS. It would certainly allow Microsoft to slim Windows down for when they finally ditch forcefully integrated backwards compatibility in favor of a virtualization service.
I especially like the idea of direct kernel-to-user data copies for networking and real-time multimedia. There's nothing like waking up in the morning and realizing that the reason your networking is 60% slower in one OS than in another has nothing to do with the driver, but with useless memory copies made merely to satisfy an API that must carefully protect itself for fear of being hijacked.
I mean, deep down, there are too many networking stacks that first copy from the device to the stack/heap, then copy this out to the kernel, which then copies it to a "port", which then copies it into a buffer, which then copies it into a user-mode buffer, which then finally gets a chance to do something with it, most often copy it to disk!!
I mean, driver->user is the shortest route here..
Not to mention a direct-to-disk (FS API, anyway) copy 🙂
Just my opinion.
Hmm.. thanks again!
–The loon
P.S. Don’t worry, my fingers will be all bloody and sore from typing at some point… or one of my alter-egos will slap me back to a different reality… hrm…
Of course, a ring-2 service of this type would be best served by providing the memory buffers into which the driver could write directly, skipping the driver's own local buffers. This would depend on the device, naturally.
okay.. I’m going now…
You can actually have interrupts delivered to any of the protection rings. But in practice, the OS configures things so that interrupts are serviced by routines running in ring 0.
I highly recommend reading Intel’s IA-32 documentation. They pretty much explain how you could do any of the things you suggest.
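For the curious, the knob involved here is the interrupt gate descriptor: its DPL decides who may invoke the gate with a software int, and the code-segment selector decides which ring the handler itself runs in. A sketch of the 32-bit layout (see the Intel docs for the authoritative version):

    #include <stdint.h>

    /* 32-bit IA-32 interrupt gate descriptor. */
    struct idt_gate {
        uint16_t offset_low;    /* handler address, bits 0-15 */
        uint16_t selector;      /* code-segment selector; picking a ring-0
                                   segment is what makes handlers run in ring 0 */
        uint8_t  zero;          /* reserved */
        uint8_t  type_attr;     /* P | DPL (2 bits) | 0 | gate type */
        uint16_t offset_high;   /* handler address, bits 16-31 */
    } __attribute__((packed));

    /* Present, requested DPL, 32-bit interrupt gate (type 0xE). */
    static uint8_t gate_attr(unsigned dpl)
    {
        return (uint8_t)(0x80 | ((dpl & 0x3u) << 5) | 0x0E);
    }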
OS/2 used Ring 2.
According to Wikipedia, OS/2 used ring 0 for kernel code and device drivers, ring 2 for privileged code (user programs with I/O access permissions), and ring 3 for unprivileged code (nearly all user programs).
This is verified in many other places, including the EDM/2 programming web magazine and various articles on OS/2 debugging, and it is one of the reasons why various virtual machines have issues booting/running OS/2.
Yeah, and Win 3.0’s DPMI ran in ring 1, apparently. Weird stuff. 😉