Arrakis is a research operating system from the University of Washington, built as a fork of Barrelfish.
In Arrakis, we ask whether the OS can be removed entirely from normal application execution. The OS only sets up the execution environment and interacts with an application in the rare cases where resources need to be reallocated or name conflicts need to be resolved. The application gets the full power of the unmediated hardware, through an application-specific library linked into the application's address space. This allows for unprecedented OS customizability, reliability and performance.
The first public version of Arrakis has been released recently, and the code is hosted on GitHub.
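To give a rough idea of what "the full power of the unmediated hardware" through a library can look like, here is a hypothetical, simplified sketch in C. It is not Arrakis' actual API; the ring layout and names (app_nic_ring, app_nic_map) are invented. The idea it illustrates: a kernel-mediated setup step hands the application a mapped device queue once, and the receive path is then driven entirely from user space, without system calls.

/* Hypothetical sketch of the "library OS" model: the control plane hands the
 * application a mapped descriptor ring at startup; after that the data plane
 * is driven entirely from user space, with no system calls on the fast path.
 * Names like app_nic_ring and app_nic_map() are illustrative, not Arrakis APIs. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define RING_SLOTS 4

struct pkt_desc {            /* one receive descriptor */
    uint32_t len;
    uint8_t  ready;          /* set by "hardware", cleared by the app */
    char     data[64];
};

struct app_nic_ring {        /* the queue the library would map from the device */
    struct pkt_desc slot[RING_SLOTS];
    unsigned head;
};

/* Stand-in for the one-time, kernel-mediated setup step. */
static struct app_nic_ring *app_nic_map(void)
{
    static struct app_nic_ring ring;
    /* Simulate a packet the device has already written. */
    ring.slot[0].len = 5;
    memcpy(ring.slot[0].data, "hello", 5);
    ring.slot[0].ready = 1;
    return &ring;
}

int main(void)
{
    struct app_nic_ring *ring = app_nic_map();   /* control plane, once */

    /* Data plane: poll the mapped ring directly, no kernel crossing. */
    struct pkt_desc *d = &ring->slot[ring->head];
    if (d->ready) {
        printf("received %u bytes: %.*s\n", (unsigned)d->len, (int)d->len, d->data);
        d->ready = 0;                            /* hand the descriptor back */
        ring->head = (ring->head + 1) % RING_SLOTS;
    }
    return 0;
}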
Cool, I have been seeing a lot more OS and hobby OS news lately. Much prefer it to smartphone news. Kudos
I agree, the more OS stuff the better; there is really no point in covering smartphone hardware, as it's "here today, gone this afternoon" and by the time the story reaches here it's already been beaten by something better.
As for TFA, my question would be: what is the difference between this and an application-based VM? That is what it sounds like they are going for, a basic hypervisor that only handles basic tasks like I/O and memory while the program gets center stage. This could be sweet for non-Internet-based applications, but anything connected to the net would be at serious risk running "bare metal". That is why browsers and AVs have sandboxing; having separation between the application and the hardware is a good thing if it has access to the net, thanks to bad actors.
I know TFA says the "application will handle it", but if that worked we wouldn't have zero-days. No program is perfect, and having layers at least gives a bad actor multiple blocks to bypass to get to the hardware. If I'm reading this right, the bad guy would only have to bypass the single application to have unrestricted access to the hardware… ohh, that isn't good, not good at all.
bassbeast,
I do not know the answers to your questions. It looks like this is more for scientific research than for serving applications; security may simply be out of scope.
However, speaking hypothetically about the security of this model: when the application is running on bare metal, the raw performance probably increases a little, but it's not obvious to me how bare-metal applications could increase the scope of a breach.
Consider a regular user process on Linux that is successfully hacked: the hacker can access everything that the application could access (databases/files/network/etc). Other unrelated services may be isolated by the OS, but are still potentially at risk even if the OS itself does not become compromised (SMTP daemon, /etc files, unprotected home directories, side-channel attacks, etc).
To be clear, I'm in favor of having OS security as well, but just to play devil's advocate: doesn't running a network-facing application on a multiuser OS _increase_ the vulnerability surface compared to just running a network-facing application with no OS?
Because they would have total, unrestricted access to the bare metal, and thanks to flashable firmware they would be able to embed the malware below even the OS level?
IIRC there was a Black Hat convention demo a couple of years back that did just that: once they pwned the OS, they wrote code into both the BIOS and the firmware of some hardware, maybe wireless, that left them a backdoor capable of loading code. After that they had the malware load BEFORE the OS by using the VM capability of the hardware, so they basically ended up with the OS running as a kind of VM on top of the malware, and even an AV would have no way to detect it.
Luckily it was just a proof of concept, which required knowledge and a zero-day for the browser AND the OS AND the BIOS of that particular board. But with this? All you'd have to do is bypass the Internet browser and you'd own the system, nothing between you and bare metal… BAD idea.
bassbeast,
You are referring to this:
http://www.technologyreview.com/news/428652/a-computer-infection-th…
Ok, fair enough. A solution already exists in the form of attestation (with a TPM), but I don't know if any AV products make use of it. Regardless of whether there's an OS or a bare-metal application, it would be prudent to make use of hardware attestation; without it there's no way to prove that the environment hasn't been compromised.
Well, I'm not really familiar enough with the project to answer these questions. Going by the quote below (from Barrelfish, from which Arrakis was forked), there is some attention to security. However, the Arrakis website is really vague about what is meant by statements like "Applications are becoming so complex that they are miniature operating systems in their own right and are hampered by the existing OS protection model." So I really couldn't tell you whether Arrakis has any additional security risks with regard to BIOS flashing or not.
http://www.barrelfish.org/TN-000-Overview.pdf
The performance numbers are very impressive. The other (different but possibly comparable) operating system is DragonFlyBSD, which posted some really quick database performance stats a little while back.
Unfortunately, although both platforms are benchmarked against Linux, DFBSD uses PostgreSQL whilst Arrakis uses the Redis NoSQL store.
It would be interesting to compare the two over the same tasks.
It would also be interesting to see whether aspects of each could be grafted into the other for even more impressive gains.
It will be most excellent if the outcome of this project is more secure web browsers and PCs in general. I guess we're a long way off, but one can hope.
I like it for the Dune reference.
This looks very interesting. I’ll keep track of this project.
I’ve always been toying with the idea of forking and continuing with Barrelfish. The idea, if you’re not aware, is to have a kernel per CPU core. This allows you to think about your computer in an inherently more distributed sense, pushing computation out over the network or otherwise having your computer “span” devices or even the internet.
Arrakis seems to be moving in this exact direction. Great news.
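If the multikernel idea sounds abstract, here's a toy sketch of the style, with POSIX threads standing in for cores; the channel layout is something I made up, not Barrelfish's actual API. Each "core" keeps purely private state, and the only way another core can influence it is to send a message over an explicit channel.

/* Toy illustration of the multikernel idea: instead of cores sharing kernel
 * data structures, each "core" keeps private state and the only way to affect
 * another core is to send it an explicit message. POSIX threads stand in for
 * cores here; a real multikernel would use interconnect messages. */
#include <pthread.h>
#include <stdio.h>

#define QUEUE_LEN 8

struct channel {                      /* one-way channel between two "cores" */
    int msgs[QUEUE_LEN];
    volatile unsigned head, tail;     /* single producer, single consumer */
};

static void chan_send(struct channel *c, int msg)
{
    while ((c->tail + 1) % QUEUE_LEN == c->head)
        ;                             /* queue full: spin */
    c->msgs[c->tail] = msg;
    __sync_synchronize();             /* publish payload before the index */
    c->tail = (c->tail + 1) % QUEUE_LEN;
}

static int chan_recv(struct channel *c)
{
    while (c->head == c->tail)
        ;                             /* queue empty: spin */
    int msg = c->msgs[c->head];
    __sync_synchronize();
    c->head = (c->head + 1) % QUEUE_LEN;
    return msg;
}

static struct channel to_core1;

static void *core1(void *arg)
{
    (void)arg;
    long local_count = 0;             /* private per-core state, never shared */
    for (int i = 0; i < 3; i++)
        local_count += chan_recv(&to_core1);
    printf("core1 private count: %ld\n", local_count);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, core1, NULL);
    for (int i = 1; i <= 3; i++)
        chan_send(&to_core1, i);      /* core0 asks core1 to update its state */
    pthread_join(t, NULL);
    return 0;
}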
thesunnyk,
I like this idea as well! Not sure if it'd be useful for normal people, but what it offers is kind of an alternative to a VPS, with dedicated resources yet at less than the cost of a dedicated server. This model makes a lot of sense, especially with NUMA systems, which are inherently more scalable than uniform memory access given the overhead of the cache coherency that x86 mandates.
QNX has some support for this, using a network as the "bus" layer.
Other systems have been designed to support a distributed single system image. Limiting the support to kernel design isn't the best way towards that; many other system layers should be adapted to support variable latency and proper handling of link failures.
Do you know that NUMA was first used in systems without x86 processors? Do you realize that much of the work on scalable coherency protocols has been done on RISC systems?
In short: this isn't something x86-specific; it's common to all systems following the von Neumann design.
Megol,
Well, the trouble with this is that NUMA is designed to solve some inherent scalability problems of shared memory systems. And although you can apply some hacks to SMP operating systems to better support NUMA, generic SMP/MT software concepts are flawed by shared memory design patterns that fundamentally cannot scale. In other words, they reach diminishing returns that cannot be overcome by simply adding more silicon.
I'm only vaguely familiar with Barrelfish, but one of its goals is to do away with the design patterns that imply serialization bottlenecks, which are common to conventional operating systems today. In theory all operating systems could do away with the serial bottlenecks too, but not without "limiting the support to kernel design", as you said. Physics is eventually going to force us to adopt variations of this model if we are to continue scaling.
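To make the serialization point concrete, here's a rough, hypothetical sketch in C (the thread and iteration counts are made up, and it's a toy, not a proper benchmark). It compares one shared counter, which every core fights over, with padded per-thread counters that could be combined afterwards; the gap between the two timings is exactly the diminishing-returns effect above.

/* Micro-illustration of why shared-memory hot spots stop scaling: every thread
 * hammering one counter forces the cache line to bounce between cores, while
 * per-thread (padded) counters do purely core-local work. Numbers vary by
 * machine; build with -pthread. */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define THREADS 4
#define ITERS   5000000L

static long shared_counter;                        /* one hot cache line */

struct slot { long count; char pad[56]; };         /* padded: no false sharing */
static struct slot per_thread[THREADS];

static void *contended(void *arg)
{
    (void)arg;
    for (long i = 0; i < ITERS; i++)
        __sync_fetch_and_add(&shared_counter, 1);  /* every add bounces the line */
    return NULL;
}

static void *partitioned(void *arg)
{
    struct slot *s = arg;
    for (long i = 0; i < ITERS; i++)
        s->count++;                                /* core-local, combined later */
    return NULL;
}

static double run(void *(*fn)(void *))
{
    struct timespec a, b;
    pthread_t t[THREADS];
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (int i = 0; i < THREADS; i++)
        pthread_create(&t[i], NULL, fn, &per_thread[i]);
    for (int i = 0; i < THREADS; i++)
        pthread_join(t[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &b);
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void)
{
    printf("shared counter:    %.3fs\n", run(contended));
    printf("per-core counters: %.3fs\n", run(partitioned));
    return 0;
}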
"…Well, the trouble with this is that NUMA is designed to solve some inherent scalability problems of shared memory systems. And although you can apply some hacks to SMP operating systems to better support NUMA, generic SMP/MT software concepts are flawed by shared memory design patterns that fundamentally cannot scale. In other words, they reach diminishing returns that cannot be overcome by simply adding more silicon…"
And this is exactly why the largest SMP servers on the market have 32 sockets, like the IBM P795 Unix server. Some IBM mainframes have 32 sockets as well. Fujitsu even has a 64-socket Solaris server, the M10-4S. So the largest SMP servers on the market have 32 sockets, and a few (only one?) have 64. Sure, these are not really true SMP servers; they have some NUMA characteristics as well. But in effect they behave like true SMP servers: for example, look at the bottom picture and you will see that each CPU is connected to any other CPU in at most 2-3 hops, which is essentially like a true SMP server:
http://www.theregister.co.uk/2013/08/28/oracle_sparc_m6_bixby_inter…
OTOH, a NUMA cluster (all NUMA servers are clusters):
http://en.wikipedia.org/wiki/Non-uniform_memory_access#NUMA_vs._clu…
like the SGI Altix or UV2000 server, or the ScaleMP server; both have 100,000 cores and hundreds of TB of RAM. All of these servers have awfully bad latency when trying to reach CPUs far away, the true hallmark of a NUMA cluster. These Linux clusters cannot be used to run SMP workloads; they are only fit for parallel HPC workloads. Typical SMP workloads are large enterprise business systems, databases in large configurations, etc., where each CPU needs to talk to the others frequently, so there is a lot of traffic and communication. HPC workloads run on separate nodes with little communication. All servers on the market with more than 32 sockets are HPC clusters, like the Linux servers with tens of thousands or even 100,000 cores; they are all clusters, not a single fat SMP server among them:
Regarding the huge ScaleMP Linux server with thousands of cores and terabytes of RAM: yes, it is a cluster that is tricked into believing it is a single huge fat SMP server running a single-image Linux kernel. It cannot run SMP workloads, only the easier HPC number crunching:
http://www.theregister.co.uk/2011/09/20/scalemp_supports_amd_optero…
And regarding the SGI Altix and UV1000 Linux servers with thousands of cores and gobs of RAM: they are also HPC number-crunching servers, not used for SMP workloads, because they do not scale well enough to handle such difficult workloads. SGI says explicitly that their Linux servers are for HPC only, and not for SMP.
http://www.realworldtech.com/sgi-interview/6/
Kebabbert,
"Like" being the key word. We can't just take a "true SMP" shared memory application and scale it up to 32 or 64 sockets *efficiently*. That won't stop us from trying, of course, but there are significant bottlenecks in doing so. The best you can do (in terms of performance) is to apply CPU affinity to avoid excessive inter-core traffic and then run more independent processes as though they were in a real cluster.
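For what it's worth, here's roughly what I mean by applying CPU affinity. It's just a minimal Linux-specific sketch (the core number 2 is an arbitrary example), pinning the calling process to one core so its working set stays local; run one pinned process per core or socket and you've basically rebuilt a cluster on a single box.

/* Pin the calling process to a single core with the Linux affinity API,
 * so its working set stays in that core's cache and cross-socket traffic
 * is avoided. Core 2 is just an example; error handling is kept short. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);                        /* allow this process on core 2 only */

    if (sched_setaffinity(0, sizeof(set), &set) != 0) {   /* 0 = calling process */
        perror("sched_setaffinity");
        return 1;
    }
    printf("now running on core %d\n", sched_getcpu());
    return 0;
}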
That's an impressive beast! I could not find good benchmarks though. I'd wager a guess that it performs poorly (in relation to its outlandishly high specs) on IPC/synchronization workloads across sockets.
I agree with everything in this paragraph. To use NUMA effectively, you have to break the constructs, including syscalls, that assume the normal SMP semantics. Once you do that, it begins to look more and more like a cluster instead of SMP. This is the direction I think Barrelfish is aimed at.
These are very interesting links, thank you for posting them.
I think Linux should do a better job of using remote/clustered resources, like executing processes remotely and transparently as though the remote CPU were a virtual CPU on the local machine, or migrating live processes to remote systems (similar to VMware, but at a per-process level).
I've been toying around with virtual RAM using physical RAM from remote systems. It's a racy hack, but adding swap files on network block devices does the trick without custom kernel drivers. I also want to try making a homebrew "battery-backed disk cache" with bcache (http://bcache.evilpiepirate.org/), using an ordinary server with several gigabytes of RAM and a UPS over a gigabit network, to see how well that performs. This is not without the risk of new failure modes of course, but I thought it'd be fun to have a computer with zero disk latency even across reboots. Of course, now that SSDs are getting common, it'd be mostly a novelty.
Hi everyone, I'm a long-time OSNews reader, but this is my first post.
Hey Alfman,
Are you familiar with the (now discontinued) Mosix/OpenMosix projects?
https://en.wikipedia.org/wiki/OpenMosix
There was supposed to be a continuation in the form of LinuxPMI:
https://en.wikipedia.org/wiki/LinuxPMI
But it doesn't seem to have a lot of momentum, which is unfortunate as it seems to be really cool.
I'm not a coder, let alone a kernel programmer, but I thought it might interest you and other OSNews readers.
Cheers.
richarson,
Very interesting. Wikipedia says:
“On July 15, 2007, Bar announced that the openMOSIX project would reach its end of life on March 1, 2008, due to the decreasing need for SSI clustering as low-cost multi-core processors increase in availability.”
This sounds like an odd reason to me. While the multi-core "revolution" did take place (even temporarily offsetting the need for clusters), the need for transparently clustered nodes still exists. Right now this functionality needs to be pushed into individual applications (a la MySQL's enterprise clustering), where each such application has to support its own clustering protocols in its own proprietary way. It would be awesome to have something built into the kernel for clustering, in the same way that we've built KVM into the kernel for virtual machines.
It's too bad this didn't make it into mainline; many worthy projects never do, though. It sounds like openMosix's investor lost interest in funding it. My own Linux distro ended up in the same spot around the same time. I enjoyed working on it, but it wasn't profitable and I had no sponsors.
Wow, I can see why you thought of this when you read my post! Some time ago I was looking at similar (yet different) projects for migrating processes between systems using process checkpointing.
http://readtiger.com/wkp/en/Application_checkpointing
I might try to give LinuxPMI a spin since it looks like exactly what I was thinking! Honestly, though, I've lost a lot of motivation for working with kernel patches; I try to stick to mainline even when it means losing access to features like AUFS.
Can’t really figure it out from their description. It’s just a bunch of libraries for an application to use, but the app gets access to the underlying hardware and is free to mess it up?
Well, I guess I understand but can’t see the reason, except for very specialized environments.
From what I have read, I kind of agree that this sort of looks like MS-DOS or CP/M, but taking advantage of hardware virtualisation… It's certainly a fun experiment (a bit like Microsoft's Singularity was fun to watch develop), but I have difficulty seeing how it has any advantages.
The main role of the OS is to multiplex resources and events… You still need this function, and I can’t see where this project does that efficiently.
I would like to see more uKernel OSes appear.