“Both of these articles allude to the fact that I’m working on putting the D-Bus protocol into the kernel, in order to help achieve these larger goals of proper IPC for applications. And I’d like to confirm that yes, this is true, but it’s not going to be D-Bus like you know it today. Our goal (and I use ‘goal’ in a very rough term, I have 8 pages of scribbled notes describing what we want to try to implement here), is to provide a reliable multicast and point-to-point messaging system for the kernel, that will work quickly and securely. On top of this kernel feature, we will try to provide a ‘libdbus’ interface that allows existing D-Bus users to work without ever knowing the D-Bus daemon was replaced on their system.”
A real pub-sub multicast socket implementation in the kernel. This has been at the top of my feature wish list for a long time. Obviously D-Bus will be the first client, but libevent would be next, and the Wayland devs are already beginning to rethink their IPC protocol.
In the beginning, there was select and poll. Then there was epoll. But their days may be numbered. It should be obvious that the socket is the one true UNIX abstraction for IPC. This is the way it was always meant to be. We fork a bunch of small processes, each of which does one thing well, and hook them all together with read() and write().
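For the record, that classic composition is as simple as it sounds. Here's a minimal C sketch (the message is a placeholder and error handling is pared to the bare minimum): one pipe, one fork(), and the two halves talk over plain read() and write().

#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    char buf[64];

    if (pipe(fds) == -1)
        return 1;

    if (fork() == 0) {                /* child: does one thing well... */
        close(fds[0]);
        const char *msg = "hello from the child";
        write(fds[1], msg, strlen(msg) + 1);
        _exit(0);
    }

    close(fds[1]);                    /* ...parent reads whatever it produced */
    if (read(fds[0], buf, sizeof(buf)) > 0)
        printf("parent read: %s\n", buf);
    wait(NULL);
    return 0;
}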
But we need more flexibility in how we compose networks of communicating processes. We can’t do everything with pipelines and the client-server pattern. Even a microkernel purist would agree that IPC should be provided by a point-to-point message bus in the kernel. Anything else we might want can be built on top.
butters,
Well, first of all let me state that I do share your enthusiasm for moving this sort of IPC into the kernel.
“In the beginning, there was select and poll. Then there was epoll. But their days may be numbered. It should be obvious that the socket is the one true UNIX abstraction for IPC.”
I’m not sure if you meant exactly what you said here, but select/poll/epoll are complementary mechanisms for using sockets; sockets don’t supersede them. I’ve found that epoll is by far the most efficient way to handle many asynchronous sockets.
“This is the way it was always meant to be. We fork a bunch of small processes, each of which does one thing well, and hook them all together with read() and write().”
It’s true that multi-process blocking IO is the traditional UNIX way, and some people prefer that way of programming. But a big reason for phasing it out is that it scales very inefficiently compared to multi-threaded and async models.
As a popular example, take a look at the scalability differences between a traditional multi-process web server (Apache MPM) and modern asynchronous ones:
http://whisperdale.net/11-nginx-vs-cherokee-vs-apache-vs-lighttpd.h…
I should have said “execution units” instead of processes. Obviously multithreading is an improvement over multiprocessing, and multiplexing coroutines on a non-blocking thread pool is an improvement over multithreading.
But an in-kernel message bus would still ease the implementation and accelerate the performance of a modern concurrent runtime platform such as Go, whose channel type would map nicely to AF_BUS.
butters,
“I should have said ‘execution units’ instead of processes. Obviously multithreading is an improvement over multiprocessing, and multiplexing coroutines on a non-blocking thread pool is an improvement over multithreading.”
Yeah, multithreaded is a large improvement over multiprocess for efficiency. According to the link below, the minimum thread stack size is 16K.
http://stackoverflow.com/questions/12387828/what-is-the-maximum-num…
So there is still a lot of per-client overhead that cannot be eliminated in the blocking thread model. This is why I’m a huge fan of nginx’s concurrency model. If you’re not familiar with it, it uses a number of processes equal to the number of parallel cores, and on top of that each process uses an asynchronous mechanism like epoll. This means it gets full concurrency across CPUs and handles each client with asynchronous IO. Each client consumes only as much CPU and memory as it actually needs, without the overhead of the synchronization primitives that threads require.
I’m so happy with this model that I try to encourage others to adopt it, but oftentimes implementations compromise the model (especially by using blocking file-IO; Linux doesn’t even support asynchronous non-blocking file-IO). So an application which makes heavy use of uncached file-IO will probably do better with more threads, to prevent clients from blocking each other and the async loop going idle.
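To make the shape of that model concrete, here’s a rough C sketch, with error handling elided, the per-client protocol stubbed out, and port 8080 picked arbitrarily: a fixed set of worker processes, one per core, each multiplexing all of its clients with its own epoll loop.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_EVENTS 64

static void worker(int listen_fd)
{
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
    struct epoll_event events[MAX_EVENTS];
    int epfd = epoll_create1(0);

    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

    for (;;) {
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++) {
            if (events[i].data.fd == listen_fd) {
                int client = accept(listen_fd, NULL, NULL);
                struct epoll_event cev = { .events = EPOLLIN, .data.fd = client };
                epoll_ctl(epfd, EPOLL_CTL_ADD, client, &cev);
            } else {
                /* non-blocking read/write on the ready client goes here */
            }
        }
    }
}

int main(void)
{
    long ncpu = sysconf(_SC_NPROCESSORS_ONLN);      /* workers == cores */
    int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port = htons(8080) };

    bind(listen_fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(listen_fd, SOMAXCONN);

    for (long i = 0; i < ncpu; i++)                 /* nginx-style pre-fork */
        if (fork() == 0) {
            worker(listen_fd);
            _exit(0);
        }

    for (;;)                                        /* master just idles */
        pause();
}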
Anyways, my personal programming philosophy aside…
“But an in-kernel message bus would still ease the implementation and accelerate the performance of a modern concurrent runtime platform such as Go, whose channel type would map nicely to AF_BUS.”
I’m afraid I haven’t learned much about Go yet, despite a suggestion that I should. I am interested, though: how would Go incorporate a kernel bus into a language feature that ordinary Go programs would use?
In Go, goroutines are the fundamental execution units, which are scheduled on a thread pool of an appropriate size for the hardware, much like nginx. Goroutines interact using channels which exchange values of a type.
Channels may be bidirectional or unidirectional, and they may be buffered to accept a particular number of values before sends block; receives block until a value is available.
When a goroutine is about to block, the runtime schedules any other unblocked goroutine on that thread rather than blocking the entire thread. When the I/O request is complete, the goroutine is unblocked and reinserted into the runqueue of goroutines.
Since you’re already familiar with nginx, that should be enough to get the gist of goroutines and how the runtime might use AF_BUS to implement channels.
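And for the C programmers following along, here’s a loose analogy rather than the actual Go runtime mechanics (the struct and the values are made up for illustration): a socketpair used as a crude channel of typed messages between two processes, which hints at why a datagram-style kernel socket like the proposed AF_BUS would map well onto channel semantics.

#include <stdio.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

struct msg { int seq; double value; };     /* the "element type" of the channel */

int main(void)
{
    int ch[2];                             /* ch[0] receives, ch[1] sends */

    socketpair(AF_UNIX, SOCK_SEQPACKET, 0, ch);

    if (fork() == 0) {                     /* the "sender goroutine" */
        for (int i = 0; i < 3; i++) {
            struct msg m = { .seq = i, .value = i * 1.5 };
            send(ch[1], &m, sizeof(m), 0); /* ch <- m */
        }
        _exit(0);
    }

    struct msg m;                          /* the "receiver goroutine" */
    for (int i = 0; i < 3; i++) {
        recv(ch[0], &m, sizeof(m), 0);     /* m := <-ch */
        printf("received seq=%d value=%g\n", m.seq, m.value);
    }
    wait(NULL);
    return 0;
}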
Be all the fan you want, but you have to see its limitations. It’s great for stateless retrieval protocols. For everything else, it depends on the given case.
JAlexoid,
“Be all the fan you want, but you have to see its limitations. It’s great for stateless retrieval protocols. For everything else, it depends on the given case.”
What would give you the impression that nonblocking async is poor when it comes to stateful IO? I’ll give you that the models are rather different beasts when it comes to programming them. Windows programmers traditionally used async, UNIX ones traditionally used blocking IO, so maybe there’s a bit of religious animosity between the models. It’s probably true that many programmers prefer dealing with sequential instructions over an event/callback oriented state machine, but the two models really are equally expressive.
Quite a common mistake. Comparing async performance to performance of Apache. Those don’t highlight the differences in scalability of non-blocking IO, but demonstrate that Apache is big.
JAlexoid,
“Quite a common mistake. Comparing async performance to performance of Apache. Those don’t highlight the differences in scalability of non-blocking IO, but demonstrate that Apache is big.”
I’ve done socket benchmarks myself in the past; Apache was just an example, so feel free to substitute whatever example you like. Blocking socket IO implies using either multi-process or multi-threaded models, and I hope we can agree that the forking multi-process model is the least scalable.
There’s nothing wrong with the multi-threaded blocking model, and I don’t criticize anyone for using it. However, it does imply more memory/CPU overhead than the non-blocking model, due to each client using its own stack and the requirement of synchronization primitives that are usually unnecessary with the async model. Additionally, the multithreaded solution does not increase concurrency over an asynchronous solution when the number of async handlers equals the number of cores, so right there you’ve got all the ingredients for async to come out ahead.
Mind you I think the difference is negligible when the number of concurrent users is low, but async really shines with many concurrent connections (10K+).
There is no form of IO that will imply high scalability without multi-threading or multiple processes. (nginx, lighttpd and node.js use multiple processes to scale.)
What gave you the impression that:
A) threads are expensive in memory or CPU. They are quite cheap these days.
B) synchronization of blocking IO isn’t replaced with something similar in nonblocking IO. (It always is)
JAlexoid,
“There is no form of IO that will imply high scalability without multi-threading or multiple processes. (nginx, lighttpd and node.js use multiple processes to scale.)”
That’s not exactly true, or it’s misleading at best. The reason nginx uses multiple processes is simply to distribute the load across all cores, but this is a *fixed* number of processes with a *fixed* overhead. All further scaling under the non-blocking asynchronous model is achieved without ANY more processes or threads REGARDLESS of the size of the workload.
With a blocking model, each client requires another thread/process, which adds up to gross inefficiencies at the upper end of the scale.
“What gave you the impression that:
A) threads are expensive in memory or CPU. They are quite cheap these days.”
Well, if it means the difference between ~8-32KiB per client thread versus a couple hundred bytes for the client data structure used by the asynchronous approach, I’d say that’s a significant difference both in theory and in practice.
“B) synchronization of blocking IO isn’t replaced with something similar in nonblocking IO. (It always is)”
I’m not sure what you mean here, but I was referring to multi-threaded programming needing to implement synchronization around data objects shared between threads (for client accounting or shared caches or whatever). With the async model there’s usually no need to grab a mutex, because each client is handled within one thread.
Your responses give me the impression you’ve never written an epoll-based handler, am I right?
See the epoll documentation:
http://linux.die.net/man/2/epoll_wait
They’ve done a number of things to make the interface highly efficient. For one, we can retrieve hundreds of events from the kernel in one fell swoop. For another, unlike the earlier select/poll syscalls, we don’t have to re-specify the set of monitored sockets on each call. It may not be for everyone, but I swear by it.
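As a tiny but complete illustration, here’s the pattern boiled down to watching a single descriptor (stdin, purely for demonstration):

#include <sys/epoll.h>
#include <unistd.h>

#define MAX_EVENTS 64          /* one epoll_wait() can return up to this many */

int main(void)
{
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = STDIN_FILENO };
    struct epoll_event events[MAX_EVENTS];
    int epfd = epoll_create1(0);

    /* Unlike select()/poll(), the interest set is registered once... */
    epoll_ctl(epfd, EPOLL_CTL_ADD, STDIN_FILENO, &ev);

    for (;;) {
        /* ...and each wakeup returns only the descriptors that are
         * actually ready, potentially hundreds in one swoop. */
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++) {
            char buf[256];
            ssize_t len = read(events[i].data.fd, buf, sizeof(buf));
            if (len <= 0)
                return 0;
            write(STDOUT_FILENO, buf, len);
        }
    }
}

With many real sockets you’d register each one the same way and let the single loop service them all.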
Since you seem not to believe me about the better efficiency of the non-blocking/async model, I challenge you to find an example where a multi-threaded / multi-process solution performs as well as or better than the asynchronous model at very large scales.
They are moving D-Bus into the kernel, and the virtual terminals into userspace (http://linux.slashdot.org/story/13/02/08/159218/moving-the-linux-ke…).
Are we looking here at a very careful attempt to slowly turn the Linux kernel into a microkernel?
Nope
It’s just adding features that are needed by modern computers.
D-Bus made huge inroads into Linux because of its value: its high-level sockets abstraction.
AF_BUS is similar to D-Bus but already lives in a kernel tree (3.4 LTSI), though it’s mostly for the automotive industry. Something more general is being cooked up. And that is good.
Not as long as Linus has anything to say about Linux development, as he is opposed to micro-kernel architectures.
Well… We all know how that discussion ended last time.
“F**k you, nVidia!” is child’s play.
Really? FUSE and udev were implemented under his watch. Linus is not a very ideological person. He opposed microkernel architecture for pragmatic reasons, mainly because running device drivers in userspace leads to poor interrupt response times.
But as we’ve seen with hybrid kernels like NT and XNU, there is still value to microkernel architecture even if the device drivers and resource abstractions are mapped onto every address space for performance reasons.
If Linux had embraced a more microkernel-like architecture with a kernel message bus, we might have avoided the whole saga of excising the Big Kernel Lock and the BH device drivers which severely hampered the kernel’s ability to scale on SMP hardware.
There is still a stigma in the Linux community about microkernel design aesthetics, probably best demonstrated by the meager adoption of the workqueue interface introduced in the 2.5 series. This allows drivers to queue their work to run on a pool of kernel threads, where they can use sleeping locks and blocking I/O for easy concurrency. But the inclination is to avoid sleeping or blocking and use tasklets instead, even though tasklets are strictly serialized against themselves and do not scale on SMP hardware.
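For anyone who hasn’t touched these interfaces, the contrast looks roughly like this (a fragment in kernel-module context as of that era, not a complete module: init/exit and request_irq are elided, and the handler names are made up):

#include <linux/interrupt.h>
#include <linux/mutex.h>
#include <linux/workqueue.h>

static DEFINE_MUTEX(dev_lock);

/* Workqueue handler: runs on a kernel thread, so it may sleep,
 * and independent work items can run concurrently across CPUs. */
static void my_work_handler(struct work_struct *work)
{
    mutex_lock(&dev_lock);          /* sleeping locks and blocking I/O are fine */
    /* ... talk to the device ... */
    mutex_unlock(&dev_lock);
}
static DECLARE_WORK(my_work, my_work_handler);

/* Tasklet handler: runs in softirq context, must never sleep, and the
 * same tasklet is serialized against itself across all CPUs. */
static void my_tasklet_handler(unsigned long data)
{
    /* spinlocks only; no blocking allowed here */
}
static DECLARE_TASKLET(my_tasklet, my_tasklet_handler, 0);

static irqreturn_t my_irq_handler(int irq, void *dev_id)
{
    schedule_work(&my_work);        /* defer to process context */
    tasklet_schedule(&my_tasklet);  /* or stay down in softirq context */
    return IRQ_HANDLED;
}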
Here we are in 2013, and we’re still so worried about the overhead of switching a stack pointer that almost all of our device drivers are single-threaded, and the few that aren’t (network and scsi) resort to per-cpu data structures and other tricks because they are running in interrupt context and cannot block.
butters,
“If Linux had embraced a more microkernel-like architecture with a kernel message bus, we might have avoided the whole saga of excising the Big Kernel Lock and the BH device drivers which severely hampered the kernel’s ability to scale on SMP hardware.”
I really enjoy dissecting technologies, especially thinking about how they could be made better, but if we don’t tread carefully it often devolves into a religious spat.
“Here we are in 2013, and we’re still so worried about the overhead of switching a stack pointer that almost all of our device drivers are single-threaded, and the few that aren’t (network and scsi) resort to per-cpu data structures and other tricks because they are running in interrupt context and cannot block.”
Do you think it’s really a problem? Why would a single-threaded driver be a problem when the hardware it controls needs to be programmed sequentially anyway? I’m not really sure what kind of hardware would benefit from concurrent threads trying to program it at the same time without a mutex.
There is a helluva lot more to a micro-kernel than just providing IPC. You have to actually start using it for in-kernel stuff. For example, I don’t see D-Bus being used underneath the syscall interface anytime soon (or ever), but once it’s firmly in the kernel it could make sense for a few things where simplicity and flexibility trump performance. If a truly logical argument for such a use case is ever made, I don’t really expect that there would be much resistance, even from Linus. Even then, selective use doesn’t turn Linux into a micro-kernel; it’s just pragmatism.