Linux: the Really Fair Scheduler

Submitted by lomax 2007-08-31 Linux 40 Comments

During the many threads discussing Ingo Molnar’s recently merged Completely Fair Scheduler, Roman Zippel has repeatedly questioned the complexity of the new process scheduler. In a recent posting to the Linux Kernel mailing list he offered a simpler scheduler named the ‘Really Fair Scheduler’ saying, “as I already tried to explain previously CFS has a considerable algorithmic and computational complexity. This patch should now make it clearer, why I could so easily skip over Ingo’s long explanation of all the tricks CFS uses to keep the computational overhead low – I simply don’t need them.”

About The Author

Thom Holwerda

Follow me on Mastodon @thomholwerda@exquisite.social

40 Comments

2007-08-31 9:34 am
martinus
May the best technology win!
2007-08-31 9:37 am
egil
Yes, because pluggable schedulers is too much to ask for…

2007-08-31 11:39 am
bbarker
I’d also like to know why pluggable schedulers aren’t an option, any lkml readers have any clue?

2007-08-31 11:55 am
renox
Because Linus doesn’t want to see tens of schedulers with magical recipes such as “if you’re doing this use scheduler A, and if you’re doing that use scheduler B..”.
2007-08-31 11:56 am
fernape
As with other issues, it’s just because Linus doesn’t want/like. He prefered a good scheduler with an acceptable behaviour in all cases instead of different schedulers. Sorry but I can’t search for the link now, but I read this in kerneltrap.org during the Kolivas vs Molnar thread.
In my opinion this is quite debatable. I can’t picture a scheduler that behaves properly in an IBM machine with 128 processors, running cpu extensive calculations (databases, calc processes) and at the same time in a laptop running OpenOffice, Amarok, Mplayer and Pidgin.
2007-08-31 11:57 am
lucke
Linus repeatedly stated that he is strongly against plugsched – IIRC, he prefers one scheduler that is fairly good at all workloads than lots of schedulers tuned to specific workloads.

2007-08-31 2:11 pm
diegocg
If it was only Linus – most of the kernel hackers seem to agree with him. And if this scheduler is better than Ingo’s why don’t just replace it….
Edited 2007-08-31 14:13

2007-08-31 1:17 pm
Anonumous
http://kerneltrap.org/node/14019
From Linus:
This is one approach, but it’s actually one that I personally think is often the worst possible choice.
Why? Because it ends up meaning that you never get the cross-pollination from different approaches (they stay separate “modes”), and it’s also usually really bad for users in that it forces the user to make some articular choice that the user is usually not even aware of.

2007-08-31 1:45 pm
sbergman27
I tend to agree with Linus’ tastes on these matters. I have a bunch of users on Gnome desktops running XDMCP, plus a database server and webserver. Plus about a hundred instances of an ncurses based point of sale and accounting package. Which scheduler should I use?
The default process scheduler should have good performance for a very wide range of tasks. Ditto for the i/o scheduler. Pluggable scheduling is a band-aid and a cop out. I would rather that the default Linux scheduler be “beautiful”, in a technical sense.
And the issue of a “non-interactive” process doing work on behalf of another, interactive, process, otherwise known as the X11 problem, needs to be solved once and for all. And while I’m at it, it would be nice to see the SCHED_IDLE dilemma finally resolved, as well. Oh… and a pony! 😉
Edited 2007-08-31 14:01

2007-08-31 2:08 pm
dylansmrjones
Personally I don’t see the problem in having specialized schedulers as long as there is an all-round scheduler (as default). This way both camps get their wishes fulfilled.
2007-08-31 2:18 pm
sbergman27
And I would agree with that. But the really good all around scheduler needs to be in place first. And *then* the “Mostly Fair, Rotating, Porn Viewing While Doing Word Processing And Hosting A Quake Server, Spiral Staircase” scheduler can go in as a module.
Edited 2007-08-31 14:22
2007-08-31 4:25 pm
SReilly
Well said. It seems to me a one size fits all scheduler would be the better approach, just for the sake of simplicity.
As for wanting a pony, so does my girlfriend! ;-p
2007-08-31 6:15 pm
sbergman27
“””
As for wanting a pony, so does my girlfriend! ;-p
“””
Hey, this is really off-topic. But I just noticed something, which struck me in that way that odd things just strike one, sometimes. Your user page says you live in Luxembourg. I live in Oklahoma City, Oklahoma, in the US.
Now, mind you, I’ve been used to the global nature of the internet for about 12 years now.
But isn’t it just absolutely cool, when you think about it, that I can make an off-hand remark, and you can respond about your girlfriend’s fondness for cute little ponies… casually… and from so many thousands of miles away?
Sometimes one has to stop and take note of one’s blessings; Of the easily ignored flowers along the way.
I was born in 1963. And the world was nothing like this back then. 🙂
Edited 2007-08-31 18:16
2007-09-01 12:48 pm
SReilly
I think it’s a pretty dam cool thing myself.
It’s nice to be able to talk to people cheaply, with or without similar interests, from all over the world even though, traditionally there would be costly infrastructure barriers to surpass.
Luxembourg has a very transient expatriate community and although I have spend most of my life here, many of my friends and family either live abroad or are traveling around. Being able to stay in contact, by either blog, social networking site or even good old email, gives us all the ability to keep up our friendship and share our experiences. It’s also great for organizing gettogethers.
Having been born in ’77, I still remember the ‘dark days’ 😉
P.S. My girlfriend is really greedy! She already has a pony called ‘Touchwood’.

2007-08-31 2:10 pm
diegocg
Why would you use a different scheduler for a patch that apparently it’s just an improvement to CFS? It’s not like he’s proposing a completely new scheduler.
2007-08-31 9:05 pm
butters
Yes, because pluggable schedulers is too much to ask for…
CFS was in fact merged with a modular scheduler framework. It’s not an either-or kind of framework, though. It allows scheduler modules to focus on a subset of the tasks on the system. For example, the current implementation provides a scheduler module just for realtime tasks. These modules execute in series. If the realtime module doesn’t select a task, CFS picks one.
However, since the framework defines a clean API for scheduler modules, it is now very easy for developers to ship alternative schedulers that are stable across kernel versions. These schedulers would be compatible with the father of all plugging utilities, patch. Just drop in the new scheduler as a separate source file and change a single pointer in sched.c.
Although the arguments about scheduler proliferation and cross-pollination are extremely valid, the ultimate argument is this: why would anyone want to change schedulers on a live system? This seems like a useless and potentially dangerous capability. Patch, build, reboot. Behold, scheduler changed.

2007-08-31 11:13 am
Oliver
May the best politician win :o)
2007-08-31 12:22 pm
Luminair
Ingo’s reply is here: http://lkml.org/lkml/2007/8/31/97
Basically he politely shoots down the dude’s work by saying that someone has already done the same stuff.

2007-08-31 2:20 pm
terog
Roman Zippel’s reply:
http://lkml.org/lkml/2007/8/31/126
This one is actually more interesting to read…
Edited 2007-08-31 14:21

2007-08-31 3:41 pm
sbergman27
I’m not an Ingo detractor. Not by a long shot. I’m an Ingo supporter. And I’m pretty sure that the Linux kernel would not be what it is today without his excellent work.
But is he not getting a bit cocky and dismissive of other people’s work? I gave him the benefit of the doubt regarding the Con Kolivas conflict.
Using the phrases “please” and “thank you” liberally is not a replacement for treating others with respect. I’m beginning to get the impression that Ingo is confusing politeness with respect.
I’m wondering if Ingo doesn’t have a “slap up ‘side the head” coming from Linus. Linus doesn’t have a lot of patience with ego problems.

2007-08-31 3:49 pm
renox
I don’t know: I’m surely not an Ingo supporter, but if I understood correctly Ingo’s reply, Roman Zippel benchmarked his scheduler against an “old” version of Ingo’s scheduler.
So if Roman Zippel wants to be taken seriously, he should benmark it against the current version..

2007-08-31 4:14 pm
hechacker1
Actually, Zippel used the latest 2.6.23.* kernel while benchmarking. The math improvements Ingo is talking about have not been integrated into CFS. Those improvements are due in 2.6.24 (and btw they are not by Ingo).
Ingo is once again being defensive of his work and ignoring/downplaying the work of the competition. Zippel’s response seems oddly familiar to Con’s… Zippel realizes that CFS is riddled with built in complexity (something I have argued about on OS news for a while). SD and now the RFS are trying to remove the unneeded complexity.
Has anybody tried the latest CFS? It now even has MORE tuneables than the last time I reported on the issue. To me tuneables suggest unneeded algorithmic complexity. At the very least it means the scheduler isn’t fair by default and needs to be tuned to be fair. Once again I remind you that while SD was not perfect, it only had 1 tuneable in /proc/sys/kernel while still being pretty damn close to a “perfectly fair scheduler.” In the latest CFS I don’t even know what the tuneable’s do since Ingo doesn’t document them (and his design document is out of date).
But AFAIK Zippel hasn’t removed the complexity in the tuneables, but removed some of the complexity and rounding errors within the math of CFS. I’ll have to try this later.
2007-08-31 4:18 pm
sbergman27
How old can it be? 😉
It was written in a sort of marathon coding binge a few months ago… without Ingo having told anyone what he was doing. Which is why I find this statement from Ingo to be a bit on the hypocritical side:
“””
In terms of merging
your stuff, your patch looks a bit large and because you did not tell us
that you were coding in this area, you probably missed…
“””
It looks to me like Ingo might be playing “King Of The Mountain”. And he’s the King.
I’m sure it will all get sorted out in the end, though.
Edited 2007-08-31 16:19
2007-08-31 4:37 pm
SReilly
Yeah, that sounds a tad hypocritical alright.
On the other hand, Roman seems to be acting a bit like a jerk. I think his reply was way too pissed off considering that Ingo wasn’t being nasty. Sure, if he thinks Ingo misunderstood what his patches do, then point it out. But flying off the handle didn’t do Con any good either.
2007-08-31 4:44 pm
sbergman27
“””
Yeah, that sounds a tad hypocritical alright.
On the other hand, Roman seems to be acting a bit like a jerk.
“””
If they were dogs, I’d recommend neutering. 😉
2007-08-31 4:52 pm
SReilly
If they were dogs, I’d recommend neutering. 😉
Lol! Brilliant solution!
I haven’t laught this much in a long time. Man, my sides hurt!

2007-08-31 6:16 pm
baadger
Ever notice that Ingo Molar is always extremely polite and non-accusive?
Statements like “Did you even try to understand what I wrote?” and accusing someone else in a discussion of making “wild claims” (or any “claim” for that matter) is not a good way to look like you’re not just getting stroppy and demanding.
Edited 2007-08-31 18:19

2007-08-31 4:45 pm
meianoite
Basically he politely shoots down the dude’s work by saying that someone has already done the same stuff.
Great, Mr. Molnár, talk all you want about “someone has already done the same stuff”… How about looking at your own work?
Here’s part of Ingo Molnár’s own description of CFS (2007)(1):
“The rq->fair_clock value tracks the ‘CPU time a runnable task would have fairly gotten, had it been runnable during that time’. So by using rq->fair_clock values we can accurately timestamp and measure the ‘expected CPU time’ a task should have gotten. All runnable tasks are sorted in the rbtree by the “rq->fair_clock – p->wait_runtime” key, and CFS picks the ‘leftmost’ task and sticks to it. As the system progresses forwards, newly woken tasks are put into the tree more and more to the right – slowly but surely giving a chance for every task to become the ‘leftmost task’ and thus get on the CPU within a deterministic amount of time.”
Now, here’s part of the description of Stride Scheduling (1995)(2):
“[…] A stride scheduler is based upon a virtual-time measurement that enables it to determine the process to be selected for the next time slice.
[…]
Each process is associated with a dynamic property which denotes an indicator for scheduling relative to other processes. This property is called the pass of a process. Allocations are performed by selecting the process with the currently minimal pass value. After consuming its quantum the process’ pass value is advanced by an increment which is inversely proportional to the amount of tickets allocated. As a consequence, passes of processes advance with inversely proportional speed of their allocations. This increment is called a stride. To keep integer values, we multiply the inverse proportional with a large integer constant $stride_1$.”
(1) http://people.redhat.com/mingo/cfs-scheduler/sched-design-CFS.txt
(2) http://wwwagss.informatik.uni-kl.de/Projekte/Squirrel/stride/node3….
Come on, Mr. Molnár, if you want to argue about deficiencies of someone else’s work, do NOT point at a third party’s work, point that one is a paraphrase of the other, *and* ditch both in the process, especially when you did the very same thing.

2007-08-31 8:30 pm
tyrione
Obviously, English is a second language by him [Ingo] and perhaps he should hire an editor to better clarify his work.
In fact, the entire community would benefit having technical writers proofread the explanations of code within the kernel.
It’s bad enough that some topic is arcane, but it doesn’t help when the designer can only explain it clearly in their native tongue which isn’t the “accepted” language of business.
If Russian, German, French, Spanish, Portuguese and whatnot were the standard language of Linux and the authors still wrote it as a second language it only weakens their work by such poor grammar.
Edited 2007-08-31 20:30
2007-09-01 7:32 am
renox
There’s ‘done’ and ‘done’: Ingo Molnar shoot down the proposal because someone has already done the same work to the CFS scheduler which is already in the kernel.
Nothing to do with having some previous research on the topic done.

2007-08-31 1:08 pm
smitty_one_each
You can patch the kernel to plug in any scheduler you like, even one you compose yourself.
A better question might be: what is the point of swapping out schedulers while running? .tsop-dim ni noitcerid gnipyt gnignahc ekil fo tros s’tahT

2007-08-31 2:10 pm
dylansmrjones
Changing direction in midpost is not so much of a problem. Some of us are pretty good at reading the wrong way. Writing the wrong way is much more difficult.
. o O ( That’s something you know if you have a warped sense of humor )

2007-08-31 2:47 pm
bryanv
Seriously funny though.
2007-08-31 4:17 pm
chuck
This one is actually more interesting to read…
Heh. My impression of Roman is that he is an arrogant fellow who has taken a dislike to Ingo. This personality conflict has been going on for some time in the CFS discussions, but I gathered my impressions of Roman from earlier, unrelated posts and his apparent trouble in working with others on the suspend code. What I find more interesting are Ingo’s evolving methods of dealing with this gadfly. Oh, and Galbraith’s tone is interesting also. Let’s see what Roman’s personality sparks from others working on the code.
2007-08-31 6:27 pm
bnolsen
This whole scheduler debacle has done nothing but make me wonder WTH is going on?
I recall that although ingo’s scheduler uses and RB-tree it’s claimed to be O(1) because xxxK jobs only require 15 (whatever number) this. At that point it was evident that B$ needed to be called.
Seems there is lots of politics going on. Probably more than should be.
Maybe kick all the bums out and have a contest on who can write a nice clean clear scheduler with not a lot of code or hack-about mumbo jumbo with snowballing to try to justify it. Maybe Linus needs to smack all these “my code/your code” guys and get back to a culture of “this is the code, it all sucks, some parts sucks less than other parts, get over it”
At least this scheduler is clear and very proveable. Easy to understand and easy to prove is good. Elegant design generally runs better than hacked stuff. The biggest problem always is finding the best “elegant design”.
Edited 2007-08-31 18:32 UTC

2007-08-31 6:53 pm
chuck
Most of the politics goes on in the comment threads at places like LWN, OSNews, and KernelTrap. Clear and elegant they aren’t, and O(N) to boot. I think we should shut down all the comment threads and start over. Present thread excepted, of course.
2007-08-31 8:39 pm
butters
Roman’s scheduler uses the same rbtree runqueue as CFS. This are O(1) for retrieving the leftmost task in the runqueue, which is the hot path. The insertion and removal operations are O(log N). This is the best algorithmic complexity that can be achieved in a fine-grained priority queue without consuming gobs of memory and trashing cache performance.
The runqueues used in the O(1) scheduler are arrays of linked lists of tasks. Indexing an array and inserting/removing from the head or tail of a linked list (as implemented in Linux) are both O(1). Computing the index is O(1) with the number of tasks but O(N) with the number of priorities, which is fixed at compile-time.
If the number of priorities is small, like the 100 levels in the O(1) scheduler by default, then this data structure is fine. But if the priority is based on something more continuous, like the amount of time due to a task, then an array is a very bad choice.
For one thing, an array that expresses the full resolution of time values would exhaust virtual memory all by itself. Even a rougher approximation of time would result in a large, sparse array with horrible cache performance.
The most efficient structure known to computer scientists for representing a collection sorted by a numerical key of considerable range is a balanced tree such as RB, AVL, or (for a sufficiently broad definition of “balanced”) splay trees.
The O(log N) operations are a classic performance vs. accuracy tradeoff. Furthermore, algorithmic complexity doesn’t correlate directly with performance. Depending on cache hotness, inserting a task into an rbtree can be significantly faster than for a priority array.
Finally, O(log N) is effectively constant across the (fairly narrow) typical range for tasks per CPU. This is one of the many advantages of per-CPU runqueues.

2007-08-31 7:25 pm
acobar
The whole history about such interest for a perfect scheduler is very entertainment.
First, I would publicly thanks Con Kolivas for attracting such huge interest for the topic. I also appreciate the hard work done by Ingo Molnar and I’m using it right now on a 2.6.22.5 kernel.
I’m not a specialist on that area but I am aware of how important and critical are good and trustable developers to a project to succeed. Even more on large projects like the linux kernel. That explains much of Linus behavior on the matter and on others too.
But I think Ingo got a bit far with the downplay game. Specially on this case. The argument of Roman was entirely based on math correctness and not on corner-case events. If he, Roman, is right, then it really means that his patches brings a quality rise to the scheduler code, independently of other possible gains. Lets not forget how important is to have something proved correct, if it is the case, instead of only rely on limited known cases.
Whatever happens, I believe we are going to win anyway.
2007-08-31 7:33 pm
jadeshade
Someone recently announced the SSS, or scheduler selecting scheduler, which switches between schedulers (at a tunable interval!). I forgot the author (it was on digg like a week back) but he was all like ‘it’s at least as good as the worst scheduler in the scheduling scheduler scheduler… schedule’.
coming in time for 2.6.23!
/tired of all of this, and just wants the laptop screen to turn back on when he opens the lid.
2007-09-01 3:31 pm
MrVain
You cant be good at everything. A specialized scheduler always outperforms a general one.
Solaris has multiple scheduling classes:
TS – traditional Unix timesharing scheduler
IA – a variant of TS that favors the window with the focus, to provide a more
responsive feel
RT – realtime classes
FX – fixed priorities
FSS – fair share scheduler, that tries to apportion CPU resources among
projects
SYS – only usable by the OS itself.
These can be tuned with dispadmin(1m).
More could presumably be added, if there were a compelling reason.
Certain combinations of usage may not make sense.
See also fx_dptbl(4), rt_dptbl(4), ts_dptbl(4), FSS(7), prctl(1), project(4).