With Linux 4.0, you may never need to reboot your operating system again.
One reason to love Linux on your servers or in your data center is that you so seldom need to reboot it. True, critical patches require a reboot, but you could go months without rebooting. Now, with the latest changes to the Linux kernel, you may be able to go years between reboots.
That’s pretty cool. Are there limits on the level of patching that a running kernel can have?
Say, can the scheduler be patched, or even replaced with a new one?
Let’s not make it out to be something more than it really is.
It’s a way to do quick security fixes while in use and without a reboot/downtime.
It was never meant for large changes.
Well, yeah, but I am curious about the limits.
Don’t know if you are a programmer… but let’s try to keep this simple.
What is going on is that you have a memory address where a function starts. Other locations point to this memory address when code wants to call the function.
They load a module that creates the new function at a different location (maybe keeping a backup of the original), and at the memory address of the original function they put a jump to the new location.
Now as you can imagine, if you want to make larger changes than just changing how a single function works…
well, things get complicated fast. 🙂
And it’s already more complicated than that, because you can’t change a function while one of the CPUs is executing it, I believe. So there is timing or locking involved.
Especially fun if you need to update more than a single function.
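To make that concrete, here is a minimal userspace sketch (C, x86-64 only) of the redirection idea: overwrite the first bytes of the original function with a jump to the replacement. The real kpatch/kGraft go through the kernel’s ftrace infrastructure with proper synchronization; the function names here are made up, and the sketch assumes the original function occupies at least 12 bytes and that its text page can be made writable.

```c
/*
 * Toy userspace version of the redirection described above (x86-64).
 * NOT how the kernel does it; kpatch/kGraft use ftrace and proper
 * synchronization. Assumes the original function occupies at least
 * 12 bytes and that we may make its page writable.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static int old_handler(int x) { return x + 1; }   /* the "buggy" function  */
static int new_handler(int x) { return x + 2; }   /* the "patched" version */

static void redirect(void *from, void *to)
{
    /* Make the page(s) containing 'from' writable as well as executable. */
    uintptr_t page = (uintptr_t)sysconf(_SC_PAGESIZE);
    void *start = (void *)((uintptr_t)from & ~(page - 1));
    if (mprotect(start, page * 2, PROT_READ | PROT_WRITE | PROT_EXEC) != 0) {
        perror("mprotect");
        return;
    }

    /* movabs rax, <to> ; jmp rax  -- 12 bytes overwriting the old entry. */
    unsigned char stub[12] = { 0x48, 0xb8, 0, 0, 0, 0, 0, 0, 0, 0, 0xff, 0xe0 };
    uint64_t target = (uint64_t)(uintptr_t)to;
    memcpy(stub + 2, &target, sizeof(target));
    memcpy(from, stub, sizeof(stub));
}

int main(void)
{
    /* Call through a volatile pointer so the compiler can't inline the call. */
    int (*volatile fn)(int) = old_handler;

    printf("before: %d\n", fn(1));                     /* 2 */
    redirect((void *)old_handler, (void *)new_handler);
    printf("after:  %d\n", fn(1));                     /* 3: jumps to new_handler */
    return 0;
}
```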
I haven’t looked into how Linux 4.0 will be doing it, but my method would be to suspend scheduling momentarily, emplace the in-memory call redirection, then resume scheduling.
Using preemption to interrupt all running threads should take only a few milliseconds.
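In kernel terms that is roughly what kpatch does through the stop_machine() facility. A rough sketch, with the patch-application details waved away (install_redirect() below is a hypothetical helper, not a real kernel API):

```c
/*
 * Sketch of the "pause everything, patch, resume" idea, roughly what
 * kpatch does via the kernel's stop_machine() facility. Not real kpatch
 * code; install_redirect() below is a hypothetical helper.
 */
#include <linux/kernel.h>
#include <linux/stop_machine.h>

struct patch_desc {
    void *old_func;   /* entry point being replaced */
    void *new_func;   /* replacement function       */
};

/* Hypothetical: rewrite old_func's entry to jump to new_func. */
static void install_redirect(void *old_func, void *new_func);

/* Runs while every CPU is parked in a known-safe state. */
static int apply_patch(void *data)
{
    struct patch_desc *p = data;

    /* No other CPU is executing arbitrary kernel code right now, so the
     * entry point can be rewritten safely. (The real implementation also
     * checks kernel stacks so no sleeping task is parked inside old_func.) */
    install_redirect(p->old_func, p->new_func);
    return 0;
}

static int livepatch_apply(struct patch_desc *p)
{
    /* Interrupt all CPUs, run apply_patch() once, then resume. */
    return stop_machine(apply_patch, p, NULL);
}
```

The obvious cost is that all CPUs have to rendezvous, which gets slower the bigger the machine.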
Well, actually, it’s sort of both methods, because both were proposed: one by SUSE and the other by Red Hat.
And my description sucked; it’s more complicated. The Wikipedia descriptions are better:
http://en.wikipedia.org/wiki/KGraft
http://en.wikipedia.org/wiki/Kpatch
Later on they made a new patch, which supports both patch formats. The method it uses isn’t in the article:
http://www.infoworld.com/article/2862739/linux/the-winning-linux-ke…
Neither KGraft nor Kpatch handles data structures. The combined method does not handle data structures either.
https://www.suse.com/documentation/sles-12/art_kgraft/data/art_kgraf…
Section 7 here kinda explains why. If you have Nvidia or other unknown kernel modules loaded, changing a data structure is like playing chicken with a train. You know which one is going to lose.
Is there a solution? Kinda, but it’s still horrible: hibernate combined with kexec, and again you hope you don’t become closed-source-driver roadkill because the driver doesn’t support kexec.
Some people wonder why Linux people hate closed-source drivers so much. When hibernation and kexec both end up bust at times because of them, it’s kinda understandable.
oiaohm,
I was thinking this too: serialize the system state in the running kernel (i.e. sockets, processes, pids, file handles, etc.), kexec into the new kernel, and load the previously saved state in the new kernel. This would allow kernel upgrades to be almost arbitrary in nature. As long as both kernels supported the same serialization format (say XML), then the change in data structures would be totally irrelevant.
However this would be the slowest approach as a kexec into a new kernel while loading state might take 10-60s, in which case it’s a dubious benefit over a normal reboot (other than the fact that applications haven’t lost their state).
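A toy illustration of that idea, assuming a simple text format rather than XML and with all names made up (no such snapshot interface exists in Linux today; the hard part of teaching every subsystem to serialize itself is entirely hand-waved):

```c
/*
 * Toy illustration of the serialize -> kexec -> restore idea above.
 * "State" here is just a pid and a name, and the format is plain text
 * rather than XML; all names are made up, and no such snapshot
 * interface exists in Linux today.
 */
#include <stdio.h>

struct saved_process {
    int pid;
    char comm[32];
};

/* Before kexec: write records in a format both kernels understand. */
static void snapshot_save(FILE *out, const struct saved_process *p, int n)
{
    for (int i = 0; i < n; i++)
        fprintf(out, "process pid=%d comm=%s\n", p[i].pid, p[i].comm);
}

/* After kexec: the new kernel rebuilds its own (possibly different)
 * internal structures from the same records. */
static int snapshot_restore(FILE *in, struct saved_process *p, int max)
{
    int n = 0;
    while (n < max &&
           fscanf(in, "process pid=%d comm=%31s\n", &p[n].pid, p[n].comm) == 2)
        n++;
    return n;
}

int main(void)
{
    struct saved_process before[2] = { { 1, "init" }, { 712, "sshd" } };
    struct saved_process after[2];

    FILE *f = tmpfile();   /* stands in for memory that survives the kexec */
    if (!f)
        return 1;

    snapshot_save(f, before, 2);
    rewind(f);

    int n = snapshot_restore(f, after, 2);
    for (int i = 0; i < n; i++)
        printf("restored pid=%d comm=%s\n", after[i].pid, after[i].comm);
    return 0;
}
```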
Actually 4.0 doesn’t include all the pieces and we have to wait for a later version before it can be used:
https://lwn.net/Articles/634649/
I haven’t looked into how Linux 4.0 will be doing it, but my method would be to suspend scheduling momentarily, emplace the in-memory call redirection, then resume scheduling.
Using preemption to interrupt all running threads should take only a few milliseconds.
The two different methods exist for a key reason.
Kpatch implements your method.
KGraft does not. KGraft takes an RCU-style approach, so the scheduler does not stop.
The problem here is that preemption on a 4000+ core system might take several minutes to complete.
Kpatch is better for smaller systems; KGraft is better for large systems. KGraft introduces some extra limitations on how kernel data structures can be altered.
So neither is a 100 percent correct answer.
KGraft allows both the old and the new function to be in use at the same time.
Linux kernel 4.0 supports both solutions.
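For contrast with the stop-everything sketch earlier, here is a toy userspace simulation of the kGraft-style approach: old and new versions of a function coexist, and each task is switched over only when it reaches a safe point, so nothing has to stop. All names are made up; this is not kGraft code.

```c
/*
 * Toy userspace simulation of the kGraft-style approach described above:
 * old and new versions coexist, and each "task" flips to the new one only
 * when it passes a safe point (the kernel/user boundary in the real thing),
 * so nothing ever has to stop. All names are made up; this is not kGraft.
 */
#include <stdbool.h>
#include <stdio.h>

static int old_impl(int x) { return x + 1; }   /* pre-patch behaviour  */
static int new_impl(int x) { return x + 2; }   /* post-patch behaviour */

struct task {
    const char *name;
    bool migrated;   /* has this task passed a safe point since patching? */
};

/* All calls go through a stub that picks old or new per task. */
static int patched_call(struct task *t, int x)
{
    return t->migrated ? new_impl(x) : old_impl(x);
}

/* Invoked when a task crosses a safe point. */
static void safe_point(struct task *t)
{
    t->migrated = true;
}

int main(void)
{
    struct task a = { "A", false }, b = { "B", false };

    printf("A=%d B=%d\n", patched_call(&a, 1), patched_call(&b, 1)); /* both old */
    safe_point(&a);                         /* A migrated, B still on old code  */
    printf("A=%d B=%d\n", patched_call(&a, 1), patched_call(&b, 1));
    safe_point(&b);                         /* all migrated; old code removable */
    printf("A=%d B=%d\n", patched_call(&a, 1), patched_call(&b, 1));
    return 0;
}
```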
Wouldn’t it be much quicker to upgrade my OS? I am hoping someone more knowledgeable can tell me:
If you don’t reboot you can keep most of the data you need in memory. That could mean that Ubuntu does a dist-upgrade in memory, does a check, and stores the result to the hard drive. I would think that this would be much faster than downloading it to disk, installing it to disk, and rebooting.
Over time, some bits might randomly flip in RAM – stuff that remains resident but is read frequently could slowly get corrupted anyway. (If it’s a server, you will want to be using ECC RAM, of course.) Either way, the kernel won’t catch the change and undefined scenarios could play out after long uptimes.
Well, with ECC, single-bit errors are repairable, and double-bit errors are at least detectable. Linux kernels after 2.6.30 will also scrub memory looking for bytes that generate errors, and if the same location generates errors too often, the kernel will mark the page as not-to-be-used.
Server gear frequently has BIOS options for scrubbing, as well, if your OS doesn’t support it directly.
This should pretty much eliminate silent corruption in memory.
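If you want to watch this from userspace, the kernel’s EDAC subsystem exposes corrected/uncorrected error counts per memory controller in sysfs. A small sketch, assuming an EDAC driver is loaded and that mc0 exists (paths can differ between systems):

```c
/*
 * Small sketch: read the EDAC corrected/uncorrected error counters from
 * sysfs. Assumes an EDAC driver is loaded and memory controller mc0
 * exists; the paths may differ on other systems.
 */
#include <stdio.h>

static long read_count(const char *path)
{
    long v = -1;
    FILE *f = fopen(path, "r");

    if (f) {
        if (fscanf(f, "%ld", &v) != 1)
            v = -1;
        fclose(f);
    }
    return v;   /* -1 means "not available on this machine" */
}

int main(void)
{
    long ce = read_count("/sys/devices/system/edac/mc/mc0/ce_count");
    long ue = read_count("/sys/devices/system/edac/mc/mc0/ue_count");

    printf("corrected ECC errors:   %ld\n", ce);
    printf("uncorrected ECC errors: %ld\n", ue);
    return 0;
}
```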
There are already systems with decade+ uptimes.
For example, here’s a thread on Ars where a user had to shut down a NetWare server because the bearings on one of the 5 1/2″ full-height hard drives were making too much noise. That system had 16 1/2 years of uptime.
http://arstechnica.com/civis/viewtopic.php?f=23&t=1199529
Of course, that’s 16 1/2 years without security updates. Ew.
True. I wonder what the mean time to failure is for crashes caused by random memory corruption.
No different to any non-patched system, and given I have personally worked on machines with uptimes measured in multiples of years, there’s no reason a patched kernel couldn’t maintain that level of uptime as well.
The hardware (especially power supplies or spinny rust) is going to fail before a random, uncorrected, bit-flip will cause a crash.
This will inevitably be compared to ksplice, now owned by Oracle. The benefits of ksplice couldn’t be practically realized without additional service contracts to provide well-tested kernel patches. One might assume this was simply due to Oracle’s business model, but actually the ksplice technology needed some help from developers to support specific kernels.
http://en.wikipedia.org/wiki/Ksplice
So I wonder to what extent this new kernel patching technology is going to be automatic versus requiring tweaking by distros for each kernel?
How are the no-reboot patches in Linux 4.0 different from ksplice?
http://en.wikipedia.org/wiki/Ksplice
Didn’t ksplice already allow for this functionality?
Well, I suppose the primary difference is that this doesn’t rely on proprietary tools nor require a subscription with Oracle.
That’s cool, thanks. I was under the impression that ksplice was part of the Linux kernel itself.
Quite frankly, I doubt it’s entirely “risk free”. For example, if key data structures are changed (e.g. they’ve changed something from a linked list into a hash table) I’d just expect it to screw everything up.
In a similar way; a blocked process has data on its kernel stack; and if any code is changed that relies on the data on a process’ kernel stack (including just updating the compiler without changing any kernel code) then I’d expect that to cause a massive disaster too (like, all processes crashing).
For something important (e.g. a critical server), I wouldn’t trust this at all – I’d disable it and reboot when updating the kernel (note: I assume that if downtime is extremely important you’re using failover or something in case of hardware failure or hardware upgrade, and rebooting is even less of a problem than it would be for something “non-critical”). For typical desktop systems where it doesn’t matter if you reboot or not, I’m too lazy to figure out the mystic incantations needed and would just reboot in that case too.
– Brendan
It’s similar to ksplice.
You use it for security updates. Not for new features.
If a memory structure needs to change because of a security update, it’s a more complicated patch and needs some manual coding by the people making the reboot-less-patches.
And self-contained bug fixes. But yeah, no kernel data structure updates allowed.
You can do kernel data structure updates, but it usually means someone has to make the kernel patch by hand.
At least that was how it was with ksplice
Right; but the article (and other articles) are saying silly/misleading things, like “never reboot again” (and aren’t saying realistic/practical things, like “temporarily postpone rebooting for minor security patches“).
– Brendan
It will work for security updates – by their nature they are forced to only solve the security problem, not change anything else, and not break compatibility (i.e. no data structure changes). Mostly it means changing some conditions (“if” statements) or using a pointer instead of the value of a variable (this is C). Your kernel hacker will tell you if the problem can be solved this way (we have them already as kernel package maintainers). And most real enterprise distributions do patches/fixes/security updates this way (new features are introduced in a [half of a] year timeframe or so) – for example see https://en.wikipedia.org/wiki/Red_Hat_Enterprise_Linux#Version_histo…
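A made-up example of the kind of change being described: a fix that only tightens a condition and touches no data structures or interfaces, which is exactly what makes it a good live-patch candidate (all names are hypothetical):

```c
/*
 * Hypothetical example of the kind of fix described above: the security
 * patch adds one bounds check and changes nothing else, so the live patch
 * only has to swap in the new function.
 */
#include <stddef.h>
#include <string.h>

#define BUF_LEN 64

/* Original, vulnerable version: trusts the caller-supplied length. */
static int copy_request_old(char *dst, const char *src, size_t len)
{
    memcpy(dst, src, len);          /* len > BUF_LEN overflows dst */
    return 0;
}

/* Patched version: one extra "if" is the entire security fix. */
static int copy_request_new(char *dst, const char *src, size_t len)
{
    if (len > BUF_LEN)
        return -1;
    memcpy(dst, src, len);
    return 0;
}

int main(void)
{
    char buf[BUF_LEN];
    const char request[] = "hello";

    copy_request_old(buf, request, sizeof(request));   /* fine for small input */
    return copy_request_new(buf, request, sizeof(request));
}
```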
First off: isn’t this a journalist known for being a bit of an MS shill?
Secondly: replacing bits of a running kernel is really microkernel territory, surely. Binary patches into a monolithic kernel are only going to work part of the time – like when the incoming patch is smaller than the slot you push it into.
It could certainly be popular in data centres, though foul-ups could be monumental in scale. I am guessing there will be mechanisms for signing, encryption & verification so that the whole thing stays securely on its rails.
Well, surely if the replacement function is larger than the original function it can simply be placed elsewhere in memory and a JSR or similar placed in the original function location. That should work fine until the entire kernel is replaced at some date in the future. I don’t see why it really matters whether the function is part of a binary blob in the kernel or a separate module loaded afterwards once its location is known.
Haha, I dare say there’s a good bunch of Linux admins crying in their camomile tea that any dweeb can match their oh-so-precious uptime numbers now.
Hopefully this encourages them to apply security patches instead of putting their e-peen uptime ahead of everything else.
Uptime for individual servers is really a meaningless metric and has been so for quite some time now. Either your service is so important that you already have a failover, or it isn’t, in which case rebooting for an update doesn’t matter.
It is even more meaningless when you get to the weird voodoo magic that is the IBM mainframe*, where you can literally replace every hardware component a piece at time without a second of downtime.
If you do that, is it the same server? It’s a modern day Ship of Theseus!
*Higher-end Sun gear used to be able to do this, don’t know about Oracle’s gear. HP Integrity can, too, I think, and others.
Drumhellar, it does not exactly have to be that high-end. It’s possible in some whitebox solutions.
When you get up into Intel Xeon and AMD Opteron territory you have motherboard interconnects. So a dual-motherboard system (yes, this does have dual to quad power supplies, with power supply bars, so a single failed power supply does not disrupt things) can in fact hot-swap everything, as long as you are very careful and very sure of which motherboard is in fact the active node.
The higher-end gear has nice ways of displaying the active state, and sometimes locks to prevent removal of the active node.
“It’s a modern day Ship of Theseus” – kind of. Normally the core chassis in these systems doesn’t get replaced or can’t be replaced. Some of the most insane IBM systems had a fragmented chassis, so it was possible to change the chassis piece by piece.
Back in the day the overall uptime of a system was a benchmark I loved to track. I enjoy reading stories of how many years a system had been on without being shut down. However, in the modern era “the system” is often not one server but a collection of servers. The state of the collection, shall we say cloud, of servers performing that task is what is important. Therefore rebooting to apply patches isn’t such a bad thing. It’s still cool to say that a system never *has* to go down. However, I imagine it doesn’t mean that should be the MO.
Hank,
I agree, having services provided redundantly is the ideal solution, albeit not as practical for desktops/single servers.
With a redundant cluster, one could simply take each node offline one at a time. The cluster’s built-in redundancy mechanisms should provide service seamlessly to customers. Now you can perform whatever maintenance you need on individual nodes. When they come back up they can join the cluster again, and customers are none the wiser that a node was ever down.
That doesn’t always apply if you are, for example, the cloud provider. AWS (& other compute clouds) have had to do a fleet-wide reboot to patch Xen issues twice in the past 12 months. While it’s easy to say “Well, customers shouldn’t rely on any single instance”, the reality is that customers do get grumpy when instances go down. Also, when you’re the size of AWS you can’t spend weeks per AZ slowly rebooting your thousands of virtualisation hosts.
You’re right, though, that the normal procedure should still remain to install a new kernel & reboot.
Uh, are you sure? We have a load of servers on AWS and none of them has had a reboot (that wasn’t done by us) in the last 12 months.
Soulbender,
I had read this too, but I was under the impression it was just some of the older nodes. Still, I have to wonder why live migration wasn’t available? Was there a breaking change that prevented live migration from being used?
Note I’m not an Amazon EC2 customer, I’ve only trialed it. For my needs, I’d require virtualization be available to me, which would require nested virtualization to work on EC2 nodes, and I don’t think it does. I’m reading that there are third-party hypervisors that work around this limitation and run on EC2:
http://www.virtuallyghetto.com/2013/08/ravello-interesting-solution…
http://www.ibm.com/developerworks/cloud/library/cl-nestedvirtualiza…
In my own lab, Linux nested virtualization used to work (via KVM-Intel), but was broken last year and it hasn’t worked for me since (using stock kernels).
https://www.mail-archive.com/[email protected]/msg111503.html
http://www.spinics.net/lists/kvm/msg112133.html
EDIT: Found a comment on Ars that goes into detail about other potential issues with Xen migration (Which is what AWS uses):
Drumhellar,
That makes sense to me. Probably would have been feasible to throttle the migrations but it would take a long time and they figured a reboot was better for them than migrating all the bits around the network/drives.
Oops. I accidentally edited that quoted part out of my post after I found a larger, more informed post on the subject.
Drumhellar,
Very informative! Makes me wonder why Xen didn’t always have upgrade paths in place. Each version should always be able to migrate to the next. I guess they didn’t enforce this, but I hope they have this sorted for the future!
How do you actually use this?
Is there something I need to do from userspace to use kpatch/kgraft?
Or distros need to add the support?
How does it work?
This headline needs a hyphen. “No reboot patching comes to Linux 4.0” means that Linux 4.0 will not have a reboot patch. “No-reboot patching comes to Linux 4.0” means that Linux 4.0 will allow people to install patches without rebooting.
In the small one-man-driven-server world I’m able to reboot any machine at any time I want (well, mostly – on a Samba server for 600 people I simply cannot reboot until 11pm). But in real business there are more admins, and rules about what can be done and when it can be done. Reboots are planned well ahead of time (say, once per week). But there could be a security fix in the kernel I want to apply even though the scheduled reboot is far in the future. And there is a welcome place for no-reboot patches to save my ass, because if my server gets hacked, it is my responsibility. But I still take responsibility for the “unnecessary” reboot (and my boss does not understand that our server was not hacked because I did the reboot out of schedule, since there is no proof of this I can present in front of his boss).