How do you design a computing system to provide continuous service and to ensure that any failures interrupting service do not result in customer safety issues or loss of customers due to dissatisfaction? Historically, system architects have taken two approaches to answer this question: building highly reliable, fail-safe systems with low probability of failure, or building mostly reliable systems with quick automated recovery.
Then you must have heard about http://www.eros-os.org/ – “The Extremely Reliable Operating System” – and their story about “Orthogonal Global Persistence” at http://www.eros-os.org/project/novelty.html:
A (true) story about KeyKOS may provide some sense of the value of orthogonal persistence:
At the 1990 Uniforum vendor exhibition, Key Logic, Inc. found that their booth was next to the Novell booth. Novell, it seems, had been bragging in their advertisements about their recovery speed. Being basically neighborly folks, the Key Logic team suggested the following friendly challenge to the Novell exhibitionists: let’s both pull the plugs, and see who is up and running first.
Now one thing Novell is not is stupid. They refused.
Somehow, the story of the challenge got around the exhibition floor, and a crowd assembled. Perhaps it was gremlins. Never eager to pass up an opportunity, the KeyKOS staff happily spent the next hour kicking their plug out of the wall. Each time, the system would come back within 30 seconds (15 of which were spent in the BIOS PROM, which was embarrassing, but not really Key Logic’s fault). Each time Key Logic did this, more of the audience would give Novell a dubious look.
Eventually, the Novell folks couldn’t take it anymore, and, gritting their teeth, they carefully turned the power off on their machine, hoping that nothing would go wrong. As you might expect, the machine successfully stopped running. Very reliable.
Having successfully stopped their machine, Novell crossed their fingers and turned the machine back on. Forty minutes later, they were still checking their file systems. Not a single useful program had been started.
Figuring they had probably made their point, and not wanting to cause undeserved embarrassment, the KeyKOS folks stopped pulling the plug after five or six recoveries.
That is a great story. I’m an availability guy myself. It’s not like they are competing, necessarily, but with limited budgets and stiff requirements, I find that focusing on quick recovery and/or failover is the most efficient use of resources.
For example, no matter how reliable the system was built to be, it still has to have a backup/recovery plan. But if it has a good enough backup/recovery plan, the significance of its reliability diminishes, because the repercussions of a failure diminish.
Actually I think it’s an “unfair matchup.” Availability is an end, and reliability is just a means to that end. Reliability, per se, isn’t relevant.
Oh, and another thing: when it comes to saving your job, hardware/software system failure is generally considered bad luck, an act of God, a fact of nature, etc. A slow recovery is always *your* fault. Hence, if the hard drive dies once a year but you always have the system back up in 4 minutes, you look like a stud. If your hard drive dies once in 8 years but it takes you three days to get everything working again, everyone is pissed at you and wondering if you were adequately prepared.
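A quick back-of-the-envelope calculation backs this up. Here is a rough sketch (the uptime/(uptime+downtime) model and the exact numbers are my own assumptions, just to compare the two scenarios above):

```python
# Rough availability math for the two scenarios above.
# availability = MTBF / (MTBF + MTTR), everything in minutes.

def availability(mtbf_min, mttr_min):
    return mtbf_min / (mtbf_min + mttr_min)

year = 365 * 24 * 60  # minutes in a year

# Drive dies once a year, system back up in 4 minutes.
a1 = availability(year, 4)
# Drive dies once in 8 years, but it takes 3 days to get everything working.
a2 = availability(8 * year, 3 * 24 * 60)

print(f"fails yearly, 4 min recovery  : {a1:.6f}  (~4 minutes of downtime per year)")
print(f"fails in 8 yrs, 3 day recovery: {a2:.6f}  (~9 hours of downtime per year)")
```

The flakier box with the fast recovery actually ends up with far less downtime per year.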
Yes, it runs 24x7x365, but what price do we pay for it? Too much. OpenVMS clusters are the way to go.
Back in ’91, when I was a PFY using Novell, it would always recover well and unattended. We had several sites with intermittent power problems, and we didn’t always have a UPS. But no problems – we never had to visit a site and do a manual recovery or tweak of any sort.
Funnily enough, most users were plugging in laptops, so they had a UPS of sorts. So the lights would blink, and they’d have to wait a minute and log back in, and that was all.
What mechanisms would a hypothetical OS provide to deliver desired levels of the following orthogonal goals:
1. availability – as discussed above
2. integrity – do you trust (1) the information, (2) the source of that information?
3. accountability – non-repudiation
4. privacy – no information leakage
Why? Well, OSes are just balancing devices – they balance CPU and storage resources in ways that are optimal for your tasks. Some OSes also balance “time” in addition to CPU and storage – these are known as real-time OSes.
Now, such an OS would also balance “security”, as defined by the four items above, alongside CPU, storage, and possibly time.
It’s interesting to think about how such an OS/kernel would meet the above targets.
For privacy, it could encrypt storage and communication flows – the more privacy required, the deeper and stronger this encryption would be.
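To make the privacy point concrete, here is a minimal userspace sketch of encrypt-before-write, assuming the third-party Python cryptography package (Fernet); a real OS would presumably do this transparently at the block-device or socket layer and keep the keys out of reach of applications:

```python
# Sketch: encrypt data before it reaches storage, decrypt on the way back.
# Assumes the 'cryptography' package; key handling here is deliberately naive.
from cryptography.fernet import Fernet

key = Fernet.generate_key()      # imagine a per-volume or per-flow key
vault = Fernet(key)

plaintext = b"customer record #42"
stored = vault.encrypt(plaintext)          # what actually lands on disk / the wire
assert vault.decrypt(stored) == plaintext  # readable only with the key
```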
For integrity, I guess there must be various certificates/tickets and chains of trust, as well as hash checksums, throughout the kernel?
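The checksum half of that can be sketched with nothing but the standard library; the certificate/chain-of-trust machinery would sit on top of something like this (the names and the shared key are purely illustrative):

```python
# Sketch: plain hashes answer "was the data altered?";
# keyed HMACs also answer "do I trust who produced it?".
import hashlib
import hmac

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def auth_tag(data: bytes, key: bytes) -> str:
    return hmac.new(key, data, hashlib.sha256).hexdigest()

module_image = b"pretend this is a kernel module"
secret = b"shared secret between builder and loader"   # illustrative only

expected = auth_tag(module_image, secret)
# Later, before loading/trusting the data:
assert hmac.compare_digest(auth_tag(module_image, secret), expected)
```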
Availability would use QoS-like schemes to schedule various sets of tasks – not just network I/O.
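As a toy illustration of a QoS-like scheme (my own sketch, not any real kernel’s algorithm), here is a tiny stride scheduler: every task gets a share of “tickets”, and the task that has used the least of its share runs next, so availability-critical work can be guaranteed a proportion of the CPU:

```python
# Toy stride scheduler: tasks with more tickets get proportionally more turns.
def schedule(tasks: dict[str, int], rounds: int) -> list[str]:
    passes = {name: 0.0 for name in tasks}
    order = []
    for _ in range(rounds):
        name = min(passes, key=passes.get)   # lowest pass value runs next
        order.append(name)
        passes[name] += 1.0 / tasks[name]    # stride is the inverse of the share
    return order

print(schedule({"heartbeat": 3, "batch": 1}, 8))
# -> 'heartbeat' runs roughly three times as often as 'batch'
```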
Accountability is a hard one – surely there is a better way than logging every kernel event?
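One hedged idea (again my own sketch, not a claim about any existing kernel): rather than trusting a flat log of everything, hash-chain whatever events you do decide to record, so that deleting or editing an entry is detectable and records are harder to repudiate:

```python
# Sketch: tamper-evident audit log. Each record commits to the previous one,
# so removing or altering an entry breaks the chain on verification.
import hashlib
import json

def append(log, event):
    prev = log[-1]["hash"] if log else "0" * 64
    body = json.dumps({"event": event, "prev": prev}, sort_keys=True)
    log.append({"event": event, "prev": prev,
                "hash": hashlib.sha256(body.encode()).hexdigest()})

def verify(log):
    prev = "0" * 64
    for rec in log:
        body = json.dumps({"event": rec["event"], "prev": prev}, sort_keys=True)
        if rec["prev"] != prev or rec["hash"] != hashlib.sha256(body.encode()).hexdigest():
            return False
        prev = rec["hash"]
    return True

audit = []
append(audit, "alice read /etc/shadow")
append(audit, "alice escalated to root")
assert verify(audit)
```

Signing each hash, or periodically anchoring it somewhere external, would get you closer to real non-repudiation.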
Any ideas?
That was a great post!
Seems somewhat obvious. I want my pacemaker never to fail and to recover instantly. On my server, fast recovery and unlikely failure are probably good enough.
The difference is:
She’s reliable if she always performs well, she’s accessible if she doesn’t have a husband/boyfriend/job.
>she’s accessible if she doesn’t have a husband/boyfriend/job
..and that doesn’t even stop some from being accessible
>Reliability, per se, isn’t relevant.
I think there are circumstances where reliability is quite important: just think of a computer that controls rocket engines or stuff like that. No matter how fast the recovery is, it will probably be too late…
Nevertheless, I agree that in most cases you’re right…
>Reliability, per se, isn’t relevant.
Good point. I guess what I meant was that reliability only matters as it affects availability. In an application where availability must be 100%, reliability might, practically speaking, be all-important. Nevertheless, if you could come up with a way to achieve truly instant failover, reliability would drop to not being important at all. It’s that sense in which reliability, by itself, isn’t the relevant thing.
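That relationship is easy to see in the usual availability formula: availability = MTBF / (MTBF + MTTR). A tiny illustration (the numbers are made up):

```python
# As MTTR (recovery time) approaches zero, availability approaches 1
# no matter how unreliable (low-MTBF) the system is.
def availability(mtbf_hours, mttr_hours):
    return mtbf_hours / (mtbf_hours + mttr_hours)

for mtbf in (100, 10_000, 1_000_000):            # flaky ... very reliable
    print(mtbf, round(availability(mtbf, 0.001), 9))
# With near-instant failover all three come out at 0.99999 or better –
# the MTBF barely matters.
```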
I was wrong: restarts can be fast enough even for rockets:
>Fault-injection experiments have shown that restarts are very fast, on the order of a hundred microseconds. A network card can crash and be restarted almost unnoticed as a transfer is in progress.
http://www.linux-mag.com/2004-10/xen_01.html
Really interesting stuff and worth reading. I hope we will see some cool features in the next Linux distros…
As far as I know, Xen is already included in at least one of them (SUSE).
Reliable and available are related, but really separate concepts. You can have systems that are 100% available but very unreliable (like: never producing accurate results), and systems that are available 10% of the time but do their job very reliably. But I guess the posters above have clarified that difference.
Just in case somebody missed it: development on the EROS project has been picked up again under the name Coyotos (www.coyotos.org). Damn, Eugenia, can’t you add the ‘anchor’ element to the allowed tags, so that comments can include clickable links, like everywhere else on this planet? (No offense intended – this is a great site! 😉)
One more point: anything is only as reliable as the foundation it’s built on. You can have 100% reliable software, but it won’t run 100% reliably on crappy hardware. If you want quality system service, lay your foundation with stable hardware. For more on this, read up on the concept of “leaky abstractions”: http://www.joelonsoftware.com/articles/LeakyAbstractions.html