Creating a Fault-Tolerant Environment in Windows Server 2003

Eugenia Loli 2004-08-08 Windows 7 Comments

There are many ways to add fault tolerance to network services and resources running on Windows Server 2003 servers, and all without the hassle of third-party software. Find out how to use them.

About The Author

Eugenia Loli

Ex-programmer, ex-editor in chief at OSNews.com, now a visual artist/filmmaker.

7 Comments

2004-08-09 2:21 am
Anonymous
Think about it, do you really want fault tolerant software? I think you mean something else, but everyone keeps saying it. I’d rather have bug-free, error-handling software than fault tolerant software any day.
2004-08-09 3:09 am
Anonymous
Did you read the article?
Your code could be 100% bug free, but how does that help it handle a network outage? A failed disk? Or any number of other environmental issues that expose themselves to a computing environment?
Yes I *do* want fault tolerant software….
2004-08-09 3:24 am
Anonymous
There are numerous ways to increase the availability of a service (that’s really what you are concerned with – the service that runs on the compute platform).
Fault tolerance (as a method of increasing availability) in systems goes far beyond disk mirroring, volume shadowing, network load balancing and other topics in this article.
REAL fault tolerant platforms (read: Tandem, Stratus etc.) mirror the hardware/OS/applications in real time. Internally, these systems have hardware subsystems that monitor hardware faults and switch the operational node state in near real-time. Tandem (for example) has a 3-way system, where one system is working (HOT), one is protect (HOT) and one system is standby (COLD). Operational switchover is measured in microseconds.
If your application is written against their libs or can be easily wrapped with their monitoring services, you can extend this type of behaviour up the software stack. Tandems will (by default) fault/switch on OS malfunction; extending that into the application space requires some custom glue & imagination.
These systems are used in situations where downtime is not an option. When someone says to me “fault tolerant” and in the next breath talks about cluster services and volume mirroring – it’s immediately apparent that they have never seen a fault tolerant system.
Many of our fT platforms run on DC directly (no UPS to die) and are driven directly by large battery strings that can keep the service up for a week without the diesels kicking in. Externally – the facilities are a minimum of dual *everything*. Dual transport, dual provider (LEC/ILEC/IXC)….
We measure uptime in years for the services that run on these systems. Everything is hot plug: CPU, memory, disk, network, I/O. We even changed out a bad backplane in one (once upon a time). Migrated the service – changed out the bad backplane. Tested it. Left it in place. No need to migrate back. We have at least one system that (as of today) has been running 24x7x365 for over 11 years.
Do a Google on Tandem+HLR …. to get an idea of what a fault-tolerant system is, why it needs to be fault tolerant and why this article does a very poor job of describing fault tolerance, let alone educating anyone on the topic.
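As a rough illustration of the three-way arrangement described in this comment (one working node, one hot protect node, one cold standby), the role changes on a failure could be sketched as below. This is only a toy model under stated assumptions: the names `Role`, `Node`, `promote` and `fail_over` are hypothetical, and nothing here reflects how Tandem or Stratus actually implement switchover in hardware.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    WORKING = "working"   # hot, serving the service
    PROTECT = "protect"   # hot, mirrored, ready to take over
    STANDBY = "standby"   # cold spare
    FAILED = "failed"


@dataclass
class Node:
    name: str
    role: Role


def promote(nodes, from_role, to_role):
    """Move the first node found in from_role into to_role (if any)."""
    for node in nodes:
        if node.role is from_role:
            node.role = to_role
            return


def fail_over(nodes, failed_name):
    """Mark one node as failed and shift the remaining roles up (assumed policy)."""
    failed = next(n for n in nodes if n.name == failed_name)
    lost_role = failed.role
    failed.role = Role.FAILED
    if lost_role is Role.WORKING:
        # The hot protect node takes over the service.
        promote(nodes, Role.PROTECT, Role.WORKING)
    if lost_role in (Role.WORKING, Role.PROTECT):
        # The cold standby is brought up to restore the hot pair.
        promote(nodes, Role.STANDBY, Role.PROTECT)
    return nodes


nodes = [Node("a", Role.WORKING), Node("b", Role.PROTECT), Node("c", Role.STANDBY)]
fail_over(nodes, "a")
for node in nodes:
    print(node.name, node.role.value)
```

The point of the sketch is only the ordering: the service moves to a node that is already hot, and the cold spare is used to rebuild redundancy afterwards rather than to carry the switchover itself.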
2004-08-09 5:19 am
Anonymous
All valid points, but the article points to not using 3rd party software, and that it isn’t the only way to achieve fault tolerance.
Spouting a ‘better’ way to achieve tolerance is irrelevant. If an organization requires fault-tolerance they need to do a risk assessment, and let *that* dictate what technology they roll out, if anything at all. Picking technology before doing the paperwork just ends up costing people time and money.
2004-08-09 7:07 am
Anonymous
But does Microsoft provide you with a journaling filesystem which is able to “repair” itself?
I mean, I have had to restore or even reinstall the Windows OS and data one too many times after a power cut (don’t tell me to use a UPS when the PSU fails).
Sorry to say, Windows NT 4 even failed C2 certification in a networked situation (standalone it’s OK).
also check out http://www.eros-os.org/
2004-08-09 8:21 am
Anonymous
> REAL fault tolerant platforms (read: Tandem, Stratus etc.) mirror the hardware/OS/applications in real time … switchover is measured in microseconds.
They have already introduced server virtualization in W2k3, so I suppose such functionality may not be that far behind. *In theory* at least, it shouldn’t be too difficult to have three hot copies of the OS running as virtual servers, ready to take over once hardware/stability problems are encountered, but yes, this sort of thing seems to be at least a few years into the future for now. Well, free’s free though, and not having to buy third party utilities to get high availability at least seems like a positive first step.
2004-08-10 1:32 am
Anonymous
> But does Microsoft provide you with a journaling filesystem which is able to “repair” itself?
In theory, yes – just like any other (non-data) journalling FS – your NTFS filesystem should never be corrupted (data might be lost).
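To make the metadata-only journaling point concrete, here is a minimal in-memory sketch of a redo log: an update is logged, then committed, then applied, and recovery replays only committed records. It is an assumption-laden toy, not NTFS’s actual log format or recovery procedure; the class `MetadataJournal` and its crash flags are hypothetical names invented for the illustration.

```python
class MetadataJournal:
    """Toy redo journal for metadata only (file contents are not logged)."""

    def __init__(self):
        self.journal = []    # stands in for the durable log
        self.metadata = {}   # stands in for on-disk metadata, e.g. name -> size

    def update(self, key, value, crash_before_commit=False, crash_before_apply=False):
        record = {"key": key, "value": value, "committed": False}
        self.journal.append(record)       # 1. log the intended change
        if crash_before_commit:
            return                        # simulated power cut
        record["committed"] = True        # 2. commit the log record
        if crash_before_apply:
            return                        # simulated power cut
        self.metadata[key] = value        # 3. apply the change in place

    def recover(self):
        """Discard uncommitted records, then redo every committed one."""
        self.journal = [r for r in self.journal if r["committed"]]
        for record in self.journal:
            self.metadata[record["key"]] = record["value"]


fs = MetadataJournal()
fs.update("a.txt", 10)
fs.update("b.txt", 20, crash_before_apply=True)    # committed, never applied
fs.update("c.txt", 30, crash_before_commit=True)   # logged, never committed
fs.recover()
print(fs.metadata)
```

After recovery the metadata is consistent: the committed update is redone and the uncommitted one is dropped, which matches the "never corrupted, but data might be lost" behaviour described in the reply.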
> I mean, I have had to restore or even reinstall the Windows OS and data one too many times after a power cut (don’t tell me to use a UPS when the PSU fails).
Sounds like you’ve got other issues if it’s happening regularly.
> Sorry to say, Windows NT 4 even failed C2 certification in a networked situation (standalone it’s OK).
IIRC, C2 *requires* no network, so only standalone machines can be rated C2 (or could; it’s been obsoleted, I think).