Another great little story from The Old New Thing.
At one point, the following code was added to the part of the kernel that brings the system out of a low-power state:
;
; Invalidate the processor cache so that any stray gamma
; rays (I'm serious) that may have flipped cache bits
; while in S1 will be ignored.
;
; Honestly. The processor manufacturer asked for this.
; I'm serious.
;
        invd

I’m not sure what the thinking here is. I mean, if the cache might have been zapped by a stray gamma ray, then couldn’t RAM have been zapped by a stray gamma ray, too? Or is processor cache more susceptible to gamma rays than RAM? The person who wrote the comment seems to share my incredulity.
The invd was commented out a few weeks later, but the comment block remains in Windows’ kernel code to this day. Amazing.
There is a reason NASA and other space agencies often use silicon a decade or more older, and it’s not just because they know how it behaves.
So it is actually plausible that the smaller semiconductor features become, the more likely they are to be zapped by stray radiation from any number of sources.
We can’t have it both ways: we can’t claim gamma rays caused our cancer and then claim gamma rays don’t affect our semiconductors. The only real question, then, is whether it makes sense given that the rewrite source might also be corrupted, and what the real-world likelihood is of it happening on a critical system.
BTW, for those who think hardened components mean shielding: a gamma ray will penetrate about 15 metres of water, 2 metres of concrete or 0.5 metres of lead!
It’s not that they are more likely to be hit, but that bits are easier to flip: less energy is required to do so in newer chips as the process technology shrinks.
At least AMD’s CPUs and GPUs have error correction on-chip as well as for the off-chip memory. Vega 20 notably has some error-correction enhancements (ECC added to the GPU cache).
For instance, Ryzen is supposed to have error correction on all levels of cache and potentially other areas as well.
LEON SPARC, at least the version manufactured by Atmel, has extreme hardening and is rated to withstand fairly high levels of radiation without failure. Those run at around 100 MHz and would be used anywhere their operation is critical for safety… commodity AMD or Intel CPUs could be used for the non-critical stuff without too many issues, though ECC would probably be needed to make them at least usable.
Those Atmel chips are rated for 300 krad, which is 30x higher than a fatal full-body dose.
NASA isn’t the only one that worries about stray gamma and cosmic rays causing bit flips. This is a problem for HPC and defense setups, too. The reason is that even 99.99% reliability adds up when the programs you run last for months or run continuously for years. One hiccup can crash a program that has already been running for months.
This is the reason space probes, critical defense systems, and the like have at least triple redundancy. Calculations are made in threes; two computers agreeing makes for an acceptable result. This reduces the likelihood of a unique hardware glitch or a stray radiation event causing a fatal error and a billion-dollar mission failing. Or worse, a stray event alerting to a bogus ICBM launch.
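To put rough numbers on that, here is a quick back-of-the-envelope sketch in Python; the per-hour upset probability is made up purely for illustration, not a measured rate:

# Rough sketch: how a tiny per-hour upset probability compounds over a
# months-long run, and how 2-out-of-3 majority voting helps.
# The numbers are made up for illustration only.
p_upset_per_hour = 1e-4          # assumed chance one machine is corrupted in any given hour
hours = 24 * 30 * 6              # a six-month run

# Probability a single machine gets through the whole run cleanly.
p_single_ok = (1 - p_upset_per_hour) ** hours

# With triple redundancy and majority voting, an hour is only lost if two
# or more of the three machines are corrupted in that same hour.
p = p_upset_per_hour
p_vote_fails_per_hour = 3 * p**2 * (1 - p) + p**3
p_tmr_ok = (1 - p_vote_fails_per_hour) ** hours

print(f"single machine survives six months: {p_single_ok:.3f}")
print(f"voting triple survives six months:  {p_tmr_ok:.6f}")

With these made-up numbers a single machine has roughly a one-in-three chance of being corrupted at some point during the run, while the voting trio almost never loses it.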
It’s interesting that someone requested such a feature in Windows, but it’s kind of odd that they were trying a poorly reasoned software fix for what’s fundamentally a hardware problem, one that requires hardware redundancy to truly compensate for.
Yes, hardening is all about redundancy when it comes to high energy physics and computers.
I’m not sure MS specifically asked for this, I suspect it was a request relayed from the chipmaker.
The request is not really surprising; we build whole facilities so that traders have equal chances on the stock market. With ever thriftier and faster programming, a flipped bit could change a buy to a sell if not for redundancy! A nanosecond might cost you a trade, and that is just about 20 cm of fibre optic!
Over in Europe they thought neutrinos went faster than light just because of a loose fibre optic connector; if only they had some redundancy on that cable!
“Working around shitty hardware in software” has been a common theme of x86 operating systems since the 1980s. Consider the triple-fault necessary to drop to real mode on a 286, or the endless parade of hardware/firmware failures papered over in Windows drivers and uncovered by Linux drivers naively assuming the hardware behaves according to spec.
tidux,
Oh boy, did those PCs have teething problems! Every generation we hit arbitrary limits with DOS, RAM, IDE disk drives, BIOS APIs, DMA/interrupt lines, etc…
It took IBM/clones + Intel a long time to sort this all out because time after time they resorted to bit packing with practically no room for future-proofing. The result is an architecture full of legacy baggage and structures that have integers chopped in half and stuffed wherever they fit in the reserved space.
https://en.wikipedia.org/wiki/Global_Descriptor_Table
I can appreciate that they didn’t anticipate x86’s importance at the beginning, but I feel that by the time of the 386 they should have learned their lesson. Instead they repeated mistakes that would have been foreseeable & preventable if only they bothered to think about the next generation.
Oh, those old times. As much as I begrudge some of the engineering decisions back then, I was actually pretty fond of programming anyway.
The PC wasn’t designed to be future-proofed. IBM designed it much like other manufacturers of the time, with no intent on backwards compatibility. It’s a bloody miracle we’ve managed to stretch it this long.
The123king,
I disagree on backwards compatibility. For all our gripes about the mess we’ve inherited from the past, I think they went above and beyond with backwards compatibility. It’s a bloody miracle we can still run 35+ year old operating systems on modern hardware!
Edit: I’m rereading your post, and I think you meant they had no intention of supporting the architecture in the future, but I’d probably call that forward compatibility rather than backwards compatibility.
Yes, that’s indeed what I meant.
Never seen anyone claim that. Is that a common claim?
I guess it might be possible in theory… but I’ve just never seen anyone claim it.
I suspect most of the common media refer to all sorts of extraterrestrial radiation as cosmic rays, while researchers differentiate cosmic rays, gamma rays, X-rays, etc., as well as terrestrial versus cosmic sources.
There are many studies looking at the effects of terrestrial gamma-ray flashes (TGFs) versus cosmic gamma-ray bursts (GRBs). TGFs are very short events lasting milliseconds or microseconds, often associated with cosmic rays impacting the atmosphere, lightning, sprites, jets or elves. Cosmic gamma-ray bursts can last many seconds and come from stellar sources. There is even some evidence a coincident GRB might have contributed to the extinction of the dinosaurs by partially stripping Earth’s atmosphere.
The people most at risk from TGFs are pilots, given their reluctant but frequent proximity to the most common sources. I gather it’s no accident modern aircraft use triple redundancy. What would be really interesting to know is how often it comes into play to break a deadlock; I bet somebody knows!
While I’m glad you gave me all this information, I’m not so sure it answered my question specifically about cancer. Seems you are saying many are looking into things like that and it does have an effect on some things, so cells could also be one (leading to cancer).
“I’m not sure what the thinking here is. I mean, if the cache might have been zapped by a stray gamma ray, then couldn’t RAM have been zapped by a stray gamma ray, too?”
It could, yes, but main memory is built on a larger process with harder-to-flip bits, and it can also have ECC, while the cache does not – which is probably the real reason for this change request. Even so, it would probably only make a real difference for high-reliability systems… you’d see the problem on commodity hardware just as often, but either it wouldn’t be in a critical area, or if it was, the machine would just hang or throw some other error and a quick power reset would have you going again.
The tradeoff is that whatever was in the cache before going to sleep would be discarded, meaning more power wasted reloading data from RAM and slower execution right after coming out of sleep. Even so, if the system’s cache didn’t have ECC, this seems like it would have been a perfectly fine feature to put in, as this cache invalidation would only happen on the infrequent resumes from S1.
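As an aside, for anyone wondering what ECC actually buys you, here is a toy Hamming(7,4) sketch in Python showing one flipped bit being detected and corrected. Real memory ECC uses wider SECDED codes over 64-bit words, but the principle is the same:

# Toy Hamming(7,4) code: corrects any single flipped bit in a 7-bit codeword.
def encode(d):                      # d: list of 4 data bits
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4               # parity covering codeword positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4               # parity covering codeword positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4               # parity covering codeword positions 4,5,6,7
    return [p1, p2, d1, p3, d2, d3, d4]   # codeword positions 1..7

def decode(c):                      # c: list of 7 code bits
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]  # recheck p1
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]  # recheck p2
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]  # recheck p3
    syndrome = s1 + 2 * s2 + 4 * s3 # non-zero syndrome points at the flipped position
    if syndrome:
        c = c[:]
        c[syndrome - 1] ^= 1        # flip it back
    return [c[2], c[4], c[5], c[6]] # recover the data bits

data = [1, 0, 1, 1]
word = encode(data)
word[5] ^= 1                        # simulate a stray particle flipping one bit
assert decode(word) == data         # the single-bit error is corrected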
I think it is more to do with the fact that cache memory is SRAM, and an SRAM cell is a bi-stable network of gates: push a transistor the wrong way, and the cell flips over using energy from its own power supply. DRAM is not like this — an individual DRAM cell is much smaller than an SRAM bit, but a gamma ray would have to pump it up most of the way from zero to one to damage information; partial bumps would be nulled when the next refresh came only a few milliseconds later.
(edit: This is just speculation based on my understanding of this technology which is debatable but not null).
Lobotomik,
I’ve always wondered what the odds of “cosmic ray corruption” are in relation to different system components.
Be it bi-stable SRAM or DRAM, it would also seem to me that the less energy is used to represent a state, the more vulnerable a component could be to corruption. In theory, it could happen anywhere inside or outside the CPU, including the bus or wires, but I suspect the capacitance (i.e. billions of electrons) sort of mitigates the problem for larger components.
Some CPUs have ECC to correct cache errors, but I don’t know whether cache errors happen in practice or whether it’s just another feature to sell on high-end servers:
https://www.techarp.com/bios-guide/cpu-l2-cache-ecc-checking/
https://en.wikipedia.org/wiki/ECC_memory#Cache
Regardless, I’ve yet to see any ECC corrections on my servers, so I think the problem is relatively rare. However, it could depend on local environmental factors: if you’re on the bottom floor of a building with 40 servers stacked on top of yours, does that make a difference? That would be interesting to find out.
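For what it’s worth, on Linux you can see what the hardware has been reporting through the kernel’s EDAC counters; a small sketch, assuming an EDAC driver is loaded for your memory controller (if it isn’t, the sysfs directory simply won’t exist):

# Print corrected/uncorrected ECC error counts from Linux's EDAC sysfs interface.
# Assumes an EDAC driver is loaded for the memory controller.
import glob, os

for mc in sorted(glob.glob("/sys/devices/system/edac/mc/mc*")):
    def read(name):
        with open(os.path.join(mc, name)) as f:
            return f.read().strip()
    print(f"{os.path.basename(mc)}: "
          f"corrected={read('ce_count')} uncorrected={read('ue_count')}")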
“Hey! Who cares if there are VCSes available where we can look at the code history? Let’s just comment out these 150 lines that are not needed anymore so that we keep the visual noise level high enough.” Gotta LOVE that mindset.
https://blogs.oracle.com/linux/attack-of-the-cosmic-rays-v2
Basically, the writer was able to find a flipped bit in RAM that _might_ have been caused by radiation.
It’s called an SEU (single event upset): https://en.m.wikipedia.org/wiki/Single_event_upset
Not that exotic, I’ve seen it happen. Vendor had to put their gear into a cyclotron to test workarounds. Data centre conditions, not space.
There were problems back in the day with cache being unreliable while RAM was OK:
http://www.sparcproductdirectory.com/artic-2001-dec-1.html
Wonder how many other companies had similar issues.
First, let’s remember that what we call a stored bit is actually a huge (really huge) number of electrical charges or magnetic fields on some component (be it a capacitor, magnetic material or transistor), so a “stray gamma ray” will do nothing; it would take a “focused” source of many (really many) to cause serious harm (or else our brains would also not work and, in a similar but different way, our genetics would fail too).
As pointed out, it may be critical for huge amounts of data kept for a very long period of time, as probability and statistics conspire against them, but for those of us who turn computers on and off all the time and are always modifying data, the risk is a lot lower.
Please, don’t panic.
If you have a huge collection of data you want to keep for many years, backups/rewrites (though rewriting degrades your storage media a bit every time it happens) are your friends.
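In the same spirit, a minimal “scrub” that records checksums once and re-verifies them later will at least tell you when something has silently rotted, so you can restore it from a backup. A rough Python sketch; the paths are placeholders, and filesystems like ZFS or Btrfs can do this kind of scrubbing for you:

# Minimal bit-rot check: record SHA-256 hashes once, then re-verify later.
import hashlib, json, os

def sha256(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def snapshot(root):
    return {p: sha256(p)
            for d, _, files in os.walk(root)
            for p in (os.path.join(d, f) for f in files)}

root, manifest = "/archive", "/archive.manifest.json"   # placeholder paths
if not os.path.exists(manifest):
    with open(manifest, "w") as f:
        json.dump(snapshot(root), f)                     # first run: record hashes
else:
    with open(manifest) as f:
        old = json.load(f)
    bad = [p for p, h in old.items() if os.path.exists(p) and sha256(p) != h]
    print("corrupted or modified files:", bad or "none")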
Maybe they meant cosmic rays. Cosmic rays are much more energetic.
Really, people like to use this argument here on the Earth’s surface, and down here production defects and wear effects are much, much more important. I don’t have the statistics, but I remember reading about this some years ago. Perhaps I can try to find them.
NASA, of course, is much more concerned because the dose of cosmic rays and gamma rays far away from our protective atmosphere is many orders of magnitude larger. A couple of years ago I read that they also had hardened processors built on purpose by IBM, the whole circuit boards were also hardened, and there are also, of course, shields enveloping the most sensitive parts. And let’s not forget redundancy. One form of hardening is, of course, the use of bigger scales for circuits; it is probably one of the reasons they use old process nodes.
What we call cosmic rays is actually composed of a large number of particle types. Some of them are very energetic but easily stopped, like alpha particles, some a little less so, like beta particles or other smaller ones. So it is not a question of just how energetic, but a balance of how penetrating and how ionizing (with cascade effects, secondary particles and all) the thing can be.
There are also some important effects related to surface charging but, I guess, the whole thing is designed to isolate the main parts from that.
acobar,
I don’t think anyone would disagree with you about process defects, obviously addressing that is very important. However that doesn’t really address the effects of various forms of naturally occurring radiation on electronics.
I see lots of info about the theoretical effects, but I don’t see much in terms of “this is what you can expect in practice on modern consumer hardware”.
https://en.wikipedia.org/wiki/Radiation_hardening
If you can find statistics for modern electronics, that would be great! I’d like to learn more about it for sure.
Have some fun reading:
http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
https://pdfs.semanticscholar.org/721a/093aba13979c674878c4076fa28dc8…
https://arstechnica.com/civis/viewtopic.php?f=8&t=802054
Note my commentary about sea level and our beloved atmosphere.
acobar,
That’s interesting, thanks. The Google paper suggests most errors that might be attributed to soft faults (i.e. cosmic radiation) are actually intermittent hard faults. The fact that they are intermittent clearly makes them hard to detect in the manufacturer’s quality control. 8% of the DRAM modules they tested showed these defects every year. This could explain why some of us have faults while others do not; it’s the luck of the batch. Most people are lucky and have no faults, yet a significant number do have faults.
It makes me wonder whether they can build DRAM to better tolerances. Is there a strong correlation with serial numbers? Maybe there could be a way to deliberately provoke intermittently faulty RAM into producing an error at the factory before it is sold?
I’d rather not have RAM that’s intermittently bad, obviously, but given that hard errors are fairly common, perhaps operating systems should have the ability to blacklist bad addresses like we do with disk sectors.
On Windows I don’t know; on Linux it is definitely possible. Some years ago I had a problem with an Atom netbook that I liked because it was really lightweight, so I did some research to keep it useful; in the end, I opted to buy RAM. Look up using “memmap” to blacklist bad memory; I’m sure you are going to find lots of info.
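For example (the size and address below are made up – use whatever faulty range your testing identifies), adding something like this to the kernel command line tells Linux to treat that physical range as reserved and never allocate from it:

memmap=64K$0x7a120000

Note that the “$” usually needs escaping if you set this through the GRUB configuration.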
As usual, our last discussion is closed, so I thought I’d post some more about economics here 🙂
https://www.youtube.com/watch?v=lq3s-Ifx1Fo
http://www.volatilityinvesting.co.uk/sites/default/files/Stupid~*~@…
While I don’t know if everything in it is correct, what was interesting to me is how he says the market has moved so much in lock-step, and that this is why the financial industry ended up moving more and more into risky instruments, CDOs, etc.
Lennie,
Yes, if only more people in the US who refuse to take the big bribes would go into politics.
Lennie,
Well, I think it’s a problem of fitness functions. Those who do not lean that way have limited opportunities. The winning formula is to *say* you are for the people and against the swamp (to win the votes), meanwhile you nevertheless partake in the very same corruption and promotion of big money interests. For better or worse, this is how politics works; those who lie and partake in the corruption are more “fit” for politics than those who are honest and put integrity first.
Politics has evolved to produce corruption in part because we don’t have a fitness function that eliminates it. There’s just no accountability.
Maybe the current situation is a reflection of the population, and as such quite representative… (how many of the electorate would really say no to their own slice of the political cake, given the chance to participate?)
PS. And Alfman seems to almost/kinda suggest (writing how “Politics has evolved to produce corruption”) that it used to be better… but corruption was probably more common in the past – it’s just what the ruling classes did (historically, they employed quite corrupt means to keep themselves in power).
Will the Windows kernel and/or the Visual C++ compiler ever be published under an open-source license?
The Visual C++ compiler is important in this case, because the Windows kernel is compiled with it.
And the build tool used for the kernel, MSBuild (https://github.com/Microsoft/msbuild/), is already open source.
And the Windows Driver Frameworks (which are a small part of the Windows kernel) are open source, too.
https://github.com/Microsoft/Windows-Driver-Frameworks
https://www.osr.com/blog/2015/03/18/windows-source-code-now-github/
It is interesting that Windows is the last big operating system that is (nearly) completely closed source.
Even macOS has an open-source kernel (XNU), open-source tools based around it (Darwin) and an open-source compiler that compiles it (LLVM/Clang).
And the other systems, like Linux, *BSD, Haiku, etc., are completely open source.
Only Windows is (nearly) completely closed.
That is also a security problem, because security by obscurity doesn’t work.
The bad guys are decompiling Windows and finding security holes in it, which they can use to write viruses or worms.
But the good guys are prevented by the license from looking at it.
I know Microsoft is doing a lot in open source: Visual Studio Code, .NET Core, etc. And all of it runs on different operating systems.
The funny thing is that the tool ProcDump for Linux is open source
https://github.com/microsoft/procdump-for-linux
whereas the Windows version – like all the Windows Sysinternals tools – doesn’t allow reverse engineering, decompiling, making more than a limited number of copies of the program, publishing the software for others, etc.
https://docs.microsoft.com/en-us/sysinternals/
https://docs.microsoft.com/en-us/sysinternals/license-terms
And the next big thing is ProcMon for Linux
https://twitter.com/MarioHewardt/status/1058900124145860608
So, there are good reasons to use some Microsoft products like VS Code, .NET Core, etc. But there are no good reasons to use them on Windows.
As I said, Windows is one of the few operating systems where neither the kernel nor the compiler with which it was built is open.
Three years ago (in 2015), it was “definitely possible” that it would become open source.
https://www.wired.com/2015/04/microsoft-open-source-windows-definite…
And today?
Interestingly, the first GitHub commits of MSBuild and the Windows Driver Frameworks are from 2015, too.
Open Source is not a better solution for security.
Just create an open-source program with big holes and rest assured you can use it “safely” every day, because in fact nobody cares to look at what you did.
Now, hundreds of people use your software. How many of them will care to look at, understand or analyse your code?
If you keep your code to yourself, will your app be any less secure?
The main difference is that some random people could analyse your code and may send you some merge requests. Or they can exploit the security holes they found…
Same thing for closed source: random people may inform you that they found security holes. Or exploit them…
The security of a piece of software depends on what people do with the security holes they find, not on the way they find them.
No. The security of software is based on many complex factors; whether the source is open or closed is one important factor. Openness allows honest security researchers to assess the architecture, design and implementation of security within the software.
I should note that Microsoft *does* offer companies the chance to review the source code for Windows and many other products under a very restrictive read-only license. So it gets some of the benefit that open source would also allow while not being completely open.
Absolutely. There is an agreement with external people. Security is maintained by the fact that if a company uses what they saw in the source code, they’ll have to pay. A lot.
The security of open-source code relies on the honesty of the people who read the code, as we said.
Don’t misunderstand me, I’m not saying closed source is better than open source (nor the reverse). As long as there are more honest people than dishonest ones, that’s fine.
RAM on architectures that require error correction usually has several layers of ECC and buffering, but the CPU cache does not have that capability (that I am aware of), so this request makes perfect sense for hardware that requires that kind of reliability.
Somebody had observed rare ‘buggy’ behaviour and just wanted a version they could test that would rule this particular problem out?
Pure speculation….
Hi,
The INVD instruction doesn’t write data back to RAM; so if the cache contains the most recent copy of some data (and the RAM contains an obsolete version), INVD will discard the most recent copy and cause the data to revert to the obsolete copy in RAM (essentially, causing “random” data to be corrupted because there’s no way to tell what will be in cache and how old the copy in RAM might be).
The correct method would be to disable caches (to prevent data from being “re-cached” immediately after its invalidated), then use WBINVD (to write the current copy of data back to RAM and invalidate the cache), then enter S1 state. That way while the CPU is in the S1 state there’s no data in cache to be corrupted by gamma rays. Of course when leaving S1 state you could INVD (knowing that there can’t be any valid data in cache) in case a cache tag got hit in the exact right place (causing cache to think it contains data when it doesn’t) before re-enabling caches; and in that case the INVD “should” do nothing.
Mostly; if you take educated guesses about the surrounding code (extrapolate from the “broken in isolation” snippet); it doesn’t sound like a bad idea at all – either (extremely likely) the INVD does nothing or (extremely unlikely) it fixes a real problem.
Note: RAM corruption isn’t relevant for this issue; and the “don’t fix problem A because there’s a separate and unrelated problem B” logic people seem to be using is flawed. A better argument would be that the performance cost of INVD isn’t justified by the (extremely unlikely) chance that it’ll help; but I’m not entirely sure how much INVD costs for the “cache contains nothing anyway” case and I doubt computers are entering/leaving S1 state so often that it makes a practical difference.
– Brendan
Brendan offers a very balanced perspective.
Further on the redundancy issue, in programmable logic it’s not uncommon to require a process to confirm a signal multiple times before accepting it. In critical systems I know engineers who will perform a read ten or more times before accepting that the signal is real; in some applications you might never get a second chance!
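In software the same idea looks roughly like this – a toy Python sketch, where read_signal() is just a stand-in for whatever the real input source is:

# Toy "read it ten times before you believe it": only accept a signal level
# after `required` consecutive identical reads. read_signal() is a placeholder
# for the real input source (a register, an ADC, a status line, ...).
def confirmed_read(read_signal, required=10):
    last, streak = None, 0
    while streak < required:
        value = read_signal()
        streak = streak + 1 if value == last else 1
        last = value
    return last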
On the performance cost, engineers do not choose to spend this amount of resource without good reason. I can understand why a game programmer might argue against it, but it’s hard to imagine someone developing the next defibrillator or some other critical application making the same argument.
I expect someone will make a smart comment about a Windows defibrillator, but I stand by my assertion that this is not an MS Windows issue; it was a recommendation coming from the hardware developer that found its way into the code for very specific reasons. But because it has been found in some legacy code with no immediate everyday use, MS is a popular and easy target!
With this mindset, you would avoid the Spectre and Meltdown mitigations because of the low potential risk yet big performance impact.
It’s the same stupid debate as monolithic vs. micro-kernel architecture. “Boohoohoo, a 5% performance drop is unbearable,” but accepting your whole system being taken down in case of a problem is “acceptable”?
If INVD has such a low impact on performance, yet *might* secure processing, why even hesitate? Copy it everywhere and replace all the useless NOPs with it!
An entire article about coding to deal with gamma rays and no mention of The Hulk or Bruce Banner?