Microsoft’s CrowdStrike post-mortem

Thom Holwerda 2024-07-29 Windows 29 Comments

Microsoft has published a post-mortem of the CrowdStrike incident, and goes into great depth to describe where, exactly, the error lies, and how it could lead to such massive problems. I can’t offer anything insightful on the technical details and code they show to illustrate all of this – I’ll leave that discussion up to you – but Microsoft also spends a considerable amount of time explaining why security vendors are choosing to use kernel-mode drivers.

Microsoft lists three major reasons why security vendors opt for using kernel modules, and none of them will come as a great surprise to OSNews readers: kernel drivers provide more visibility into the system than a userspace tool would, there are performance benefits, and they’re more resistant to tampering. The downsides are legion, too, of course, as any crash or similar issue in kernel mode has far-reaching consequences. The goal, then, according to Microsoft, is to balance the need for greater insight, performance, and tamper resistance with stability.

And while the company doesn’t say it directly, this is clearly where CrowdStrike failed – and failed hard. While you would want a security tool like CrowdStrike to perform as little as possible in kernelspace, and conversely as much as possible in userspace, that’s not what CrowdStrike did. They are running a lot of stuff in kernelspace that really shouldn’t be there, such as the update mechanism and related tools. In total, CrowdStrike loads four kernel drivers, and much of their functionality can be run in userspace instead.

It is possible today for security tools to balance security and reliability. For example, security vendors can use minimal sensors that run in kernel mode for data collection and enforcement limiting exposure to availability issues. The remainder of the key product functionality includes managing updates, parsing content, and other operations can occur isolated within user mode where recoverability is possible. This demonstrates the best practice of minimizing kernel usage while still maintaining a robust security posture and strong visibility.
Windows provides several user mode protection approaches for anti-tampering, like Virtualization-based security (VBS) Enclaves and Protected Processes that vendors can use to protect their key security processes. Windows also provides ETW events and user-mode interfaces like Antimalware Scan Interface for event visibility. These robust mechanisms can be used to reduce the amount of kernel code needed to create a security solution, which balances security and robustness.
↫ David Weston, Vice President, Enterprise and OS Security at Microsoft

In what is surely an unprecedented event, I agree with the CrowdStrike criticism bubbling under the surface of this post-mortem by Microsoft. Everything seems to point towards CrowdStrike stuffing way more things in kernelspace than is needed, and as such creating a far larger surface for things to go catastrophically wrong than needed. While Microsoft obviously isn’t going to openly and publicly throw CrowdStrike under the bus, it’s very clear what they’re hinting at here, and this is about as close to a public flogging as we’re going to get.

Microsoft’s post-mortem further details a ton of work Microsoft has recently done, is doing, and will soon be doing to further strengthen Windows’ security, to lessen the need for kernelspace security drivers even more, including adding support for Rust to the Windows kernel, which should also aid in mitigating some common problems present in other, older programming languages (while not being a silver bullet either, of course).
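To make the "common problems" a little more concrete: CrowdStrike's preliminary review describes the fault as an out-of-bounds read. The following is only a minimal sketch of how a bounds-checked language handles that class of mistake. It is not CrowdStrike's or Microsoft's code, and the function names (`field_at`, `checksum`) and the idea of a content record as a slice of `u32` fields are assumptions made purely for illustration.

```rust
/// Returns the field at `index`, or `None` if the record is too short.
/// (Hypothetical helper: a "record" here is just a slice of numeric fields.)
fn field_at(fields: &[u32], index: usize) -> Option<u32> {
    // `get` performs the bounds check and never reads past the end of the slice.
    fields.get(index).copied()
}

/// Sums the first `expected` fields, rejecting short records up front
/// instead of reading whatever happens to lie beyond them.
fn checksum(fields: &[u32], expected: usize) -> Result<u32, String> {
    if fields.len() < expected {
        return Err(format!(
            "expected {} fields, got {}",
            expected,
            fields.len()
        ));
    }
    // The slice range is known to be in bounds because of the check above.
    Ok(fields[..expected]
        .iter()
        .fold(0u32, |acc, v| acc.wrapping_add(*v)))
}
```

Note the "not a silver bullet" caveat applies here too: plain indexing such as `fields[index]` would panic on a bad index rather than read stray memory, and a panic inside a kernel driver is still a crash. The gain comes from handling the `None` or `Err` case explicitly.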

29 Comments

2024-07-29 10:48 am
Alfman verbose=1
Thom Holwerda,
And while the company doesn’t say it directly, this is clearly where CrowdStrike failed – and failed hard.
What’s so astonishing is that crowdstrike releases updates without any staging. They just go out to the world such that everyone with computers turned on is affected simultaneously and it leaves no time to apply the brakes when an update fails. Some Microsoft updates have failed hard too; a few years ago windows 10 updates actually caused data loss. Microsoft at least has the sense to stage updates. Even with the bug, staging could have prevented the global crowdstrike calamity affecting nearly all their customers.
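For readers unfamiliar with the term, staging can be sketched roughly as below. This is an illustrative assumption about how a staged rollout might be structured, not a description of how Microsoft or CrowdStrike actually ship updates; `bucket`, `should_receive`, `next_percent` and the 1% → 5% → 25% → 100% progression are all hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Maps a host identifier to a stable bucket in 0..100.
/// `DefaultHasher::new()` uses fixed keys, so the result is repeatable for a
/// given Rust release; a real system would pin a specific hash algorithm.
fn bucket(host_id: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    host_id.hash(&mut hasher);
    hasher.finish() % 100
}

/// A host receives the update once the rollout percentage exceeds its bucket,
/// so widening the rollout only ever adds hosts and never removes them.
fn should_receive(host_id: &str, rollout_percent: u64) -> bool {
    bucket(host_id) < rollout_percent.min(100)
}

/// Decides the next stage. Returns `None` to halt the rollout when the crash
/// rate observed in the current stage exceeds the allowed threshold.
fn next_percent(current: u64, crash_rate: f64, max_crash_rate: f64) -> Option<u64> {
    if crash_rate > max_crash_rate {
        None
    } else {
        // Widen in steps of 5x, capped at 100 percent.
        Some((current.max(1) * 5).min(100))
    }
}
```

The point of the sketch is the brake: a bad update that crashes the first 1% of hosts makes `next_percent` return `None`, and the remaining 99% never see it.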
While you would want a security tool like CrowdStrike to perform as little as possible in kernelspace, and conversely as much as possible in userspace, that’s not what CrowdStrike did. They are running a lot of stuff in kernelspace that really shouldn’t be there, such as the update mechanism and related tools. In total, CrowdStrike loads four kernel drivers, and much of their functionality can be run in userspace instead.
I don’t know much about the product, but kernel support for updates might be necessary to eliminate downtime and the need to reboot, although I do appreciate the irony here.
Microsoft’s post-mortem further details a ton of work Microsoft has recently done, is doing, and will soon be doing to further strengthen Windows’ security, to lessen the need for kernelspace security drivers even more, including adding support for Rust to the Windows kernel, which should also aid in mitigating some common problems present in other, older programming languages (while not being a silver bullet either, of course).
It should be no surprise at all that, once again, unsafe languages are the culprit. What we actually need are operating systems that are built using safe languages from the ground up, NOT operating systems that merely support safe languages. Unfortunately I don’t see that happening. Safer operating systems exist, but does anyone really think they’re going to displace the incumbents? It’s always the same “we’re taking steps to make sure it doesn’t happen again”, but the underlying root cause is unsafe languages. Both microsoft and linux are too invested in the existing OS to replace unsafe language practices at their core.
2024-07-29 11:29 am
steveftoth
I am still in favor of placing the blame with the people who installed the CrowdStrike drivers in the first place. Companies are installing an optional piece of software on their systems which they should be able to control. Users of CrowdStrike or any AV software, if they have uptime requirements, need a plan for recovery to ensure that any downtime is minimized. This was the true failing. CS is just one of many systems that could fail at any time on modern computer systems, and the idea that many systems will be down for weeks because of a bad software update is the real elephant in the room.
Yes CS can be improved, but there are still harder problems to solve here. Most have to do with cost since the reason that most of these systems had the auto-update and ‘latest version’ options set was because orgs wanted to save money and feel safe (save more money).

2024-07-29 7:16 pm
lancealot
I agree with the idea of placing a big part of the blame with companies that decided on using Crowdstrike in the first place. For the most part this didn’t affect normal users, this affected large Fortune 500 companies and large government organizations. You would assume both of those would be able to have a well paid and competent IT department. I personally think the decision to use Crowdstrike at a lot of these places came down to management making an uninformed choice based on what other management or marketing promises have told them. Any good IT department worth a grain of salt would have realized that running any software at the OS kernel level is dangerous, and even more dangerous if this module could be fed updates to software running at this level. Pushing updates to something at the kernel level on a real-time ongoing basis is very very risky. In cases like this you need to consider your options carefully.
Another set of the blame is pointed at Crowdstrike for not staging the software pushes. This was a major mistake by Crowdstrike, but once again I blame the people who were using Crowdstrike since it didn’t allow for lower level of approving and staging of updates by the organization themselves. Crowdstrike says in their remediation (https://www.crowdstrike.com/falcon-content-update-remediation-and-guidance-hub/): “Provide customers with greater control over the delivery of Rapid Response Content updates by allowing granular selection of when and where these updates are deployed.”. What good IT tech department worth a grain of salt would have installed software that didn’t allow this from the get go, especially software as mentioned above that is very dangerous in nature due to running at the kernel level?
So I agree with this point and say it came down to incompetent IT departments at these companies, or more likely incompetent management that decided to pay for (it is not cheap) and use Crowdstrike.

2024-07-29 12:06 pm
Geck
So after all this the conclusion is Rust will save us all? Yes, we are doomed.

2024-07-29 1:16 pm
Alfman verbose=1
Geck,
So after all this the conclusion is Rust will save us all? Yes, we are doomed.
Nobody is claiming bugs cannot be written in safe languages, but the classes of bugs that safe languages do protect us from should have been exterminated by now. There’s always somebody who thinks “I know better than to make such mistakes. We don’t need to change languages, we just need more competent devs” or whatever….but it’s this failure to mitigate human error that keeps us repeating the same software faults over and over and over again.
To be fair, I think lots of devs understand the importance and I am happy that rust is being embraced as much as it is. But most of that is side projects. IMHO we need to make stronger commitments to significantly reduce our dependence on unsafe languages at the core of critical kernels and infrastructure. It’s really not a windows versus linux (or macos) thing, they are all built around legacy languages and while this is going to make some people uncomfortable, all of our dominant operating systems are kind of in the way of progress.

2024-07-29 1:42 pm
Geck
A while back i was reading some news in regards to Cosmic desktop environment and one thing that i noticed in comment section was somebody complaining Cosmic desktop environment decided to implement a tray area. Like being extremely offended by it, due to i assume being totally brainwashed by GNOME. The reason i shared this anecdote here is i find the conclusion, “Rust will save us all”, i find such conclusion in the same league of craziness, for Rust to even be mentioned here. Totally irrelevant, when it comes to Microsoft Windows and “safety” in at least the next 30 years or so. That is if Microsoft Windows will even exist then.

2024-07-29 2:19 pm
Alfman verbose=1
Geck,
The reason i shared this anecdote here is i find the conclusion, “Rust will save us all”, i find such conclusion in the same league of craziness,
I don’t see anybody trying to portray rust as a panacea that will save us all, including the article from microsoft. Criticizing a line that you wrote yourself comes across as a straw man. Everyone knows there were a multitude of failures here.
for Rust to even be mentioned here. Totally irrelevant, when it comes to Microsoft Windows and “safety” in at least the next 30 years or so
Why is it irrelevant? The bug identified is a natural byproduct of programming with unsafe languages.
CrowdStrike recently published a Preliminary Post Incident Review analyzing their outage. In their blog post, CrowdStrike describes the root cause as a memory safety issue—specifically a read out-of-bounds access violation in the CSagent driver.
It seems very relevant to me. If ever there was a good time to discuss the dangers of and potential alternatives to unsafe programming, why not now?

2024-07-29 3:51 pm
Geck
Because it’s mostly irrelevant and we should value our time better. Believing if Microsoft Windows would actually get rewritten in Rust, it won’t, we wouldn’t have a similar discussion here today, or at that point in the future. We would. So, this Rust side projects, when it comes to Microsoft Windows, why not, if that is what Microsoft believes will make Windows future proof or as you say “safe”. But saying that they are now actually solving such issues, from occurring in the future, or claiming that Rust is the solution, and things will only get better, now that has little to do with reality, when it comes to Microsoft Windows. Microsoft Windows is basically legacy software and i don’t see a realistic scenario on how that could change in foreseeable future. It is what it is and the rest is mostly PR BS.
2024-07-29 6:33 pm
Alfman verbose=1
Geck,
Because it’s mostly irrelevant and we should value our time better.
I think you are underestimating just how immense the social costs of such vulnerabilities are. They expose us to an onslaught of both accidental flaws as well as malicious attacks.
Believing if Microsoft Windows would actually get rewritten in Rust, it won’t, we wouldn’t have a similar discussion here today, or at that point in the future.
What gave you the impression I believed that?
A quote from myself should put that to rest…
What we actually need are operating systems that are built using safe languages from the ground up, NOT operating systems that merely support safe languages. Unfortunately I don’t see that happening.
…
Both microsoft and linux are too vested in the existing OS to replace unsafe language practices at their core.
Safe languages may not be important to you, but you of all people should understand the importance of being an advocate for progress.
Microsoft Windows is basically legacy software and i don’t see a realistic scenario on how that could change in foreseeable future. It is what it is and the rest is mostly PR BS.
To you this is all a linux versus windows platform flame war. But your criticism of windows could just as well have been said of linux too: “Linux is basically legacy software and i don’t see a realistic scenario on how that could change in foreseeable future”. Unsafe programming is an industry-wide problem. Everyone should be taking stock of where we need to improve. We know that more of the same won’t fix it.
2024-07-29 7:33 pm
Geck
OK so the main objective would be to reduce memory safety oriented issues, so far so good. But then neither you or i believe Windows will ever really achieve that. Beyond some gradual and partial progress being made.So in the end it doesn’t matter if i or you believe such languages are important or not as in regards to Windows they won’t make much difference. That was my point. As for Linux i feel that once Linus will retire and the new generation to take over, there likely indeed will be a big void involved. Like lets say once they silenced Stallman, new generation of FOSS philosophers is more focused on things like gender and minorities, a lot of time being toxic about it. Software itself and relations to things like other software models out there, there is no ongoing discussion about that any more. We are riding on fumes here, past achievements. The rest is left to big corporations, opinion making and BS.
2024-07-29 10:44 pm
Alfman verbose=1
Geck,
OK so the main objective would be to reduce memory safety oriented issues, so far so good. But then neither you or i believe Windows will ever really achieve that. Beyond some gradual and partial progress being made.So in the end it doesn’t matter if i or you believe such languages are important or not as in regards to Windows they won’t make much difference.
This is exactly why we need to keep talking about safe languages. Regardless of my feelings about microsoft, I’ll defend both microsoft’s Vice President of Enterprise and OS Security as well as Thom Holwerda for highlighting the efforts to mitigate these kinds of faults using safer languages. I should thank you too for providing the opportunity to have this debate. Just having these debates at all is a net positive for safe languages. I’m not concerned about people taking a stance against safe languages, I am more concerned about the apathy when people don’t talk about them. When outages happen, we should want executives, project managers, and the general public to make the connection between unsafe programming languages and the corresponding faults that cost them money, left them stuck at the airport, broke hospital systems, etc. This helps to build grass roots momentum that cannot happen if we don’t talk about these issues.
2024-07-30 8:46 am
Geck
I don’t know to be honest. If somebody would ask me is current generation capable of producing and maintaining something like GNU/Linux or even Microsoft Windows, for the next couple of decades, my honest answer would be no way, they simply don’t have what it takes. And now this generation, due to i guess being incapable of achieving that, is trying to glue Rust in existing software stacks, predominately into C/C++ software stacks. Now where could this possibly go wrong? What will for example happen, when they will lose interest or some new and shiny thing will emerge, possibly promoted by a generation that will succeed the current one? A new programming language, trying to replace a few decades old Rust. If Microsoft Windows is not suitable for modern age desktop computing any more, then Microsoft or somebody else will need to produce a new solution and try to compete with Windows. As surely a new and modern operating system, “safe”, concurrent, asynchronous, bla bla … Windows doesn’t stand a chance. Right?
2024-07-30 10:38 am
Alfman verbose=1
Geck,
What will for example happen, when they will lose interest or some new and shiny thing will emerge, possibly promoted by a generation that will succeed the current one? A new programming language, trying to replace a few decades old Rust.
Building up momentum is a slow process, but I don’t think it is futile. If we do a good job educating people and applying pressure, then progress is possible. But we need clear messaging: safe languages don’t just matter to geeks, they matter for real people too. Those who’ve been burned by repercussions will not forget because their interest in safe software becomes more personal. We need to make sure they make the mental connection that the flight to their own non-refundable honeymoon was canceled over preventable software vulnerabilities. Or their data was breached because of vulnerabilities that wouldn’t have existed using a safe language. To this end I’m glad microsoft’s president of enterprise security is mentioning their rust initiative in connection to outages caused by an invalid memory access. At least some of the executives reading that will want to avoid becoming the next crowdstrike, they may start their own safe code initiatives.
If Microsoft Windows is not suitable for modern age desktop computing any more, then Microsoft or somebody else will need to produce a new solution and try to compete with Windows. As surely a new and modern operating system, “safe”, concurrent, asynchronous, bla bla … Windows doesn’t stand a chance. Right?
Competition drives change like nothing else can. I have little doubt that both linux and windows could pivot in response to serious competitive threats to their existence. The problem is their roles are balanced in the market and this allows them to remain complacent. Still they’re both taking baby steps matching each other’s moves in terms of embracing rust/safe languages. The slow/steady path towards progress is still progress.
The issue I have is that the incremental approach can create more evolutionary baggage than designs that start with a clean slate. Both windows and linux already carry a lot of evolutionary baggage. As a developer I strongly prefer having a clean slate, but as a realist I’m not so sure it works commercially. There’s also the fact that traditionally the microsofts and googles have a reputation for being very noncommittal on big changes, which frankly consumers can sense. So maybe slow incremental changes are the way we have to move forward.

2024-07-29 4:36 pm
Adurbe
Microsoft have been building it in since 2019 with Project Verona

2024-07-29 5:02 pm
Geck
Exactly. Each big IT company inventing their own pet programming language, all with an occasional PR on how everybody should use Rust to be “safe” and congratulating themself on how they actually added a couple lines of code written in Rust. All of them in reality using C/C++ for majority of their software stack. It’s really hard to take them seriously in this regards and in any meaningful way.

2024-07-29 6:48 pm
Alfman verbose=1
Geck,
Exactly. Each big IT company inventing their own pet programming language, all with an occasional PR on how everybody should use Rust to be “safe” and congratulating themself on how they actually added a couple lines of code written in Rust. All of them in reality using C/C++ for majority of their software stack. It’s really hard to take them seriously in this regards and in any meaningful way.
Developers are notorious for resisting change. It’s quite natural for us to stay attached to what we grew up with. I do see growing interest around safe languages, but it’s just difficult to achieve critical mass when older generations are the principal project managers. Still, we will all inevitably retire out of the market and the uptake of safe languages by new generations could help us turn the corner.

2024-07-29 7:16 pm
Geck
Actually the main problem here is software rewrites tend to not be successful. Due to cost, time needed, technical and cultural challenges involved … On top of that how could you justify rewriting Windows, even if you succeed, you would still end up with something being considered outdated. Like lets say rewriting Xorg in Rust. And even here i feel that Xorg has much better chance in achieving that.
2024-07-29 10:05 pm
Alfman verbose=1
Geck,
Actually the main problem here is software rewrites tend to not be successful. Due to cost, time needed, technical and cultural challenges involved …
That’s true but you cannot forget to factor in the multiplying effects of time. Something that’s bad isn’t just a one time cost, it’s an ongoing cost that accumulates. Even something that’s hard and expensive to change can still be worth it long term. Our problem is that society often puts short term thinkers in positions of leadership.
It reminds me of the national debt. Politicians get cheered on for short term perks like cutting business taxes. But when their plan doesn’t follow sound economic principles, it becomes unsustainable and hurts us long term. This ultimately leaves us much worse off.
https://www.cbsnews.com/news/federal-debt-interest-payments-defense-medicare-children/
Federal spending on interest payments is forecast to hit $870 billion this year — exceeding the $822 billion that the nation will spend on defense in 2024
…
The nation’s ballooning debt stems chiefly from tax cuts enacted by former President Donald Trump in 2017, as well as the surge in federal aid to keep the economy afloat during the pandemic (assistance authorized by both Trump and President Joe Biden). On top of that, with the Federal Reserve turning to its most effective anti-inflation tool — higher interest rates — the U.S. is paying more for its growing pile of debt.
That’s steering the U.S. into uncharted territory, according to some policy experts. The problem, they say, is that the nation’s mounting debt and interest payments could eventually squeeze federal spending, making it harder to fund core programs like Social Security or to invest in initiatives that drive economic growth, such as infrastructure or education.
Having leaders that only look at the short term hurts our long term ambitions.
2024-07-30 8:32 am
Geck
Still, the idea Microsoft Windows will get rewritten in Rust. No way. As for the debt:
https://www.usdebtclock.org/
2024-07-30 10:58 am
Alfman verbose=1
Geck
Still, the idea Microsoft Windows will get rewritten in Rust. No way.
But I take it that you accept my point that long term the cost/benefits start to favor addressing the deficiencies that keep compounding costs over time rather than ignoring them in perpetuity?
For the record, I just want there to be a competitive viable OS that uses safe programming and I don’t particularly care that it be windows. However I do think the existence of such a competitor would light a fire under microsoft to accelerate their own safe language initiatives as a response.

2024-07-29 12:27 pm
Shiunbird
Habsburg IT. Complex populations without genetic variety can be easily driven into extinction by environmental incidents.
I really don’t get why everything must run Linux or Windows. Why aren’t airport screens just dumb character terminals? Check-in desks could go back to being dumb terminals like back in the day as well. But no, they must include all the Windows stack including support for serial modems, joysticks from the 90s, WoW, etc, etc.. The only valid reason why information screens in airports must run full-stack Windows is to be able to blast ads.
Variety and competition bring resilience to the system. These monopolies will be our doom.

2024-07-29 7:09 pm
Bill Shooter of Bul Platinum Prime
Then you would join me in applauding Southwest airlines, which is still using windows 3.1 and windows 95, which was incompatible with CrowdStrike

2024-07-29 10:56 pm
djhayman
No they aren’t: https://www.osnews.com/story/140301/no-southwest-airlines-is-not-still-using-windows-3-1/

2024-07-29 11:42 pm
Alfman verbose=1
djhayman,
No they aren’t:
I assumed Bill Shooter of Bul had read that too and was being sarcastic, but it shows that sarcasm doesn’t always play without an obvious tell. Anyway you are right that southwest airlines isn’t running on windows 3.1, it is in fact reactos.

2024-07-30 1:43 am
cpcf
All jokes aside, many years ago I nearly had to implement a ReactOS solution for a site using legacy Win95/NT 4.0 era hardware with a similarly vintage machine control application. Ironically, the thing that ultimately stopped me was driver issues; if I had to do it today I’d probably get started with Wine.

2024-07-29 9:11 pm
cpcf
Who other than an agent of chaos genuinely thinks a blanket roll out of untested code to mission critical infrastructure is a good idea?

2024-07-31 1:37 am
tux2bsd
There is no way it wasn’t tested. It wasn’t tested well enough.
You get what you deserve when you allow a “security” outfit to rootkit your machines.

2024-07-31 2:18 am
Alfman verbose=1
tux2bsd,
There is no way it wasn’t tested. It wasn’t tested well enough.
The reports were it passed their automated testing. And the employee Creed working at Q/A blew it off.
https://www.youtube.com/watch?v=glfRtqbHJjs
And even if there were employees responsible for Q/A, the routine nature of their job may have led them to be too confident in the automated testing. It’s blindingly obvious after the fact, but things can still go wrong for customers despite internal testing. So it seems especially irresponsible that patches were released to all customers simultaneously with no staging to prevent bugs from hitting all customers all at once.
Edit:
Maybe Creed was involved and blew off the Q/A.
Creed Almost Destroys Dunder Mifflin – The Office
https://www.youtube.com/watch?v=glfRtqbHJjs

2024-07-31 2:24 am
Alfman verbose=1
Edit:
Ah shoot, didn’t have enough time to finish my edit, haha.
Was trying to find another clip where Creed throws another employee under the bus.