CrowdStrike blames bug that caused worldwide outage on faulty testing software
The faulty update caused an out-of-bounds memory read that triggered an 'unrecoverable exception.'
CrowdStrike has blamed faulty testing software for a buggy update that crashed 8.5 million Windows machines around the world, it wrote in a post-incident review (PIR). "Due to a bug in the Content Validator, one of the two [updates] passed validation despite containing problematic data," the company said. It promised a series of new measures to avoid a repeat of the problem.
The massive BSOD (blue screen of death) outage impacted multiple companies worldwide including airlines, broadcasters, the London Stock Exchange and many others. The problem forced Windows machines into a boot loop, with technicians requiring local access to machines to recover (Apple and Linux machines weren't affected). Many companies, like Delta Air Lines, are still recovering.
To prevent DDoS and other types of attacks, CrowdStrike has a tool called the Falcon Sensor. The sensor ships with content that functions at the kernel level (called Sensor Content) and uses a "Template Type" to define how it defends against threats. If something new comes along, CrowdStrike ships "Rapid Response Content" in the form of "Template Instances."
A Template Type for a new sensor was released on March 5, 2024, and performed as expected. However, on July 19, two new Template Instances were released and one (just 40KB in size) passed validation despite having "problematic data," CrowdStrike said. "When received by the sensor and loaded into the Content Interpreter, [this] resulted in an out-of-bounds memory read triggering an exception. This unexpected exception could not be gracefully handled, resulting in a Windows operating system crash (BSOD)."
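For readers unfamiliar with the term, an out-of-bounds read happens when code asks for data at a position that doesn't exist. The Python sketch below is only an illustration of that failure class and of the kind of bounds check that catches it. The `TemplateInstance` class, its `field_indexes` attribute and the `validate` and `interpret` functions are hypothetical names invented for this example, not CrowdStrike's actual format or code. Python would also raise an error here on its own, whereas kernel-level code reading past the end of a buffer can crash the whole operating system.

```python
from dataclasses import dataclass


@dataclass
class TemplateInstance:
    """Hypothetical stand-in for a content entry; not CrowdStrike's format."""
    name: str
    field_indexes: list[int]  # positions of the input fields this entry reads


def validate(instance: TemplateInstance, available_fields: int) -> bool:
    """Reject any entry that refers to a field the sensor will not supply."""
    return all(0 <= i < available_fields for i in instance.field_indexes)


def interpret(instance: TemplateInstance, inputs: list[str]) -> list[str]:
    """Read the referenced fields, checking bounds again at load time."""
    if not validate(instance, len(inputs)):
        raise ValueError(f"{instance.name}: field index out of range")
    return [inputs[i] for i in instance.field_indexes]


inputs = ["a", "b", "c"]                      # three fields, indexes 0 to 2
good = TemplateInstance("good", [0, 2])
bad = TemplateInstance("bad", [0, 3])         # index 3 does not exist

print(interpret(good, inputs))                # ['a', 'c']
try:
    interpret(bad, inputs)
except ValueError as err:
    print("rejected:", err)
```

The point of checking in both places is that a validator bug alone no longer lets bad data reach the read: the interpreter refuses the entry and reports an error instead of reading past the end of its inputs.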
To prevent a repeat of the incident, CrowdStrike promised to take several measures. First is more thorough testing of Rapid Response Content, including local developer testing, content update and rollback testing, stress testing, stability testing and more. It's also adding validation checks and enhancing error handling.
Furthermore, the company will start using a staggered deployment strategy for Rapid Response Content to avoid a repeat of the global outage. It'll also provide customers greater control over the delivery of such content and provide release notes for updates.
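A staggered deployment generally means pushing an update to a small slice of machines first and widening the rollout only if that slice stays healthy. The sketch below shows one way that idea can work, under stated assumptions. The `staggered_rollout` function, the wave fractions, the `deploy` callback and the 1 percent failure threshold are all hypothetical choices made for illustration. CrowdStrike has not published how its own rollout will be implemented.

```python
from typing import Callable


def staggered_rollout(
    hosts: list[str],
    wave_fractions: list[float],
    deploy: Callable[[str], bool],
    max_failure_rate: float = 0.01,
) -> tuple[list[str], bool]:
    """Deploy in growing waves; stop when a wave's failure rate is too high.

    Returns the hosts that received the update and whether the rollout
    reached the whole fleet.
    """
    reached: list[str] = []
    start = 0
    for fraction in wave_fractions:
        # Each fraction is the cumulative share of the fleet covered so far.
        end = min(len(hosts), max(start + 1, round(len(hosts) * fraction)))
        wave = hosts[start:end]
        if not wave:
            break
        # deploy() returns True when the host is healthy after the update.
        failures = sum(0 if deploy(host) else 1 for host in wave)
        reached.extend(wave)
        if failures / len(wave) > max_failure_rate:
            return reached, False  # halt: later waves never get the update
        start = end
    return reached, start >= len(hosts)


fleet = [f"host-{i}" for i in range(100)]
waves = [0.01, 0.1, 1.0]  # 1 percent, then 10 percent, then everyone

# A faulty update that breaks every machine stops after the first wave.
reached, completed = staggered_rollout(fleet, waves, lambda host: False)
print(len(reached), completed)  # 1 False
```

In this toy run, an update that crashes every machine it touches reaches one host out of 100 before the rollout halts, rather than the entire fleet at once.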
However, some analysts and engineers think the company should have put such measures in place from the get-go. "CrowdStrike must have been aware that these updates are interpreted by the drivers and could lead to problems," engineer Florian Roth posted on X. "They should have implemented a staggered deployment strategy for Rapid Response Content from the start."