Ooh this one is cool. Incidents - most of what you thought you knew is probably wrong. Root cause analysis? Not meaningful. MTTR? Not a great way to measure distributed systems because averages.
Everything except for DNS. It was DNS
Companies willing to create and share incident reports
The Verica Open Incident Database (VOID) makes public software-related incident reports available to everyone, increasing understanding of software-based failures in order to make the internet a more resilient and safe place. After scrutinizing nearly 10,000 incidents, one thing is crystal clear: Resilience saves time. Taking the time to understand how to better respond when something green turns red—learning from the people, the processes, and the systems—will make your next incident smoother.
Duration Isn't Cut and Dry
Duration of incidents conveys little meaning about the incidents themselves, in part because it can be very tricky to attribute when incidents start or stop.
It's Time To Retire MTTR
Mean Time to Resolve (MTTR) isn’t a viable metric for the reliability of complex software systems for a myriad of reasons, particularly because averages of duration data lie.
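To see why an average of duration data can mislead, here is a minimal sketch. The durations are made-up numbers for illustration, not VOID data, and the `summarize` helper is hypothetical. It assumes incident durations are skewed, with many short incidents and a rare long one, and compares the mean with the median.

```python
from statistics import mean, median

# Hypothetical incident durations in minutes (illustrative only, not VOID data).
# Nine short incidents and one very long one: a skewed distribution.
durations = [5, 8, 10, 12, 15, 20, 25, 30, 45, 600]


def summarize(values):
    """Return the mean, the median, and the share of values below the mean."""
    avg = mean(values)
    below = sum(1 for v in values if v < avg)
    return {
        "mean": avg,
        "median": median(values),
        "share_below_mean": below / len(values),
    }


summary = summarize(durations)
print(summary)
```

In this toy data set a single long incident pulls the mean far above what most incidents looked like, so the "mean time" describes almost none of them. The median moves much less when the outlier is added or removed.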
Duration and Severity Aren't Related
We found that duration and severity are not correlated—companies can have long or short incidents that are very minor, existentially critical, and nearly every combination in between.
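One way a team might check this on its own incident data is a rank correlation. The text above does not say how the relationship was tested, so the choice of Spearman's coefficient here is an assumption, and the sample data and function names are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Assign 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical incidents: duration in minutes, severity on a 1 (minor) to 4 (critical) scale.
durations = [12, 240, 35, 600, 8, 90, 45, 15]
severities = [4, 1, 2, 3, 1, 4, 2, 3]

rho = spearman(durations, severities)
print(f"Spearman rho: {rho:.2f}")
```

A coefficient near zero would be consistent with the finding above, while a value near 1 or -1 would suggest that duration and severity move together in that particular data set. With only a handful of incidents the estimate is noisy, so it is a prompt for closer reading of the incidents rather than a conclusion.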
Root Cause Analysis Is On The Decline
Despite adding four times the number of incidents in 2022, the number of RCA-based reports didn't increase proportionally. We even saw a move away from RCA in large enterprise organizations, as they embrace more in-depth analyses.
What People Are Saying About the VOID
The VOID report challenges the “Old View” in what many technology organizations deem as the gold standard for incidents, such as: duration of incidents, MTTR, and Root Cause Analysis. Instead, we can embrace a “New View” that includes learning from incidents beyond just fixing them, deeper and broader incident analysis, humans as the superpower of systems, and an increased focus on successes versus failures when analyzing incidents.
Chad Todd
SRE Manager, Crowdstrike
If you aren't recording and publishing incidents because you want to look good, then you are more likely to have a much bigger failure. This report raises some interesting questions, how can we measure near-misses, and can we find a better metric than Mean Time To Repair (MTTR) given the complex partial failure modes we see? I encourage everyone to publish more, include near misses in your incident reports, and to help everyone else build a safer world as a result.
Adrian Cockcroft
Partner, OrionX & Tech Advisor
The VOID report marks a remarkable advancement in how our community will look at and fix incidents moving forward. Upon seeing the emerging key findings of the report, Jeli was excited to support this research across these large datasets. Through extrapolating the key findings of the report, we are all able to build more resilient systems with greater collaboration.
Nora Jones
CEO, Jeli
The VOID report represents a great step forward for the IT industry. It is both a demonstration that numerous organizations are transforming their approach to post-incident learning, and an inspiring call for others to recognize the importance of this New Way of looking at incidents. I love the rigorous critique of MTTR, as well as the practical alternatives suggested by the report.
David Leigh
Distinguished Engineer, IBM
Reading that companies are ditching Root Cause Analysis in the same report as we get a fantastic analysis of MTTR fallacies really gave me, a professional pessimist, optimism for the future.
Clint Byrum
Staff Engineer, Spotify
If you loved Accelerate and the DORA Report, this will be right up your alley: a long-overdue, open-sourced data dump of real outages. Yours. Ours. Companies big and small have contributed their outage reports to seed this repo of what really happens when things go sideways.
Honeycomb
The VOID report is the first industry-wide analysis of the state of software reliability today—in fact, it is the closest thing we have to a 'State of the Union' address. Everyone who designs and operates software systems should read it.
Engineer, Stanza Systems
The VOID project is one of the most significant steps we can take as an industry to improve our operations and safety. This report sets up solid bases for many organizations and practitioners to turn their outage review practices towards more impactful and learning-centric views.
Staff SRE, Honeycomb
The VOID report is an outstanding broad view of patterns in incidents across many organizations. I'm looking forward to the database growing and lending itself to even more research and insights.
Štěpán Davidovič
Senior Staff SRE, Google
The VOID Report is one of those rare and delightful moments of active thought. It takes a given subject matter, in this case claims about incidents in software, as serious and worthy of in-depth consideration. And through a close examination it finds that something doesn't quite make sense. That critique provides an opening for thought, and the sloughing off of received dogma. It's a wonderful example of critical thinking.
Nick Travaglini
Technical CSM, Honeycomb
As SREs we spend a lot of time thinking about incidents, trying to learn from them and understand our world better. The VOID report gives us well-researched data so we can see clearer, and help our organizations learn from our peers across the industry.
Amy Tobey
Senior Principal Engineer, Equinix