Real-world incident data from the FireHydrant platform. Rock on.
This report is based on 53,034 incidents resolved on the FireHydrant platform between 2019 and 2022.
Data points have been anonymized and adjusted to ensure that no one company or set of incidents skewed the results.
In the details
The when and what of incidents
Incidents by company size
Size matters when it comes to the average number of incidents. We found a large difference in the number of incidents between small- and medium-sized companies and larger ones.
Small (0-599 employees): 10 incidents/month
Medium (600-2499 employees): 22 incidents/month
Large (2500-6000 employees): 49 incidents/month
Enterprise (6000+ employees): 37 incidents/month
Incidents by day and time
Most of the incidents we analyzed occurred mid-week (on Tuesdays, Wednesdays, and Thursdays) between 11 a.m. and 2 p.m. ET. The least likely times for an incident to occur were between 7 and 9 p.m. and on the weekends.
Sun: 1,845 incidents
Mon: 9,225 incidents
Tue: 9,931 incidents
Wed: 11,170 incidents
Thu: 10,127 incidents
Fri: 8,490 incidents
Sat: 2,246 incidents
But one day and time stood out from the rest when it came to the likelihood of an incident occurring: Wednesday at 1 p.m. ET.
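As a quick check on the day-of-week figures, the sketch below adds up the counts listed above and works out what share of incidents fell mid-week versus on the weekend. The counts are copied from this report; the variable and function names are illustrative only.

```python
# Incident counts by weekday, copied from the report.
incidents_by_day = {
    "Sun": 1845,
    "Mon": 9225,
    "Tue": 9931,
    "Wed": 11170,
    "Thu": 10127,
    "Fri": 8490,
    "Sat": 2246,
}

total = sum(incidents_by_day.values())


def share(days):
    """Return the fraction of all incidents that fell on the given days."""
    return sum(incidents_by_day[day] for day in days) / total


midweek_share = share(["Tue", "Wed", "Thu"])
weekend_share = share(["Sat", "Sun"])
busiest_day = max(incidents_by_day, key=incidents_by_day.get)

print(f"total incidents: {total}")
print(f"mid-week share:  {midweek_share:.1%}")
print(f"weekend share:   {weekend_share:.1%}")
print(f"busiest day:     {busiest_day}")
```

The daily counts sum to the 53,034 incidents the report is based on, with a little under 60% of them landing on Tuesday through Thursday.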
Incidents per Hour
Incidents by severity level
We found that low-severity incidents made up the largest share of incidents across all company sizes.
Low (Sev4 + Sev5): 42%
Medium (Sev3 + Unset): 31%
High (Sev1 + Sev2): 27%
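If you want to group your own incidents the same way, the bucketing above can be sketched as a simple lookup. The grouping (Sev1 + Sev2 as High, Sev3 + Unset as Medium, Sev4 + Sev5 as Low) comes from this report; the label spellings such as "SEV1" and the function names are assumptions for illustration, not a FireHydrant API.

```python
# Bucket mapping taken from the report. The raw label strings
# ("SEV1", "UNSET", ...) are assumed spellings, not a FireHydrant API.
SEVERITY_BUCKETS = {
    "SEV1": "High",
    "SEV2": "High",
    "SEV3": "Medium",
    "UNSET": "Medium",
    "SEV4": "Low",
    "SEV5": "Low",
}


def bucket_for(severity):
    """Map a raw severity label to Low, Medium, or High.

    A missing (None) or blank severity is treated as unset.
    """
    label = (severity or "").strip().upper() or "UNSET"
    return SEVERITY_BUCKETS[label]


def bucket_shares(severities):
    """Return the fraction of incidents that fall in each bucket."""
    counts = {"Low": 0, "Medium": 0, "High": 0}
    for severity in severities:
        counts[bucket_for(severity)] += 1
    total = sum(counts.values())
    if total == 0:
        return {bucket: 0.0 for bucket in counts}
    return {bucket: count / total for bucket, count in counts.items()}


# Made-up sample data, for illustration only.
sample = ["SEV4", "sev5", "SEV3", "SEV1"]
print(bucket_shares(sample))
```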
The mean time to resolution (MTTR) across all incidents was just over 24 hours. We were surprised to find that there wasn't a large difference in MTTR between high-severity and low-severity incidents: just 30 minutes.
24 hrs 05 mins
Average time to resolve incidents
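To benchmark your own numbers against this figure, MTTR can be computed as the average of each incident's resolution time minus its start time. The report doesn't say exactly which timestamps it uses, so the start and resolved times below are an assumption, and the two sample incidents are made up for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Average the (resolved - started) duration over all incidents.

    `incidents` is a list of (started, resolved) datetime pairs.
    """
    durations = [resolved - started for started, resolved in incidents]
    if not durations:
        raise ValueError("need at least one incident to compute MTTR")
    return sum(durations, timedelta()) / len(durations)


def format_hours_minutes(delta):
    """Render a duration as whole hours and minutes."""
    total_minutes = int(delta.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hrs {minutes:02d} mins"


# Hypothetical incidents: one resolved in 25 hours, one in 23 hours 10 minutes.
sample_incidents = [
    (datetime(2022, 3, 1, 9, 0), datetime(2022, 3, 2, 10, 0)),
    (datetime(2022, 3, 5, 14, 0), datetime(2022, 3, 6, 13, 10)),
]

mttr = mean_time_to_resolution(sample_incidents)
print(format_hours_minutes(mttr))
```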
Response ready
The who and how of incident response
Responders and roles
Although the average responder team size varied based on incident severity — with 8 responders on high-severity incidents and 5.75 on low-severity ones — we found that there’s a magic number when it comes to responders.
6
Responders
MTTR increased by 18% when the team grew by even one responder, and that held across all severities.
But it’s not enough just to have the right number of responders on the team — they need to understand their job during the incident. We found that assigning roles to responders during high-severity incidents made a sizable improvement in MTTR.
0%
decrease in MTTR
when roles are assigned
Key takeaway? It’s not just about getting the correct number of people in the room, it’s about ensuring that they understand what’s expected of them during an incident. Document the roles and expectations for your incident response process, then make sure everyone understands the requirements before an incident occurs.
Service catalog
When teams use a service catalog, they’re able to more quickly bring in the subject matter expert or owner of the affected service during an incident. No surprise here — the incidents that had services attached saw a decrease in MTTR.
0%
decrease in MTTR when a service catalog is used
Key takeaway? As with roles, we found that it’s not just about getting the right number of people in the room, it’s about getting the right level of expertise in the room. When you attach services to your incident response plan, you can do this faster, ultimately making a noticeable difference in MTTR.
Communication preferences
We were surprised to see that across incidents of the same severity level, a conference bridge didn’t decrease MTTR or have a major effect on the number of chat messages sent.
Average number of chat messages
Hi sev, with bridge: 61 messages
Hi sev, without bridge: 67 messages
Lo sev, with bridge: 30 messages
Lo sev, without bridge: 37 messages
Incidents with a conference bridge attached vs. not
Hi sev, with bridge: 63% of incidents, 26 hrs 56 mins MTTR
Hi sev, without bridge: 37% of incidents, 24 hrs 10 mins MTTR
Lo sev, with bridge: 60% of incidents, 25 hrs 9 mins MTTR
Lo sev, without bridge: 40% of incidents, 22 hrs 52 mins MTTR
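To put the bridge comparison in concrete terms, the sketch below subtracts the MTTR without a bridge from the MTTR with one at each severity level. It assumes the figures above are listed in "with bridge, without bridge" order; the names are illustrative only.

```python
from datetime import timedelta

# MTTR values from the report, assuming each pair is listed as
# (with bridge, without bridge).
mttr_by_severity = {
    "high": {
        "with_bridge": timedelta(hours=26, minutes=56),
        "without_bridge": timedelta(hours=24, minutes=10),
    },
    "low": {
        "with_bridge": timedelta(hours=25, minutes=9),
        "without_bridge": timedelta(hours=22, minutes=52),
    },
}


def bridge_gap(severity):
    """Return how much longer incidents with a bridge took to resolve."""
    row = mttr_by_severity[severity]
    return row["with_bridge"] - row["without_bridge"]


for severity in ("high", "low"):
    gap_minutes = int(bridge_gap(severity).total_seconds() // 60)
    hours, minutes = divmod(gap_minutes, 60)
    print(f"{severity} sev: bridge incidents took {hours} hrs {minutes} mins longer")
```

Under that reading, incidents with a bridge took a little over two hours longer to resolve at both severity levels. This is a comparison of averages, not evidence that the bridge itself caused the delay.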
Key takeaway? Focus on chat during the incident. In fact, many teams choose to create a channel per incident and use it as an artifact for the retro. If you do choose to use a conference bridge, be selective about who you bring in and be clear about what is or is not happening during the response effort. Mid-incident isn’t the time to start talking about long-term improvements.
Retrospectives
When it came to how often retrospectives were held, there was some work to do. More teams held retros for high-severity incidents than for low-severity ones, but even then, we saw lots of room for improvement.
42%
high-severity incidents that completed a retro
29%
low-severity incidents that completed a retro
Key takeaway? We have a long way to go as an industry when it comes to regularly holding retros, but we think they’re a valuable tool in the quest for reliability. Holding retros is a surefire way to kickstart learnings from your incidents, which you eventually invest back in your systems.
Trend spotlight
What can we expect in 2023?
More lower-severity incidents
We saw a large increase in the number of incidents overall but an especially high increase when it comes to low-severity incidents. As incident management becomes about not just more quickly resolving incidents but also learning from them, more teams are being mindful of catching all of their incidents, not just the major ones.
107% more high-severity incidents
163% more low-severity incidents
Put it in practice: Lower-severity incidents can give you a temperature check on the health of your internal systems, helping you identify small problems before big ones occur. Consider creating a new “investigation” severity level that gives responders the space to document and research a low-impact issue without sounding all the alarms.
More services
We saw a mega increase in the number of services created over the course of 2022. We think this reflects the rise of the “you build it, you own it” mentality, which bodes well for incident management. The faster you can get the right people in the room, the faster you can resolve.
0%
increase in the number of services created
Put it in practice: The ultimate goal here might be a fully fleshed-out service catalog that includes dependencies, owners, and links to operational documentation. To start, though, keep it simple: declare ownership around product areas. Each product area should have an engineering team associated with it, and those teams should be trained on your incident response process. Set up your process so that when an incident is declared and you find out what’s broken, a member of the corresponding team is alerted.
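A starter version of that ownership map can be very small. The sketch below is one possible shape, not a FireHydrant feature: the product areas, team names, on-call aliases, and the fallback responder are all hypothetical placeholders.

```python
# Hypothetical ownership map: product area -> owning team and who to alert.
# Every name here is a placeholder for illustration.
OWNERS = {
    "checkout": {"team": "payments-eng", "on_call": "payments-oncall"},
    "search": {"team": "discovery-eng", "on_call": "discovery-oncall"},
    "login": {"team": "identity-eng", "on_call": "identity-oncall"},
}

# Who gets alerted when the broken area has no declared owner yet (assumed).
FALLBACK_RESPONDER = "incident-commander"


def responder_for(product_area):
    """Return the on-call alias to alert for the affected product area."""
    owner = OWNERS.get(product_area.strip().lower())
    if owner is None:
        return FALLBACK_RESPONDER
    return owner["on_call"]


print(responder_for("Checkout"))
print(responder_for("billing"))
```

Keeping a fallback responder means an incident in an unowned area still reaches someone, and each fallback hit is a hint about which area needs an owner next.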
More retros
We also spotted a big year-over-year jump in the number of retrospectives that teams average per month. We think that’s tied to an increase in awareness of the value of incidents as learning opportunities.
0%
year-over-year increase in average retros per month by company in 2022
Put it in practice: Contrary to popular belief, the retro isn’t only for high-priority incidents. By skipping the retro, you could be leaving insights about your systems, product, people, and processes on the table. Instead, consider right-sizing the retro for the incident. Incorporate lighter retros that can be done async or with a smaller team. And keep having them! A culture of learning isn’t built overnight.
More external updates
When you’re known for handling your tough moments well, you build trust among your customers. And based on the increase we saw in incidents with status pages attached and the number of updates posted to status pages, it looks like others are starting to feel the same way.
136%
increase in incidents with a status page attached
366%
increase in the number of updates posted to status pages
Put it in practice: For communication to be truly effective, it needs to be accounted for in your incident response plan. Get a status page if you don’t already have one, create communication templates, set a cadence you’ll stick to, document it all, and then set up reminders to send updates. It’s tempting to only concentrate on resolving the incident, but good communication buys you a lot of grace when things go wrong.
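Setting a cadence and reminders can be as simple as generating the times at which the next status page updates are due. This is a minimal sketch under assumed values: the 30-minute cadence, the timestamps, and the function name are illustrative, not recommendations from this report.

```python
from datetime import datetime, timedelta


def update_reminders(declared_at, cadence, until):
    """Return the times at which a status page update is due.

    The first update is due one cadence interval after the incident is
    declared, and reminders repeat until `until` (inclusive).
    """
    if cadence <= timedelta(0):
        raise ValueError("cadence must be a positive duration")
    reminders = []
    due = declared_at + cadence
    while due <= until:
        reminders.append(due)
        due += cadence
    return reminders


# Hypothetical incident declared at 1 p.m. with an update every 30 minutes.
declared = datetime(2022, 6, 1, 13, 0)
reminders = update_reminders(
    declared_at=declared,
    cadence=timedelta(minutes=30),
    until=datetime(2022, 6, 1, 14, 30),
)

for due in reminders:
    print(due.strftime("%H:%M"), "post a status page update")
```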
Go deeper
Join us February 8 for the webinar Proving ROI: How to evaluate and improve how you manage incidents. Learn what metrics you should monitor, common benchmarks, and how to show improvements and prove ROI.