Incident benchmark report | FireHydrant

Vendor Sponsor

FireHydrant

Research Published

March 29, 2023

Link to research

https://firehydrant.com/reports/incident-benchmarks/

Description

Real-world incident data from the FireHydrant platform. Rock on.

Demographic or Methodology comments

Topic Tags

Incidents, SRE - Site Reliability Engineering

Sample

SaaS Platform Data

Data Source

Product Data

Demographics

Created time

Apr 29, 2023 5:57 AM

Directory name

The Rightstack Research DB

Data Download

This report is based on 53,034 incidents resolved on the FireHydrant platform between 2019 and 2022.

Data points have been anonymized and adjusted to ensure that no one company or set of incidents skewed the results.

In the details

The when and what of incidents

Incidents by company size

Size matters when it comes to the average number of incidents. We found a large difference in the number of incidents between small- and medium-sized companies and larger ones.

Small (0-599 employees): 10/month
Medium (600-2499 employees): 22/month
Large (2500-6000 employees): 49/month
Enterprise (6000+ employees): 37/month

Incidents by day and time

Most of the incidents we analyzed occurred mid-week, on Tuesdays, Wednesdays, and Thursdays, between the hours of 11 a.m. and 2 p.m. ET. The least likely times for an incident to occur were between 7 and 9 p.m. and on weekends.

Sun: 1,845 incidents
Mon: 9,225 incidents
Tue: 9,931 incidents
Wed: 11,170 incidents
Thu: 10,127 incidents
Fri: 8,490 incidents
Sat: 2,246 incidents
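As a quick check on the daily figures above, the counts can be totaled and converted to shares. This is a minimal sketch that uses only the per-day counts listed in this report; the variable and function names are illustrative.

```python
# Incident counts per weekday, as listed in the report.
incidents_by_day = {
    "Sun": 1845,
    "Mon": 9225,
    "Tue": 9931,
    "Wed": 11170,
    "Thu": 10127,
    "Fri": 8490,
    "Sat": 2246,
}

total = sum(incidents_by_day.values())


def share(days):
    """Return the percentage of all incidents that fell on the given days."""
    return 100 * sum(incidents_by_day[d] for d in days) / total


midweek_share = share(["Tue", "Wed", "Thu"])
weekend_share = share(["Sat", "Sun"])
busiest_day = max(incidents_by_day, key=incidents_by_day.get)

print(f"total incidents: {total}")
print(f"mid-week share: {midweek_share:.1f}%")
print(f"weekend share: {weekend_share:.1f}%")
print(f"busiest day: {busiest_day}")
```

The total comes to 53,034, which matches the number of incidents the report is based on. Tuesday through Thursday account for roughly 59% of incidents, and the weekend for roughly 8%.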

But there was one day and time that stood out among the rest when it came to the likelihood of an incident occurring: Wednesday at 1 p.m. ET.

Incidents per Hour

Incidents by severity level

We found that low-severity incidents made up the largest share of incidents across company sizes.

Low (Sev4 + Sev5): 42%
Medium (Sev3 + Unset): 31%
High (Sev1 + Sev2): 27%

The mean time to resolution (MTTR) across all incidents was just over 24 hours. We were surprised to find that MTTR for high-severity and low-severity incidents differed by just 30 minutes.
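For readers who want to reproduce this kind of comparison on their own data, here is a minimal sketch of computing MTTR overall and per severity bucket. The records below are made up for illustration, and the record layout (severity, opened, resolved) is an assumption; the report does not describe how FireHydrant computes MTTR internally.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: (severity bucket, opened, resolved).
incidents = [
    ("high", datetime(2022, 3, 1, 9, 0), datetime(2022, 3, 2, 10, 0)),
    ("high", datetime(2022, 3, 5, 13, 0), datetime(2022, 3, 6, 12, 30)),
    ("low", datetime(2022, 3, 7, 11, 0), datetime(2022, 3, 8, 11, 30)),
    ("low", datetime(2022, 3, 9, 14, 0), datetime(2022, 3, 10, 13, 0)),
]


def mttr_hours(records):
    """Mean time to resolution, in hours, over (severity, opened, resolved) records."""
    return mean(
        (resolved - opened).total_seconds() / 3600
        for _, opened, resolved in records
    )


def mttr_by_severity(records):
    """Group records by severity bucket and compute MTTR for each group."""
    buckets = {}
    for record in records:
        buckets.setdefault(record[0], []).append(record)
    return {severity: mttr_hours(group) for severity, group in buckets.items()}


print(f"overall MTTR: {mttr_hours(incidents):.2f} h")
for severity, hours in mttr_by_severity(incidents).items():
    print(f"{severity} MTTR: {hours:.2f} h")
```

With this sample data the overall MTTR is 24 hours and the two buckets differ by half an hour, chosen to echo the shape of the finding above rather than to reproduce the report's data.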

Average time to resolve incidents

Response ready

The who and how of incident response

Responders and roles

Although the average responder team size varied based on incident severity — with 8 responders on high-severity incidents and 5.75 on low-severity ones — we found that there’s a magic number when it comes to responders.

Responders

MTTR increased by 18% when the number of responders jumped up by even one responder — and that’s across all severities.

But it’s not enough just to have the right number of responders on the team — they need to understand their job during the incident. We found that assigning roles to responders during high-severity incidents made a sizable improvement in MTTR.

decrease in MTTR when roles are assigned

Key takeaway? It’s not just about getting the correct number of people in the room, it’s about ensuring that they understand what’s expected of them during an incident. Document the roles and expectations for your incident response process, then make sure everyone understands the requirements before an incident occurs.

Service catalog

When teams use a service catalog, they’re able to more quickly bring in the subject matter expert or owner of the affected service during an incident. No surprise here — the incidents that had services attached saw a decrease in MTTR.

decrease in MTTR when a service catalog is used

Key takeaway? Similarly to roles, we found that it’s not just about getting the right number of people in the room, it’s about getting the right level of expertise in the room. When you attach services to your incident response plan, you can do this faster, ultimately making a noticeable difference in MTTR.

Communication preferences

We were surprised to see that across incidents of the same severity level, a conference bridge didn’t decrease MTTR or have a major effect on the number of chat messages sent.

Average number of chat messages

Hi sev, with bridge: 61 messages
Hi sev, without bridge: 67 messages
Lo sev, with bridge: 30 messages
Lo sev, without bridge: 37 messages

Incidents with a conference bridge attached vs not

Hi sev, with bridge: 63% of incidents, 26hrs 56mins MTTR
Hi sev, without bridge: 37% of incidents, 24hrs 10mins MTTR
Lo sev, with bridge: 60% of incidents, 25hrs 9mins MTTR
Lo sev, without bridge: 40% of incidents, 22hrs 52mins MTTR
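To make the MTTR gap concrete, the sketch below subtracts the without-bridge MTTR from the with-bridge MTTR for each severity. It assumes the figures above pair up in the order they are listed (the first value in each pair belongs to "with bridge"), which is consistent with the finding that a bridge did not decrease MTTR. The names are illustrative.

```python
from datetime import timedelta

# MTTR figures from the report, keyed by (severity, bridge attached?).
# Assumption: the first value of each listed pair is the "with bridge" figure.
mttr = {
    ("high", True): timedelta(hours=26, minutes=56),
    ("high", False): timedelta(hours=24, minutes=10),
    ("low", True): timedelta(hours=25, minutes=9),
    ("low", False): timedelta(hours=22, minutes=52),
}


def bridge_gap(severity):
    """How much longer MTTR was with a bridge than without, for one severity."""
    return mttr[(severity, True)] - mttr[(severity, False)]


for severity in ("high", "low"):
    gap_minutes = int(bridge_gap(severity).total_seconds() // 60)
    print(f"{severity}: +{gap_minutes // 60}h {gap_minutes % 60}m with a bridge")
```

Under that reading, incidents with a bridge took 2 hours 46 minutes longer to resolve at high severity and 2 hours 17 minutes longer at low severity. This is a correlation in the reported averages, not evidence that the bridge itself caused the delay.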

Key takeaway? Focus on chat during the incident. In fact, many teams choose to create a channel per incident and use it as an artifact for the retro. If you do choose to use a conference bridge, be selective about who you bring in and be clear about what is or is not happening during the response effort. Mid-incident isn’t the time to start talking about long-term improvements.

Retrospectives

When it came to how often retrospectives were held, there was some work to do. More teams held retros for high-severity incidents than lower ones, but even then, we see lots of room for improvement.

42%

high-severity incidents that completed a retro

29%

low-severity incidents that completed a retro

Key takeaway? We have a long way to go as an industry when it comes to regularly holding retros, but we think they’re a valuable tool in the quest for reliability. Holding retros is a surefire way to kickstart learnings from your incidents, which you eventually invest back in your systems.

Trend spotlight

What can we expect in 2023?

More lower-severity incidents

We saw a large increase in the number of incidents overall but an especially high increase when it comes to low-severity incidents. As incident management becomes about not just more quickly resolving incidents but also learning from them, more teams are being mindful of catching all of their incidents, not just the major ones.

107% more high-severity incidents

163% more low-severity incidents

Put it in practice: Lower-severity incidents can give you a temperature check on the health of your internal systems, helping you identify small problems before big ones occur. Consider creating a new “investigation” severity level that gives responders the space to document and research a low-impact issue without sounding all the alarms.

More services

We saw a mega increase over the course of 2022 in the number of services created. We think this is a reflection of the rise in “you build it, you own it” mentality that bodes well for incident management. The faster you can get the right people in the room, the faster you can resolve.

increase in the number of services created

Put it in practice: The ultimate goal here might be a fully fleshed out service catalog that includes dependencies, owners, and links to operation documentation. To start though, keep it simple — declare ownership around product areas. Each product area should have an engineering team associated with it, and those teams should be trained on your incident response process. Set up your process so when an incident is declared, and you find out what’s broken, a member of the corresponding team is alerted.
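The "keep it simple" starting point described above can be sketched as a plain lookup from product area to owning team. Everything in this example is hypothetical: the product areas, team names, contact handles, and the fallback team are placeholders, and this is not FireHydrant's service catalog API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Team:
    name: str
    on_call: str  # hypothetical contact handle for whoever gets alerted


# Hypothetical product areas mapped to the engineering team that owns them.
OWNERS = {
    "checkout": Team("payments-eng", "@payments-oncall"),
    "search": Team("discovery-eng", "@discovery-oncall"),
}

# Hypothetical fallback for areas nobody has claimed yet.
FALLBACK = Team("incident-commanders", "@ic-oncall")


def responder_for(product_area: str) -> Team:
    """Return the team to alert for an affected product area, or the fallback if unowned."""
    return OWNERS.get(product_area.strip().lower(), FALLBACK)


for area in ("Checkout", "billing"):
    team = responder_for(area)
    print(f"{area}: alert {team.on_call} ({team.name})")
```

An explicit fallback keeps unowned areas from going unalerted, and each fallback hit is a signal that an ownership entry is missing.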

More retros

We also spotted a big year-over-year jump in the number of retrospectives that teams average per month. We think that’s tied to an increase in awareness of the value of incidents as learning opportunities.

year-over-year increase in average retros per month by company in 2022

Put it in practice: Contrary to popular belief, the retro isn’t only for high-priority incidents. By skipping the retro, you could be leaving insights about your systems, product, people, and processes on the table. Instead, consider right-sizing the retro for the incident. Incorporate lighter retros that can be done async or with a smaller team. And keep having them! A culture of learning isn’t built overnight.

More external updates

When you’re known for handling your tough moments well, you build trust among your customers. And based on the increase we saw in incidents with status pages attached and the number of updates posted to status pages, it looks like others are starting to feel the same way.

136%

increase in incidents with a status page attached

366%

increase in the number of updates posted to status pages

Put it in practice: For communication to be truly effective, it needs to be accounted for in your incident response plan. Get a status page if you don’t already have one, create communication templates, set a cadence you’ll stick to, document it all, and then set up reminders to send updates. It’s tempting to only concentrate on resolving the incident, but good communication buys you a lot of grace when things go wrong.

Go deeper

Join us February 8 for [Webinar] Proving ROI: How to evaluate and improve how you manage incidents. Learn what metrics you should monitor, common benchmarks, and how to show improvements and prove ROI.
