library(quantrr)
library(purrr)
library(tibble)
library(dplyr)
Measuring Incident Duration
Analysis to determine thresholds of sample size to reliably detect changes in incident impact (duration) using Monte Carlo simulation.
Questions/TODO
Background
In 2021, Google published a report exploring the limitations of using Mean Time to Resolve (MTTR), Incident Metrics in SRE: Critically Evaluating MTTR and Friends. Since then, additional articles and reports have been published that are also critical of MTTR and TTR in general, explicitly advocating that “organizations should stop using MTTX, or TTx data in general”. From my experience with cybersecurity incident data, I believe it is a mistake to abandon measurements of incidents entirely; instead, we should ask: under what conditions does incident duration indicate improvement?
Key observations and arguments from the 2021 Google paper include:
- Incident duration appears to be lognormal (the paper did not test if the lognormal distribution was the best fit)
- In a Monte Carlo simulation comparing two samples of incidents, one with no change in duration and a second with a 10% reduction, MTTR did not reliably indicate the improvement
- Improvements in MTTR can be observed even when there is no change in the underlying distribution
- Larger sample sizes (more incidents) make detecting the change more reliable; in the original simulation, the sample sizes ranged from approximately 50 to 300
- Median and 95th percentile don’t work well either
- Geometric mean does work, but only with a sample size of 500+
- Measuring incident duration doesn’t account for changes in incident frequency
The article also calls out some exceptions where measuring duration is effective: first, when your data is large enough, as with Backblaze statistics on hard disk reliability, and second, when the change in duration is large enough, such as an 80% reduction in duration.
Subsequent reports from VOID referenced the Google paper and argued:
- Incident duration is long-tailed and appears lognormal (2021, 2022)
- “MTTx are Shallow Data”, an oversimplification of a complex, messy event (2021)
- Organizations should stop using MTTx and TTx data (2021, 2022, 2024)
- Incident duration appears lognormal for some organizations, but not for others (2022)
- Replicated the Monte Carlo simulation from the Google paper (2022)
In the cybersecurity domain, the Cyentia Institute has published reports analyzing breach data, looking at both frequency and impact (cost). Their IRIS 2022 report found that, like outage incident duration, security breach losses closely follow a lognormal distribution. Like the Google and VOID reports, the IRIS also points out that the mean is a poor indicator of a ‘typical’ loss, that the median or geometric mean are better summaries, and that organizations should also be concerned with tail risk: extreme events at the 90th, 95th, and 99th percentiles.
Additionally, despite the challenges with using lognormal data, the 2024 Ransomware Report showed that losses from ransomware have significantly increased between 2019 and 2023.
Analysis
While I agree that the mean (average) is misleading and should not be used with incident impact data (duration or costs), I believe it is a mistake to stop using impact data entirely. A Monte Carlo simulation can help answer the underlying question raised by the Google research, “under what conditions does incident duration indicate improvement?” Put differently, how large of a sample size is needed to detect a given level of improvement (reduction) in incident duration?
Assumptions
While the 2022 VOID Report challenged the notion that incident duration is lognormal, the Google report and my own experience suggest otherwise: similar to cybersecurity impact, incident duration closely follows a lognormal distribution. In my own work, I have found that incidents that are qualitatively different may not be lognormal, and that could well be what was happening in the VOID report, but we lack the context to know for sure. To simplify the analysis, the incident frequency will be held constant. (Side note: Cyentia found that cybersecurity breach frequency followed a Poisson lognormal distribution)
Approach
To determine the threshold of incidents needed to detect changes in duration, we follow a similar methodology as the Google paper:
- Randomly sample from an artificial but representative lognormal distribution of incident duration
- Adjust the duration of one of the populations by a set percentage to simulate improvement
- Determine if there is a statistically significant improvement in incident duration
For step 3, we use the geometric mean and a two-sample t-test on the log-transformed samples (R’s t.test, which defaults to Welch’s test). From my research and understanding, this is a reasonable approach if the underlying distributions are truly lognormal, which is true here by design.
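The gmean() function used throughout is not defined in this post; presumably it comes from the quantrr package loaded above. For readers without it, a minimal stand-in (my own sketch, assuming strictly positive durations) is just the exponentiated mean of the logs:
# Minimal geometric mean (stand-in for the gmean() used below); assumes x > 0
gmean <- function(x) exp(mean(log(x)))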
To determine a representative distribution, we can borrow data from the Google report, which lists the p95 and p50 for three companies; we’ll infer p05 and the lognormal parameters using data from Company B:
lnorm_param(9, 528, 67)
$meanlog
[1] 4.23316
$sdlog
[1] 1.237761
$mdiff
[1] -0.02806642
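lnorm_param() is likewise not defined here and presumably comes from quantrr. Its output is consistent with fitting meanlog and sdlog from the 5th and 95th percentiles and reporting the relative difference between the implied median and the supplied one; a rough sketch under that assumption (not necessarily the package’s actual implementation) would be:
# Sketch: fit lognormal parameters from p05 and p95, then compare the implied median to p50
# (an assumption about what lnorm_param() computes, based on the output above)
lnorm_param_sketch <- function(p05, p95, p50) {
  meanlog <- (log(p05) + log(p95)) / 2                  # midpoint of the log-scale percentiles
  sdlog   <- (log(p95) - log(p05)) / (2 * qnorm(0.95))  # spread implied by the 5th-95th range
  list(meanlog = meanlog, sdlog = sdlog,
       mdiff = (p50 - exp(meanlog)) / exp(meanlog))     # relative gap between implied and given median
}
lnorm_param_sketch(9, 528, 67)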
Using trial and error, we find that 9 minutes is the closest match to p05 when the median is 67 minutes. From this, we can create a representative distribution:
meanlog <- 4.25
sdlog <- 1.25
qlnorm(c(0.05, 0.5, 0.95), meanlog = meanlog, sdlog = sdlog)
[1] 8.970424 70.105412 547.885889
This is reasonably close to the Company B data, and passes the sniff test: 5% of incidents resolved in less than 9 minutes, 50% resolved in 70 minutes or less, and 95% resolved under 548 minutes (9 hours 8 minutes).
Simulations
First, test at the 10% improvement level with a sample size (N) of 100, which is about 1 year’s worth of incidents. A single test is successful if the experimental group has a lower geometric mean and a significant t-test result (p < 0.05) on the log-transformed data.
single_test <- function(n, improvement) {
  control <- rlnorm(n, meanlog = meanlog, sdlog = sdlog)
  experiment <- rlnorm(n, meanlog = meanlog, sdlog = sdlog) * (1 - improvement)

  list(gmean(control), gmean(experiment), t.test(log(control), log(experiment)))
}
single_test(100, 0.1)
[[1]]
[1] 59.03588
[[2]]
[1] 61.86311
[[3]]
Welch Two Sample t-test
data: log(control) and log(experiment)
t = -0.24151, df = 197.93, p-value = 0.8094
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.4287449 0.3351876
sample estimates:
mean of x mean of y
4.078145 4.124924
For most runs at this sample size, the p-value is well above 0.05, so the test fails to detect the improvement. How often does the test succeed? We’ll set the bar slightly higher than the Google paper and target a 95% success rate.
We’ll use a slightly different function that is more easily converted into a data frame, and run only 10,000 trials to speed things up:
test_change <- function(n, improvement) {
  control <- rlnorm(n, meanlog = meanlog, sdlog = sdlog)
  experiment <- rlnorm(n, meanlog = meanlog, sdlog = sdlog) * (1 - improvement)

  c(
    control_gmean = gmean(control), experiment_gmean = gmean(experiment),
    p_value = t.test(log(control), log(experiment))$p.value
  )
}
simulate_test <- function(n, improvement, trials = 10000, p_value_target = 0.05) {
  map(1:trials, \(i) as_tibble_row(test_change(n, improvement))) |>
    list_rbind() |>
    mutate(
      success = (.data$control_gmean > .data$experiment_gmean) & (.data$p_value < p_value_target)
    )
}
results <- simulate_test(100, 0.1)
head(results, 10)
# A tibble: 10 × 4
control_gmean experiment_gmean p_value success
<dbl> <dbl> <dbl> <lgl>
1 89.5 68.8 0.132 FALSE
2 69.3 53.4 0.136 FALSE
3 80.3 68.9 0.389 FALSE
4 88.2 56.9 0.0141 TRUE
5 64.7 60.6 0.706 FALSE
6 63.7 78.1 0.250 FALSE
7 75.1 75.8 0.959 FALSE
8 83.3 69.3 0.266 FALSE
9 68.4 67.8 0.963 FALSE
10 72.6 60.2 0.289 FALSE
mean(results$success)
[1] 0.0846
For a sample size of 100, we successfully detect a statistically significant change about 8% of the time. We can try larger sample sizes:
simulate_test(1000, 0.1) |>
  pull(success) |>
  mean()
[1] 0.4682
simulate_test(2000, 0.1) |>
  pull(success) |>
  mean()
[1] 0.7617
simulate_test(4000, 0.1) |>
  pull(success) |>
  mean()
[1] 0.9644
A sample size of roughly 4000 per group is needed to reliably detect the 10% reduction at our 95% target. At about 100 incidents per year, a single organization would take about 80 years (40 before the change, 40 after) to gather that much data. However, the same volume of data could be gathered from 80 different organizations (40 in each group) in a single year.
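As a rough cross-check (my addition, not part of the simulation above), a classical two-sample power calculation on the log scale points to a similar order of magnitude: a 10% reduction shifts the log-mean by -log(0.9) ≈ 0.105, and with sdlog = 1.25, 5% significance, and 95% power, the required per-group sample size comes out a little under 3,700.
# Closed-form power check on the log scale; log-durations are normal by construction here
power.t.test(delta = -log(1 - 0.1), sd = 1.25, sig.level = 0.05, power = 0.95)$n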
We can also try larger changes:
simulate_test(100, 0.25) |>
  pull(success) |>
  mean()
[1] 0.3584
simulate_test(100, 0.50) |>
  pull(success) |>
  mean()
[1] 0.9752
Interestingly, while the Google paper notes that it would take a very large change (80% reduction) to be detected through measuring incident duration, we find that a 50% reduction is clearly detectable over two years of data.
For the 50% reduction, what is the range of the observed improvements?
simulate_test(100, 0.50) |>
  mutate(change = experiment_gmean / control_gmean) |>
  summary()
control_gmean experiment_gmean p_value success
Min. : 43.80 Min. :19.51 Min. :0.0000000 Mode :logical
1st Qu.: 64.42 1st Qu.:32.26 1st Qu.:0.0000068 FALSE:264
Median : 70.17 Median :35.12 Median :0.0001205 TRUE :9736
Mean : 70.71 Mean :35.34 Mean :0.0056374
3rd Qu.: 76.24 3rd Qu.:38.14 3rd Qu.:0.0014284
Max. :114.56 Max. :57.94 Max. :0.6738022
change
Min. :0.2616
1st Qu.:0.4435
Median :0.5001
Mean :0.5078
3rd Qu.:0.5641
Max. :0.9259
The observed improvements range from as little as a 7% reduction to roughly a 74% reduction, with half of the runs falling within about 6 percentage points of the true decrease (44-56%).
Discussion
So, what does this analysis tell us?
- MTTR (Mean Time to Resolve) is a misleading metric, and organizations should stop using it.
The Google, VOID, and Cyentia research all agree that the mean (average) is a poor summary metric for lognormal and other long-tailed distributions. I also agree, and that assumption is built into this analysis. That said, I agree with the Cyentia research that the geometric mean is a reasonable summary of the ‘typical’ incident, something the Google paper also suggests.
- Large samples are required to detect changes in underlying incident duration, making it less useful at an organizational level.
The Google paper found that sample sizes of 500+ were needed to detect a 10% reduction in the underlying distribution; this analysis suggests 4000 is a better threshold (a quick sweep over sample sizes, sketched below, is one way to check). Even at organizations with high incident volumes, this represents several years of data, enough that only large changes (a 50% or greater reduction in underlying TTR) will be detected in a reasonable time frame.
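For readers who want to locate that threshold themselves, here is a sketch of such a sweep, reusing simulate_test() from above (the grid of sample sizes is arbitrary, and fewer trials per point keeps the runtime down):
# Estimate the detection rate for a 10% reduction at several sample sizes
sample_sizes <- c(500, 1000, 2000, 4000)
map(sample_sizes, \(n) {
  tibble(n = n, success_rate = mean(simulate_test(n, 0.1, trials = 1000)$success))
}) |>
  list_rbind()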
- TTR (incident duration) can be a useful measure at a larger scale, including comparisons across different organizations.
While incident duration (TTR) is noisy at the level of an individual organization, it becomes meaningful at larger scales. Some organizations do experience repeatable failures at scale, including the Backblaze hard drive failure example cited in the Google paper, but TTR is more useful for comparisons across organizations. I believe this explains why the Accelerate State of DevOps Report (DORA) research is able to find meaningful correlations between “Failed deployment recovery time”, the other three key DORA metrics, and other measures of performance. The 2024 report surveyed 3,000 people, which should represent hundreds (or more) of different organizations. That appears to be well above the threshold needed to detect differences, especially given the coarse range of possible responses (1 hour, 1 day, 1 week, 1 month, 6 months, 1 year).
In the cybersecurity space, large (public) incidents happen relatively infrequently, about once every 3-10 years, depending on the size of the organization, according to the Cyentia IRIS report. Because of this, metrics on breaches gathered at the organizational level don’t provide meaningful insights, as I argued in Measuring Changes in Breach Rates. However, we’ve also learned that looking at data across organizations can and does provide useful learning. We can accomplish the same thing in the reliability/resilience space by using similar measures.
TTR is a useful measure for cross-company comparison precisely because it is a shallow measure that reflects the external consumer/user experience. One study I’d like to see is a deep-dive analysis of data from Downdetector, examining differences between groups or clusters of organizations (perhaps this has already been done!), which could identify common factors that contribute to success.