Mean Time to Recovery

Context

In a nutshell, Time to recover measures how long it takes to recover from failure. This metric analyzes how quickly your team can identify incidents and their root causes (e.g., a damaged database or a deployment that breaks an existing feature), notify the appropriate people to deal with those incidents, and resolve them fast.

Traditionally, reliability is measured as time between failures. However, in modern software products and services, which are rapidly changing complex systems, failure is inevitable, so the key question becomes: How quickly can service be restored?

Accelerate: The science of lean software and DevOps: Building and scaling high performing technology organizations

Description

  1. This metric will show the difference between creation of a top-priority defect in Task Tracking System (Jira/Azure Boards/Rally) and release date of a fix version this defect is assigned to. Top-priority defect usually indicates a failure, so, the time spent between creation of such defect and release date is a time to recover from production failure.

  2. Calculation logic selects all releases completed within interval & counts mean time to recover for all top-priority production defects assigned to those releases.

  3. This metric is based on Task Tracking System's (Jira, Azure Boards, or Rally) data. The main pre-requisites are

    1. to have actual fix versions with correct release dates;

    2. tickets (defects) being released should be assigned to a fix version;

    3. there should be specific criteria to distinguish production defects - environment, lalel, prefix in summary, etc.

    4. defect priorities should be set appropriately. 

  4. Median time is used while calculating value for interval.

  5. In case when ticket is assigned to multiple releases - we're picking the last release's date while calculating lead time for changes.

  6. Current interval starts from today - interval duration (1 month/quarter). It's done to make a smooth transition from previous interval to the current one (especially when month/quarter just started). Example: Today is 5 Jan. If release interval is a month, then Jan bar will contain releases from 5 Dec up till now.

Configuration

We have 5 variables in the code of this metric:

  • Number of intervals to show. Basically, how many bars do you want to see on the chart? Default is 9

  • Interval type. If you don't release too often you probably don't want to see gaps for some months. As an example, if there was no release during any month - you will not see lead time for changes for that month. In this case if you select quarter intervals, you will get median of all lead times for changes during that quarter. Default: month.

  • Production defects. Criteria for top-priority production defects.

Code

Here is the code of the metric:

with number_of_intervals_to_show as ( select 9 --how many time intervals to show on the chart as show ), interval_type as ( select 1 --1=month, 2=quarters as i_type ), production_defects as --production top-priority defects criteria ( select * from ticket where priority in ('Blocker','Critical') and (type='Incident' or (type='Bug' and ((environment ~* 'prod|live' and environment !~* 'preprod|pre-prod') or lower(summary) ilike any (array['prod%','[prod%']) or (labels::text ~* 'prod|live|client' and labels::text !~* 'pre-prod|preprod|postprod|post-prod') ))) ), interval as --preparing intervals - months or quarters ( select case when (select i_type from interval_type)=1 then to_char(i_finish, 'YYYY MM') else to_char(i_finish, 'YYYY "Q"Q') end as timeline, i_start, i_finish from ( select now() as i_finish, now()-interval '3 month' as i_start, 2 as type union select (date_trunc('quarter', now()) - interval '3' month* generate_series(0,(select show from number_of_intervals_to_show)-2))-interval '1 second' as i_finish, (date_trunc('quarter', now()) - interval '3' month* generate_series(0,(select show from number_of_intervals_to_show)-2))-interval '3 month' as i_start, 2 as type union select now() as i_finish, now()-interval '1 month' as i_start, 1 as type union select (date_trunc('month', now()) - interval '1' month* generate_series(0,(select show from number_of_intervals_to_show)-2))-interval '1 second' as i_finish, (date_trunc('month', now()) - interval '1' month* generate_series(0,(select show from number_of_intervals_to_show)-2))-interval '1 month' as i_start, 1 as type ) as temp where type=(select i_type from interval_type) ), releases as --selecting releases completed within intervals ( select * from release where finish_date>=(select min(i_start) from interval) ), tickets_releases as --collecting tickets released, counting their lead time for changes ( select b.key, b.type, b.created, b.priority, b.environment, b.story_points, b.summary, max(a.finish_date) as release_date --case when tickets assigned to multiple releases - picking the last one from releases a left join production_defects b on a.id=any(b.fix_releases) group by b.key, b.type, b.created, b.priority, b.environment, b.story_points, b.summary ) select timeline::varchar as timeline, (percentile_cont(0.5) WITHIN GROUP (ORDER BY DATE_PART('day', release_date-created)+DATE_PART('hour', release_date-created)/24.0)) as mean_time_to_recovery from interval left join tickets_releases on release_date>=i_start and release_date<=i_finish group by timeline order by timeline asc

RAG Thresholds

RAG thresholds:

  • RED: >7 days 

  • AMBER: 1-7 days

  • GREEN: <1 days

That corresponds to 2022 Google's State of DevOps Report: https://services.google.com/fh/files/misc/2022_state_of_devops_report.pdf

Drill-down

Drill-down query to show Key, Summary, Priority & Mean Time to Recovery for valid top-priority defects released within clicked interval:

select key as "Key", summary as "Summary", priority as "Priority", round((DATE_PART('day', release_date-created)+DATE_PART('hour', release_date-created)/24.0)::numeric, 2) as "Mean Time to Recovery, days" from interval join tickets_releases a on a.release_date>=i_start and a.release_date<=i_finish where timeline=clicked_x_value order by "Mean Time to Recovery, days" desc

Aggregation

If you have several teams and you want to see the Mean Time to Recovery across all your teams, here is an example of how to do that.

Aggregation dataset:

Aggregation metric's code:

select timeline, avg(mean_time_to_recovery) as "Average Mean Time to Recovery (days)" from DORA_Mean_Time_to_Recovery group by timeline order by timeline asc

As a result, you will see the average mean time to recovery for all your teams. 

In the drill-down you will have the mean time to recovery values for each team. Here is the code:

How to improve Mean Time to Recovery

If you're not in a GREEN zone for this metric, please consider the following ways to decrease your mean time to recovery:

  • Improve your organizational structure: the right people can quickly identify the problem’s root cause and resolve it promptly.

  • Use test automation: incorporating automated tests at every stage of the CI/CD pipeline can help you reduce delivery times and help you recover quicker.

  • Clearly document the incident management process: continuously train team members on the process and on how to react in case of incidents.

  • Find potential steps of the incident management process to be automated: any step your team needs to perform manually – assigning a responsible, or creating a document – adds to Time to recover.