Change Failure Rate

Context

Change failure rate measures the percentage of deployments causing a failure in production that requires remediation (e.g., hotfix, rollback, patch).

"A key metric when making changes to systems is what percentage of changes to production (including, for example, software releases and infrastructure configuration changes) fail. In the context of Lean, this is the same as percent complete and accurate for the product delivery process, and is a key quality metric."

Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations

Description

  1. This metric shows the percentage of hotfix releases among all releases completed within an interval (month or quarter). 

  2. This metric is based on data from your task tracking system (Jira, Azure Boards, or Rally). The main prerequisites are

    1. actual fix versions exist and have correct release dates;

    2. hotfix releases are marked appropriately, OR there is a way to distinguish top-priority production defects being released (depending on the calculation logic - see Configuration).

  3. The current interval starts at today minus the interval duration (1 month or 1 quarter). This makes the transition from the previous interval to the current one smooth (especially when a month or quarter has just started). Example: today is 5 Jan. If the interval type is month, the Jan bar will contain releases from 5 Dec up till now.
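The date arithmetic behind these intervals can be illustrated with a standalone Postgres sketch (the number of closed intervals here is arbitrary; the metric query derives it from its configuration variables):

```sql
-- Sliding "current" interval: [now() - 1 month, now()],
-- rather than the calendar month to date.
select now() - interval '1 month' as i_start,
       now() as i_finish
union all
-- Closed calendar-month intervals before it, newest first.
select date_trunc('month', now()) - interval '1' month * s - interval '1 month',
       date_trunc('month', now()) - interval '1' month * s - interval '1 second'
from generate_series(0, 7) as s;
```

The quarterly variant is identical except it uses date_trunc('quarter', ...) and 3-month steps.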

Configuration

There are 5 configurable variables in the code of this metric:

  • Number of intervals to show. How many bars do you want to see on the chart? Default: 9.

  • Interval type. If you don't release often, you probably don't want to see gaps for some months, so it's better to set the interval type to quarters. For example, if there was 1 release in Jan, none in Feb, and 2 in Mar, the Q1 bar will aggregate all 3 releases instead of showing an empty Feb. Default: month.

  • Calculation logic. There are 2 options:

    • 1 = based on hotfix release criteria. Use this if you have hotfix releases and can easily distinguish them from regular releases.

    • 2 = based on top-priority production defect(s) assigned to a release. Here a release is treated as a (partial) failure based on the top-priority production defects assigned to it. The default logic is:

      • 1 blocker-priority production defect = hotfix release (100% failure rate for that release)

      • 1 critical-priority production defect adds 20% to the failure rate of a release (capped at 100%)

      • Example: there were 10 releases; 1 release had 2 blockers (counted as 100%), 1 release had 6 criticals (6 × 20% = 120%, capped at 100%), 1 release had 1 critical defect (20%), and the other 7 releases had no top-priority production defects. Result: CFR = 100% × (1 + 1 + 0.2) / 10 = 22%

  • Hotfix criteria. The criteria that distinguish hotfixes from regular releases. Default: the strings "hot" and "fix" both appear in either the name or the description of the release, OR the patch version (if the release name is in MAJOR.MINOR.PATCH format) is not equal to 0. Please set your own hotfix criteria for the releases used on your project.

  • Production defects. The criteria that identify top-priority production defects (by default: Blocker/Critical incidents, or bugs tied to a production environment).
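As an example of adjusting the configuration, the default hotfix criteria could be replaced with a project-specific convention. The fragment below is a sketch only: the "HF-" name prefix is a hypothetical convention, not part of the default logic, while the non-zero-patch check is taken from the default.

```sql
hotfix_criteria as --criteria for hotfix releases (project-specific example)
(
  select id
  from release
  where name like 'HF-%' --hypothetical naming convention for hotfix releases
     or (name like '%.%.%' and name not like '%.%.0%') --non-zero patch version
)
```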

Code

Here is the code of the metric:

with number_of_intervals_to_show as
(
  select 9 --how many time intervals to show on the chart
  as show
),
interval_type as
(
  select 1 --1=month, 2=quarters
  as i_type
),
calculation_logic as
(
  select 2 --1=based on hotfix release criteria; 2=based on production top-priority defect(s) assigned to release
),
hotfix_criteria as --criteria for hotfix releases
(
  select id
  from release
  where (lower(name) like '%hot%' and lower(name) like '%fix%')
     or (lower(description) like '%hot%' and lower(description) like '%fix%')
     or (name like '%.%.%' and name not like '%.%.0%')
),
interval as --preparing intervals - months or quarters
(
  select *
  from
  (
    select now() as i_finish,
           now() - interval '3 month' as i_start,
           2 as type
    union
    select (date_trunc('quarter', now()) - interval '3' month * generate_series(0, (select show from number_of_intervals_to_show) - 2)) - interval '1 second' as i_finish,
           (date_trunc('quarter', now()) - interval '3' month * generate_series(0, (select show from number_of_intervals_to_show) - 2)) - interval '3 month' as i_start,
           2 as type
    union
    select now() as i_finish,
           now() - interval '1 month' as i_start,
           1 as type
    union
    select (date_trunc('month', now()) - interval '1' month * generate_series(0, (select show from number_of_intervals_to_show) - 2)) - interval '1 second' as i_finish,
           (date_trunc('month', now()) - interval '1' month * generate_series(0, (select show from number_of_intervals_to_show) - 2)) - interval '1 month' as i_start,
           1 as type
  ) as temp
  where type = (select i_type from interval_type)
),
production_defects as --production top-priority defects criteria
(
  select *
  from ticket
  where priority in ('Blocker','Critical')
    and (type = 'Incident'
         or (type = 'Bug'
             and ((environment ~* 'prod|live' and environment !~* 'preprod|pre-prod')
                  or lower(summary) ilike any (array['prod%','[prod%'])
                  or (labels::text ~* 'prod|live|client' and labels::text !~* 'pre-prod|preprod|postprod|post-prod'))))
),
releases as --selecting releases finished within intervals
(
  select case when (select i_type from interval_type) = 1
              then to_char(i_finish, 'YYYY MM')
              else to_char(i_finish, 'YYYY "Q"Q')
         end as timeline,
         a.*
  from interval
  left join release a
    on a.finish_date >= i_start and a.finish_date <= i_finish
),
tickets_releases as --collecting tickets released, marking production top-priority defects
(
  select a.timeline, a.name, b.key, b.type, b.created, b.priority, b.environment, b.story_points, b.summary
  from releases a
  left join production_defects b
    on a.id = any(b.fix_releases)
),
calculation_logic2 as
(
  select timeline,
         name,
         least(100.0 * sum(case when priority = 'Blocker' then 1
                                when priority = 'Critical' then 0.2
                                else 0 end), 100) --/nullif(count(key),0)
         as cfr
  from tickets_releases
  group by timeline, name
)
select timeline, change_failure_rate
from
(
  select timeline::varchar as timeline,
         100.0 * sum(case when id in (select id from hotfix_criteria) then 1 else 0 end) / nullif(count(id), 0) as change_failure_rate,
         1 as type
  from releases
  group by timeline
  union
  select timeline::varchar as timeline,
         avg(cfr) as change_failure_rate,
         2 as type
  from calculation_logic2
  group by timeline
) as temp
where type = (select * from calculation_logic)
order by timeline asc

RAG Thresholds

The default RAG (Red/Amber/Green) thresholds are:

  • RED: >30% 

  • AMBER: 15%-30%

  • GREEN: <15%

These thresholds correspond to Google's 2022 State of DevOps Report: https://services.google.com/fh/files/misc/2022_state_of_devops_report.pdf

Drill-down

There are two types of drill-downs, depending on the calculation logic you've chosen.

Calculation Logic 1

Drill-down query to show Release Name, Release Date, Description, and Hotfix flag for releases completed within the clicked interval:

select name as "Release Name",
       finish_date as "Release Date",
       description as "Description",
       case when id in (select id from hotfix_criteria) then 'Yes'
            when id is null then null
            else 'No'
       end as "Hotfix?"
from releases
where timeline = clicked_x_value
order by finish_date desc

Calculation Logic 2

Drill-down query to show the top-priority production defects assigned to releases completed within the clicked interval:

select name as "Release Name",
       key,
       summary,
       type,
       priority
from tickets_releases
where timeline = clicked_x_value
  and key is not null
order by priority asc

Aggregation

If you have several teams and you want to see Change Failure Rate across all your teams, here is an example of how to do that.

Aggregation dataset:

Aggregation metric's code:

As a result, you will see the change failure rate across all your teams. 

In the drill-down you will have change failure rate values for each team. Here is the code:
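One possible shape for such an aggregation, as a sketch only: it assumes each team's metric query is saved as a dataset with timeline and change_failure_rate columns, and team_a_cfr and team_b_cfr are hypothetical dataset names you would replace with your own.

```sql
select timeline,
       avg(change_failure_rate) as change_failure_rate --simple average across teams
from
(
  select timeline, change_failure_rate from team_a_cfr
  union all
  select timeline, change_failure_rate from team_b_cfr
) as all_teams
group by timeline
order by timeline asc
```

A weighted average (by number of releases per team) may be more appropriate if team release volumes differ significantly.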

How to improve Change Failure Rate

If you're not in a GREEN zone for this metric, please consider the following ways to reduce your change failure rate:

  • Work on small and self-contained changes: smaller changes are easier to test and less likely to break;

  • Leverage infrastructure and application configuration as code: guarantee that mission-critical configurations and services’ infrastructure are visible and reproducible;

  • Use test automation: incorporating automated tests at every stage of the CI/CD pipeline helps catch issues earlier and reduces delivery times;

  • Create deployment strategies: ad hoc deployments carry a lot of risk, so many teams have adopted deployment plans instead. Canary releases, blue-green releases, and rolling releases are the three most prevalent deployment methods.