CI/CD performance

Continuous integration (CI) and continuous deployment (CD) are best practices for automating your software delivery process. To stay confident that your software is always ready to deliver, you need visibility into your CI and CD pipelines so you can see where to make improvements. Knowing which builds take the longest to run, and which have high failure rates and long recovery times, is essential to efficient software delivery. A view into the issues that slow down work empowers you to make the improvements that keep your engineers productive and value flowing, giving you higher confidence that software will be delivered on time.

Which jobs take the most time to complete? Which take the longest to recover after a failure? The CI/CD performance screen answers these questions by reporting on the overall health and activity of your product’s CI and CD jobs.

Using this screen, you can:

  • Keep track of the delivery automations that are most important to your team.

  • Determine which jobs run most frequently and spend the most time building so you can look for opportunities to shorten the wait for feedback from your most important jobs.

  • Determine which jobs fail most frequently so you can pinpoint where to spend your efforts stabilizing builds to improve your confidence that code changes will be delivered successfully.

  • Determine which jobs take the longest to recover when they fail so you can shorten the time that your delivery pipeline is broken.

Insights gained

By examining the CI/CD performance results, you can gain insights into a variety of efficiency-related behaviors of the build processes, such as identifying:

  • Jobs with high build failure rates

  • Jobs with high build times that can be improved

  • Long recovery times that need to be investigated

Addressing the issues identified on the CI/CD performance screen can help optimize your software delivery system. Refer to the Recommended practices for examples.

Runs, builds, and jobs

The data used on the CI/CD performance screen is imported from Jenkins build logs into the System of Record. Most of the data is derived from jobs, runs, and builds.

Build

In Software Delivery Management, a build is a specific instance of a pipeline or job run, such as a Jenkins build run.

Deployment job

A CI/CD job that deploys code changes or artifacts to production. Deployment jobs are defined using the Manage jobs screen.

Job

A job is the definition of an automation that can be executed.
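
To make these terms concrete, here is a minimal sketch of how the imported data might be modeled in Python. The class and field names are illustrative assumptions, not the actual System of Record schema.

from dataclasses import dataclass, field
from datetime import datetime, timedelta

@dataclass
class Build:
    """One run of a job, for example a single Jenkins build."""
    number: int
    started_at: datetime
    duration: timedelta
    result: str  # e.g. "SUCCESS", "FAILURE", "ABORTED"

@dataclass
class Job:
    """The definition of an automation that can be executed."""
    name: str
    is_deployment: bool = False   # True for jobs that deploy to production
    builds: list[Build] = field(default_factory=list)  # the job's runs/builds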

Requirements

For data to appear on the CI/CD performance screen, you must have:

UI elements

The CI/CD performance screen has two main UI elements.

Figure 1. UI elements
Table 1. UI elements
Number Element Use

1

Time range drop-down

Filters the data displayed by time range.

2

Job data report

Displays information about jobs, runs, build times, failures, failure rates, and recovery times. Refer to Understanding the data for details.

Understanding the data

The CI/CD performance screen helps you understand the overall health of the automated software delivery pipeline. The product is linked to the set of jobs that are part of its delivery process.

The CI/CD performance insights allow you to see which jobs run most frequently and how long they typically take to complete. You can use this information to decide which jobs may benefit from optimization investments, shortening your build cycles for faster feedback and faster-flowing software.

Figure 2. Job data table

The failure columns let you see which jobs fail most often. Important jobs with high failure rates may be slowing down or, worse, destabilizing your delivery process.

Finally, the recovery columns provide insight into how long it takes a job to recover when it fails. High recovery times may indicate that the team is ignoring broken builds, which makes the delivery pipeline less reliable.

Table 2. Column meanings and actions
Column title What it shows How it is calculated Why it is important Actions to take

Job

Name of the job

Not applicable

The job title links directly to the job in the source system.

View the job in the source system.

Runs

Number of times a job has been executed within the selected time range.

Total number of builds that completed during this time range, excluding aborted or canceled builds.

Provides an indication of how often this job runs in your organization.

Sort the column to track jobs that have run the most and least often.

Build time (avg)

The average length of time it takes for a job to build.

Total build time divided by the number of runs.

Indicates the overall time spent waiting for the results or artifacts of this job. High build times can also block downstream work, such as delaying a deployment job, and may mean the team spends significant time waiting on results.

Try to shorten jobs by improving download times, reusing workspaces, and so on. Verify that validation tests are meaningful, non-redundant, and optimized to provide fast results. You can split out tests that are not necessary for every build so they run less frequently or at set intervals.

Build time (total)

Total time this job ran during the selected time range.

Sum of the completed build times.

Provides an overall cost impact of the build on your team or your CI system over the selected time range.

Look for opportunities to reduce the average build time of the jobs with the highest total build time; reducing the average by even 10% on those jobs frees up significant resources.

Failures

Number of failed builds for a job.

Count of builds with a failed result status.

Provides insights into how often a build is failing. Important jobs with a high number of failures may reduce the overall productivity and throughput of your team and delivery pipeline.

Identify root causes of failures and strategies to lower the number of build failures. See Recommended practices for additional suggestions.

Failure rate

Percentage of failures.

Number of build failures divided by the total number of runs.

Provides insights into the impact of the total number of build failures based on how often the job runs. Important jobs with a high failure rate may reduce the overall productivity and throughput of your team and delivery pipeline. A job with a high total number of failures but a low failure rate may not be worth investigating.

Identify root causes of failures and strategies to lower the number of build failures. Pay special attention to the most impactful jobs with high failure rates. See Recommended practices for additional suggestions.

Recovery (avg)

Average time it takes for a job to return to a successful state after a build failure.

Average time that it takes a job to transition from a failed state to a successful state after an initial build failure occurred.

Indicates how well a team responds to build failures. Long recovery times may indicate that your delivery pipeline is unreliable and not in a ready-to-deliver state.

Encourage your teams to respond proactively to broken builds and invest in methods to reduce failure rates. Pay attention to high recovery times as an indication of areas that need additional care; they may indicate that a team has become apathetic or inured to build failures.

Recovery (total)

Total time this job was in a failed state during the selected time range.

Sum of time it took failed jobs to return to a successful state.

Represents the total delivery outage time caused by the job being in a failed state.

Encourage your teams to respond proactively to broken builds and invest in methods to reduce failure rates. Pay attention to high recovery times as an indication of areas that need additional care; they may indicate that a team has become apathetic or inured to build failures.
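
To tie the calculations in Table 2 together, here is a minimal Python sketch that computes the column values for a single job from its build records. The record fields and result values are illustrative assumptions, not the actual import format; the recovery logic follows the descriptions above, measuring from the first failure in a failing streak to the next successful build.

from datetime import datetime, timedelta

# Hypothetical build records for one job, in chronological order.
# The fields mirror the columns above; they are illustrative, not a real schema.
builds = [
    {"result": "SUCCESS", "start": datetime(2023, 1, 1, 9),  "duration": timedelta(minutes=12)},
    {"result": "FAILURE", "start": datetime(2023, 1, 1, 11), "duration": timedelta(minutes=9)},
    {"result": "FAILURE", "start": datetime(2023, 1, 1, 14), "duration": timedelta(minutes=9)},
    {"result": "SUCCESS", "start": datetime(2023, 1, 2, 10), "duration": timedelta(minutes=13)},
    {"result": "ABORTED", "start": datetime(2023, 1, 2, 15), "duration": timedelta(minutes=2)},
]

# Runs: completed builds only; aborted or canceled builds are excluded.
completed = [b for b in builds if b["result"] in ("SUCCESS", "FAILURE")]
runs = len(completed)

# Build time (total) is the sum of completed build times;
# Build time (avg) divides that sum by the number of runs.
build_time_total = sum((b["duration"] for b in completed), timedelta())
build_time_avg = build_time_total / runs if runs else timedelta()

# Failures is the count of failed builds; Failure rate divides it by runs.
failures = sum(1 for b in completed if b["result"] == "FAILURE")
failure_rate = failures / runs if runs else 0.0

# Recovery: the clock starts at the first failure of a failing streak and
# stops at the next successful build.
recoveries = []
failed_since = None
for b in completed:
    if b["result"] == "FAILURE" and failed_since is None:
        failed_since = b["start"]      # streak begins with the initial failure
    elif b["result"] == "SUCCESS" and failed_since is not None:
        recoveries.append(b["start"] - failed_since)
        failed_since = None            # the job has recovered

recovery_total = sum(recoveries, timedelta())
recovery_avg = recovery_total / len(recoveries) if recoveries else timedelta()

print(runs, build_time_avg, failure_rate, recovery_avg)

For the sample data, this yields 4 runs, a 50% failure rate, and a single 23-hour recovery; note that two consecutive failures count as one failing streak for recovery purposes.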

Viewing and filtering data

Any column in the table can be sorted by selecting the column heading.

The Time range drop-down provides four options, representing a week, a month, three months (a quarter), and a year. The selected range determines the period of data displayed in the table. The default is 7 days.

To change the time range:

  1. Select the Time range drop-down.

  2. Select the desired time range: 7 days, 30 days, 90 days, or 365 days.
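
As a sketch of what this filtering amounts to, the hypothetical Python helper below keeps only the builds that started within the selected window; the range labels mirror the drop-down options, and the build record shape follows the earlier sketch.

from datetime import datetime, timedelta

# Map each drop-down label to a window length in days.
TIME_RANGES = {"7 days": 7, "30 days": 30, "90 days": 90, "365 days": 365}

def filter_builds(builds, range_label="7 days", now=None):
    """Keep only builds whose start time falls within the selected range."""
    now = now or datetime.now()
    cutoff = now - timedelta(days=TIME_RANGES[range_label])
    return [b for b in builds if b["start"] >= cutoff]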

Recommended practices

Here are some recommendations from DevOps and CI/CD practice to help keep your build processes healthy:

  • Keep build times for key jobs short - Important jobs that run frequently should take as little time as possible so that you are getting feedback fast and the job is always ready for the next commit.

  • Limit the number of failed builds - Failing builds block your delivery process, interrupt the team, and may indicate broken or missing practices in a team’s software development process.

  • When jobs fail, fix them fast - A broken build is blocking your team from delivering new value.

Examples and solutions

Frequently run jobs and key deployment jobs are probably the most important jobs in your product. Ideally, these jobs should have low failure rates and low recovery times.

You can use this page to determine how you can improve these failure rates and recovery times.

Let’s say that you look at the Runs column to find the most active jobs; sorting it lists the jobs with the most runs at the top. By default, the CI/CD performance table is ranked by calculated total build time, which represents the highest impact on your team. Ideally, you want the numbers under Failure rate to be as low as possible.

The table below provides some potential actions to take.

Table 3. Insights and actions
Insights from metrics Causes Actions to take

High build time causing team to wait

Job definitions may be doing more than the minimum needed to verify the integrity of a code change.

Large code bases take a long time to build.

Jenkins controllers may be waiting for agents.

Split the job into multiple jobs to separate the goal of fast feedback from comprehensive verification.

Focus testing on most important tests and split comprehensive tests out into another job.

Add automation and avoid using manual approval steps.

If the build server is overloaded, add dedicated build agents or use higher performance machines.

High failure rates

The team could be checking in code with errors and test failures.

Set up verification builds so code is built before the PR is merged. This reduces code check-ins that break the main branch.

Fix unreliable tests that may be causing false failures.

Failing builds aren’t really broken (false failures).

These can be caused by unreliable tests. If a test reports false failures, teams learn to ignore the results, and a genuinely broken build may go unnoticed for a long time.

Fix or remove tests that don’t reliably work the same way each time they run.

Build failures take a long time to recover from.

The team may be ignoring broken builds. Build fixes may just be complicated.

The job may run infrequently, so the fixed build doesn’t show up until much later.

Monitor build outputs for unexpected recovery time increases.

Encourage a culture of urgency for fixing broken builds.
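
To put these actions into practice programmatically, a small sketch like the Python example below could flag jobs whose metrics cross thresholds you choose. The threshold defaults and metric names are assumptions for illustration, building on the metrics computed earlier.

from datetime import timedelta

def flag_unhealthy_jobs(job_metrics, max_failure_rate=0.10,
                        max_avg_recovery=timedelta(hours=4)):
    """Flag jobs whose failure rate or average recovery time looks unhealthy.

    job_metrics maps a job name to metrics like those computed above;
    the threshold defaults are examples, not recommendations.
    """
    flagged = []
    for name, metrics in job_metrics.items():
        if metrics["failure_rate"] > max_failure_rate:
            flagged.append((name, "high failure rate"))
        if metrics["recovery_avg"] > max_avg_recovery:
            flagged.append((name, "slow recovery"))
    return flagged

# Example: a deployment job failing 25% of the time gets flagged.
print(flag_unhealthy_jobs({
    "deploy-prod": {"failure_rate": 0.25, "recovery_avg": timedelta(hours=1)},
}))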