CI/CD performance

CloudBees Software Delivery Management is a preview, with early access for select preview members. Product features and documentation are frequently updated. If you find an issue or have a suggestion, please contact CloudBees Support. Learn more about the preview program.

Continuous integration (CI) and continuous deployment (CD) help automate build and deployment systems. To keep these systems optimized, you need to track which builds take the longest time to run and which builds have high failure rates.

Which jobs take the most time to complete, and which take the longest to recover after a failure? The CI/CD performance screen answers these questions by showing the overall health and activity of your product’s CI and CD jobs.

Requirements

For data to appear on CI/CD performance, you must have:

Runs, builds, and jobs

The data shown on the CI/CD performance screen is imported from Jenkins build logs into the System of Record. Most of the data is derived from jobs, runs, and builds.

Build

In Software Delivery Management, a build is a specific instance of a pipeline or build run. For example, a Jenkins build run.

Deployment job

A CI/CD job that deploys code changes or artifacts to production. Deployment jobs are defined on the Manage jobs screen.

Job

A job is the definition of an automation that can be executed.

Understanding the data

The health of a job is reflected by the failures, failure rates, and average and total recovery times. A job’s activity is indicated by runs, average build time, and total build time.

Recovery represents how long it takes for a job to return to a successful run after a failure. Low failure rates with fast recovery indicate good build health.
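As a rough illustration of the recovery idea described above, the following sketch measures the time from the first failure in a streak to the next successful run. The `(finished_at, result)` tuple layout and the `"SUCCESS"` and `"FAILURE"` values are assumptions made for this example, not the product's actual data model or algorithm.

```python
from datetime import datetime, timedelta

def recovery_times(builds):
    """Return one timedelta per failure streak that ended in a success.

    `builds` is a list of (finished_at, result) tuples for a single job.
    The tuple layout and result strings are hypothetical.
    """
    recoveries = []
    failed_since = None  # finish time of the first failure in the current streak
    for finished_at, result in sorted(builds):
        if result == "FAILURE" and failed_since is None:
            failed_since = finished_at
        elif result == "SUCCESS" and failed_since is not None:
            recoveries.append(finished_at - failed_since)
            failed_since = None
    return recoveries

builds = [
    (datetime(2024, 1, 1, 9, 0), "SUCCESS"),
    (datetime(2024, 1, 1, 10, 0), "FAILURE"),
    (datetime(2024, 1, 1, 11, 0), "FAILURE"),
    (datetime(2024, 1, 1, 13, 0), "SUCCESS"),
    (datetime(2024, 1, 2, 8, 0), "FAILURE"),
    (datetime(2024, 1, 2, 8, 30), "SUCCESS"),
]

recoveries = recovery_times(builds)
total_recovery = sum(recoveries, timedelta())
avg_recovery = total_recovery / len(recoveries) if recoveries else timedelta()
print(total_recovery, avg_recovery)  # 3:30:00 1:45:00
```

In this sample, the job recovers twice: once after three hours and once after 30 minutes, giving a total recovery of 3.5 hours and an average of 1 hour 45 minutes.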

Column title: Job

What it shows: Name of the job.

How it is calculated: Not applicable.

Why it is important: The job title is linked to the source system.

Actions to take: View the job in the source system.

Column title: Runs

What it shows: Number of times a job has been executed within the selected time range.

How it is calculated: Total number of builds that completed during this time range, excluding aborted or canceled builds.

Why it is important: Provides an indication of how often this build is run in your organization.

Actions to take: Sort the column to track jobs that have run the most and least often.

Column title: Build time (avg)

What it shows: The average length of time it takes for a job to build.

How it is calculated: Total build time divided by the number of runs.

Why it is important: Indicates the overall time spent waiting on results or artifacts of this job. It can also indicate issues that could block downstream work, such as delaying a deployment job. High build times may indicate that the team is waiting on results.

Actions to take: Try to shorten jobs by improving download times, reusing workspaces, and so on. Verify that validation tests are meaningful, non-redundant, and optimized to provide fast results. You can split out tests that are not necessary for every build and run them less frequently or at set intervals.

Column title: Build time (total)

What it shows: Total time this job ran during the selected time range.

How it is calculated: Sum of the completed build times.

Why it is important: Provides an overall cost impact of the build on your team or your CI system over the selected time range.

Actions to take: Look for opportunities to reduce the average build time for the jobs with the highest total build time. By reducing the average build time by 10% on those jobs, you can free up the resources they consume.

Column title: Failures

What it shows: Number of failed builds for a job.

How it is calculated: Count of builds with a failed result status.

Why it is important: Provides insights into how often a build is failing. Important jobs with a high number of failures may reduce the overall productivity and throughput of your team and delivery pipeline.

Actions to take: Identify root causes of failures and strategies to lower the number of build failures. See Recommended practices for additional suggestions.

Column title: Failure rate

What it shows: Percentage of runs that failed.

How it is calculated: Number of build failures divided by the total number of runs.

Why it is important: Provides insights into the impact of the total number of build failures based on how often the job runs. Important jobs with a high failure rate may reduce the overall productivity and throughput of your team and delivery pipeline. A job with a high total number of failures but a low failure rate may not be worth investigating.

Actions to take: Identify root causes of failures and strategies to lower the number of build failures. Pay special attention to the most impactful jobs with high failure rates. See Recommended practices for additional suggestions.

Column title: Recovery (avg)

What it shows: Average time it takes for a job to return to a successful state after a build failure.

How it is calculated: Average time that it takes a job to transition from a failed state to a successful state after an initial build failure occurred.

Why it is important: Indicates how well a team responds to build failures. Long recovery times may indicate that your delivery pipeline is unreliable and not in a ready-to-deliver state.

Actions to take: Encourage your teams to respond proactively to broken builds and invest in methods to reduce failure rates. Pay attention to high recovery times as an indication of areas that need additional care. They may indicate that a team is apathetic or inured to build failures.

Column title: Recovery (total)

What it shows: Total time this job was in a failed state during the selected time range.

How it is calculated: Sum of the time it took the job to return to a successful state after each failure.

Why it is important: Represents the total delivery outage time caused by the job being in an ineffective state.

Actions to take: Encourage your teams to respond proactively to broken builds and invest in methods to reduce failure rates. Pay attention to high recovery times as an indication of areas that need additional care. They may indicate that a team is apathetic or inured to build failures.
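The calculations described for Runs, Build time, Failures, and Failure rate can be sketched as follows. This is a minimal illustration only: the `Build` class, its field names, and the result strings are assumptions for this example and do not reflect how Software Delivery Management stores or computes its data.

```python
from dataclasses import dataclass

@dataclass
class Build:
    job: str
    result: str          # "SUCCESS", "FAILURE", or "ABORTED" (assumed values)
    duration_min: float  # build time in minutes

def job_metrics(builds):
    """Summarize one job's builds for a selected time range."""
    # Runs exclude aborted or canceled builds.
    completed = [b for b in builds if b.result != "ABORTED"]
    runs = len(completed)
    total_time = sum(b.duration_min for b in completed)
    failures = sum(1 for b in completed if b.result == "FAILURE")
    return {
        "runs": runs,
        "build_time_total": total_time,
        "build_time_avg": total_time / runs if runs else 0.0,
        "failures": failures,
        "failure_rate_pct": 100 * failures / runs if runs else 0.0,
    }

builds = [
    Build("app-build", "SUCCESS", 10.0),
    Build("app-build", "FAILURE", 6.0),
    Build("app-build", "SUCCESS", 12.0),
    Build("app-build", "ABORTED", 2.0),
    Build("app-build", "SUCCESS", 12.0),
]
metrics = job_metrics(builds)
print(metrics)
```

With this sample data, the aborted build is ignored, leaving 4 runs, 40 minutes of total build time, a 10-minute average, 1 failure, and a 25% failure rate.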

Viewing and filtering data

Any column in the table can be sorted by selecting the column heading.

The Time range drop-down provides four range options, representing a week, a month, three months (a quarter), and a year. The selected time range changes the period of time represented. The default value is 7 days.

To change the time range:

  1. Select the Time range drop-down.

  2. Select the desired time range: 7 days, 30 days, 90 days, or 365 days.

Frequently run jobs and key deployment jobs are probably the most important jobs in your product. Ideally, these jobs should have low failure rates and low recovery times.

You can use this page to determine how you can improve these failure rates and recovery times. For example, sort by the Runs column to find the most active jobs; the jobs with the most runs are then listed at the top. Unless you sort by another column, the CI/CD performance table is ranked by the highest impact on your team, based on the calculated total build time. Ideally, you want the numbers under Failure rate to be as low as possible.
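To make the ranking by total build time concrete, here is a small worked example. The job names and numbers are invented for illustration, and total build time is approximated as runs multiplied by average build time.

```python
# Hypothetical jobs: runs in the selected time range and average build time in minutes.
jobs = {
    "unit-tests": {"runs": 400, "build_time_avg": 5.0},
    "integration-tests": {"runs": 120, "build_time_avg": 30.0},
    "deploy-prod": {"runs": 20, "build_time_avg": 15.0},
}

def total_build_time(job):
    """Total build time in minutes, approximated as runs x average build time."""
    return job["runs"] * job["build_time_avg"]

# Rank jobs by their impact (total build time), highest first.
ranked = sorted(jobs, key=lambda name: total_build_time(jobs[name]), reverse=True)

# Estimate the minutes freed by a 10% reduction in average build time on the top job.
reduction_pct = 10
top_job = ranked[0]
minutes_saved = total_build_time(jobs[top_job]) * reduction_pct / 100

print(ranked)
print(top_job, minutes_saved)
```

Here the integration tests run less often than the unit tests but consume the most time overall (3,600 minutes), so a 10% reduction in their average build time frees about 360 minutes over the selected range.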

The table below provides some potential actions to take.

Issue: High failure rates

Causes: The team could be checking in code that breaks things.

Actions to take: Set up verification builds so that code is built before the PR is merged. This reduces code check-ins that break the main branch.

Issue: False failures

Causes: Could be caused by flaky tests. If the team can’t trust that a build is really broken, people stop believing that broken builds are broken, and the failures get ignored.

Actions to take: Make your builds and your failures as trustworthy as possible.

Issue: Verification builds don’t run all of the tests

Causes: Tests that fail frequently are also part of your verification tests.

Actions to take: Add tests to pre-commit builds. This could provide higher-quality feedback.
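One way to look for the flaky tests mentioned under false failures is to check for tests that both passed and failed against the same commit. The sketch below is only an illustration of that idea; the `(test, commit, passed)` record layout is an assumption, and this check is not a feature of the CI/CD performance screen.

```python
from collections import defaultdict

def flaky_tests(results):
    """Return the names of tests with mixed outcomes on the same commit.

    `results` is an iterable of (test, commit, passed) tuples (hypothetical layout).
    """
    outcomes = defaultdict(set)
    for test, commit, passed in results:
        outcomes[(test, commit)].add(passed)
    # A test is suspect if one commit produced both a pass and a failure.
    return sorted({test for (test, _), seen in outcomes.items() if len(seen) == 2})

results = [
    ("test_login", "abc123", True),
    ("test_login", "abc123", False),   # same commit, different outcome
    ("test_checkout", "abc123", False),
    ("test_checkout", "def456", True),  # fixed by a later commit, not flaky
    ("test_search", "abc123", True),
]
suspects = flaky_tests(results)
print(suspects)  # ['test_login']
```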