CloudBees CI Modern Best Practices for Reliability


Issue

This document presents a series of best practices for CloudBees CI on Modern Platforms, with a focus on reliability.

It extends the CloudBees CI Performance Best Practices with the particularities of running a CI platform in Kubernetes.

Scaling

Node Autoscaling

It is strongly recommended that you leverage Kubernetes' autoscaling mechanisms to automatically scale cluster services when resource consumption surges. With the Cluster Autoscaler, the number of nodes is adjusted dynamically on demand, keeping load at an optimal level and avoiding capacity shortfalls.

CloudBees CI autoscaler configuration is presented in the documentation for EKS and GKE.

Different Node Groups for Applications and Agents

The lifecycle and mission of CloudBees CI applications (Operation Center and Controllers) and agents are different. The Operation Center centralizes the management of Controllers, and Controllers orchestrate ephemeral agents to perform builds. For this reason, it is convenient to schedule these pods on different types of nodes by using taints and pod affinity (or a node selector). For example, Spot instances would be a valid choice for the agent node pool, but not for the applications.
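As an illustration, the following is a minimal sketch of an agent pod spec steered to a dedicated node pool. The label node-pool=agents and the taint dedicated=agents:NoSchedule are hypothetical names that you would define on the agent node group:

```yaml
# Sketch of an agent pod spec (label and taint names are hypothetical;
# they must match the ones defined on the agent node group).
apiVersion: v1
kind: Pod
metadata:
  name: build-agent
spec:
  nodeSelector:
    node-pool: agents            # schedule only on the agent node group
  tolerations:
    - key: dedicated
      operator: Equal
      value: agents
      effect: NoSchedule         # tolerate the taint that keeps other workloads off
  containers:
    - name: jnlp
      image: jenkins/inbound-agent:latest
```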

Prevent Scale Down Eviction

CloudBees CI applications are expensive to evict and must be restarted if interrupted. The Cluster Autoscaler attempts to scale down any node below the scale-down-utilization-threshold, which interrupts any remaining pods on the node. This can be prevented by ensuring that pods that are expensive to evict are protected by an annotation recognized by the Cluster Autoscaler. Expensive-to-evict pods carry the annotation cluster-autoscaler.kubernetes.io/safe-to-evict=false, which is included in the product by default.
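For reference, this is what the annotation looks like on a pod spec. CloudBees CI already sets it on its application pods, so this is only a sketch with an illustrative pod name:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: managed-controller-0     # illustrative pod name
  annotations:
    # Tells the Cluster Autoscaler not to evict this pod during scale-down.
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
spec:
  containers:
    - name: jenkins
      image: cloudbees/cloudbees-core-mm:latest   # illustrative image reference
```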

Overprovisioning

The Cluster Autoscaler triggers a scale-up of the worker nodes only when Pods in the cluster are already Pending. Hence, there may be a delay between the time CloudBees CI needs more capacity and when it actually obtains that capacity.

If this delay is a problem for your use case, consider overprovisioning the cluster. There are different techniques, like the one recommended in the AWS Cluster Autoscaler guidance: pause Pods combined with the Priority Preemption feature.
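A minimal sketch of the pause-Pod technique: a low-priority Deployment reserves headroom, and its placeholder pods are preempted as soon as real workloads need the capacity. Names and sizes below are illustrative:

```yaml
# A negative-priority class so placeholder pods are always preempted first.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -1
globalDefault: false
description: "Priority class for placeholder (pause) pods"
---
# Placeholder pods that reserve capacity (illustrative sizes).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 2
  selector:
    matchLabels:
      app: overprovisioning
  template:
    metadata:
      labels:
        app: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "1"
              memory: 2Gi
```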

Consider Quotas and Limits of the Underlying Infrastructure in Your Design

When designing a CloudBees CI Modern platform, it is important to consider the quotas and hard limits of the required infrastructure components. For example, according to the EFS resource quotas, the maximum number of access points per file system is 120. Since the Operation Center and each Controller require their own access point, two EFS instances would be needed to support a CloudBees CI platform with more than 120 instances in total.

Resource Management

Define Guaranteed QoS for CloudBees CI Applications

Correctly sized requests are particularly important when using a node auto-scaling solution. These tools look at your workload requests to determine the number and size of nodes to be provisioned. If your requests are too small with larger limits, you may find your workloads evicted or OOM killed if they have been tightly packed on a node.

Configure a Quality of Service class of Guaranteed for CloudBees CI application Pods by defining requests=limits for every container in the Pod, which makes the Pod be assigned a QoS class of Guaranteed. This ensures that the container will not be killed if another Pod requests resources. (It works similarly to a high Pod priority, which indicates the importance of a Pod relative to other Pods: if a Pod cannot be scheduled, the scheduler tries to preempt (evict) lower-priority Pods to make scheduling of the pending Pod possible.)
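A sketch of a container spec that earns the Guaranteed QoS class: requests equal limits for both CPU and memory. Sizes are illustrative and should come from your own sizing exercise:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: managed-controller       # illustrative pod name
spec:
  containers:
    - name: jenkins
      image: cloudbees/cloudbees-core-mm:latest   # illustrative image reference
      resources:
        requests:
          cpu: "2"
          memory: 8Gi
        limits:
          cpu: "2"               # equal to the request
          memory: 8Gi            # equal to the request -> QoS class Guaranteed
```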


Optimizing resources with hibernation of managed controllers

Enable hibernation of Managed Controllers to free up compute capacity from the less active Controllers.

The CloudBees CI hibernation for Managed Controllers feature takes advantage of running CloudBees CI on Kubernetes by automatically scaling the Kubernetes replicas of a CloudBees CI Managed Controller StatefulSet to zero (think of it as shutting down) after a specified amount of inactivity. This results in no Kubernetes pods running for the CloudBees CI Managed Controller StatefulSet and no CPU or memory cluster resources being used. This feature also automatically scales the replicas of a CloudBees CI Managed Controller StatefulSet back to one (think of it as starting up a controller) for certain events, such as GitHub webhooks, Jenkins cron-based job triggers, and any HTTP GET to the GET proxy for hibernation link for a Managed Controller.
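Hibernation is enabled through the Helm chart at installation time. A sketch of the relevant values follows; key names may vary by chart version, so verify them against your chart's documentation:

```yaml
# values.yaml fragment for the CloudBees CI Helm chart (a sketch;
# verify key names against your chart version).
Hibernation:
  Enabled: true    # deploys the hibernation monitor service
```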

Namespaces for Resource Usage

Adopting Kubernetes in larger organizations, where multiple teams access the same cluster, requires a custom approach to resource usage. Improper access provisioning often leads to conflicts among teams over resource usage.

To solve this, use namespaces to achieve team-level isolation for teams accessing the same cluster resources concurrently. Efficient use of namespaces helps create multiple logical cluster partitions, thereby allocating distinct virtual resources among teams through objects like ResourceQuota and LimitRange.

Resource Quotas help limit the amount of resources a namespace can use. The LimitRange object can help you implement minimum and maximum resources a container can request. Using LimitRange you can set default requests and limits for containers, which is helpful if setting compute resource limits is not a standard practice in your organization. As the name suggests, LimitRange can enforce minimum and maximum compute resource limits per Pod or Container in a namespace. It can also enforce minimum and maximum storage requests per PersistentVolumeClaim in a namespace.

Consider using LimitRange in conjunction with ResourceQuota to enforce limits at a container as well as namespace level. Setting these limits will ensure that a container or a namespace does not impinge on resources used by other tenants in the cluster.
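A sketch of the two objects applied to a hypothetical controllers namespace; names and sizes are illustrative:

```yaml
# Caps the total resources the namespace can consume (illustrative sizes).
apiVersion: v1
kind: ResourceQuota
metadata:
  name: controllers-quota
  namespace: cloudbees-controllers   # hypothetical namespace
spec:
  hard:
    requests.cpu: "40"
    requests.memory: 160Gi
    limits.cpu: "40"
    limits.memory: 160Gi
---
# Default and maximum per-container resources in the same namespace.
apiVersion: v1
kind: LimitRange
metadata:
  name: controllers-limits
  namespace: cloudbees-controllers
spec:
  limits:
    - type: Container
      default:                       # applied when a container sets no limits
        cpu: "2"
        memory: 4Gi
      defaultRequest:                # applied when a container sets no requests
        cpu: "1"
        memory: 2Gi
      max:
        cpu: "4"
        memory: 16Gi
```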

If required, CloudBees CI agents and Controllers can be deployed in different namespaces, and so can the Operation Center and Controllers.

Observability

Run Kubernetes Metrics Server

Combine Jenkins metrics with Kubernetes metrics and cloud vendor metrics (e.g. GKE metrics). For example, disk input/output metrics (which are among the recommended metrics to watch for CloudBees CI) are not included in the Jenkins metrics; therefore they need to be collected at the infrastructure level.

Kubernetes metrics require the Metrics Server to be running, which is not enabled by default in all types of installations.

Run Node Problem Detector

Installing the node-problem-detector add-on makes various node problems visible to the upstream layers in the cluster management stack. It is a daemon that runs on each node, detects node problems, and reports them to the apiserver.

Use Prometheus and Grafana for Monitoring

Kubernetes built-in tools for troubleshooting and monitoring are limited. The metrics-server collects resource metrics and stores them in memory but doesn’t persist them. For that reason a monitoring framework and a time-series database are required.

Of all the available solutions, Prometheus is the most convenient in a Kubernetes context, for the following reasons:

  • Prometheus was accepted into the CNCF on May 9, 2016 and has reached the Graduated project maturity level. As a result, long-term support can be expected.

  • It is agnostic to the Kubernetes cluster vendor, which can be convenient for scenarios like deploying CloudBees CI across multiple Kubernetes clusters.

  • Prometheus is a pull-based system that initiates queries to its targets, so it avoids adding overhead to the Kubernetes worker nodes (unlike monitoring agents that push data). The whole configuration is done on the Prometheus server side, not on the client side.

  • Kubernetes components, by default, emit metrics in Prometheus format (see Metrics For Kubernetes System Components | Kubernetes).

Install the CloudBees Prometheus Metrics plugin in CloudBees CI applications to expose metrics in Prometheus format.
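A sketch of a Prometheus scrape job for a controller, assuming the plugin's default /prometheus/ endpoint and a hypothetical hostname; adjust the path, scheme, and authentication to your setup:

```yaml
# prometheus.yml fragment (a sketch; the hostname is hypothetical and the
# metrics path assumes the plugin's default endpoint).
scrape_configs:
  - job_name: cloudbees-ci-controller
    metrics_path: /prometheus/
    static_configs:
      - targets:
          - controller.example.com   # hypothetical controller hostname
```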

While Prometheus is about how to store and query data, Grafana is about how to visualize that data so issues can be identified quickly. According to the CNCF, they are the perfect combo.

As a starting point for your Grafana dashboard design, you can use the existing community dashboards for Jenkins or CloudBees CI.


Adopt CasC within your Git-based workflow

Adopt deployment and Configuration as Code of CloudBees CI following GitOps principles as the preferred model, using Git as the single source of truth for all automation:

  • Infrastructure as Code for your Cloud Vendor resources (e.g. Kubernetes Cluster)

  • Helm charts for deployment and upgrades of CloudBees CI

  • CasC for CloudBees CI configuration (see the sketch after this list)
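As an illustration, a minimal CasC bundle descriptor for a controller might look like the following. This is a sketch with hypothetical file and bundle names; see the CloudBees CasC documentation for the full bundle format:

```yaml
# bundle.yaml - a minimal CasC bundle descriptor (a sketch).
apiVersion: "1"
id: "controller-bundle"          # hypothetical bundle id
description: "Configuration as Code bundle for a managed controller"
version: "1"
jcasc:
  - jenkins.yaml                 # Jenkins configuration (JCasC)
plugins:
  - plugins.yaml                 # plugin list for the controller
```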

Backup and Restore mechanism per scope

There are two approaches that can be complementary because they have different scopes:

Fault Tolerant Architecture

Multi-Zone Node Groups

Whenever possible, deploy your node groups across different availability zones and set volumeBindingMode: WaitForFirstConsumer on the default storage class (more info at Autoscaling issue when provisioning controllers in Multi AZ Environment).

This setting is not required for NFS.
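A sketch of a default storage class with this binding mode. The provisioner shown is the AWS EBS CSI driver as an example; substitute your platform's provisioner:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: ebs.csi.aws.com               # example: AWS EBS CSI driver
volumeBindingMode: WaitForFirstConsumer    # delay binding until the pod is scheduled
reclaimPolicy: Delete
```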

Disaster Recovery

A Disaster Recovery architecture for CloudBees CI, consisting of a primary (active) and a secondary (passive) Kubernetes cluster, provides a more reliable CI service.

Networking

Install Ingress Controller Separately

The CloudBees CI Helm chart allows an embedded installation of an Ingress controller under the ingress-nginx key. It is a better option to install the ingress-nginx Helm chart separately, to gain more flexibility.
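A sketch of the values involved, assuming the CloudBees CI chart and a hypothetical hostname; key names may vary by chart version, so verify them against your chart's documentation:

```yaml
# values.yaml fragment for the CloudBees CI chart (a sketch; verify key
# names against your chart version).
ingress-nginx:
  enabled: false                       # do not deploy the embedded controller
OperationsCenter:
  HostName: cloudbees-ci.example.com   # hypothetical hostname
  Ingress:
    Class: nginx                       # match the separately installed controller
```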

WebSockets for Connecting to Nodes Outside of the Main Cluster

When connecting any nodes outside the cluster (e.g. as described in Deploying CloudBees CI across multiple Kubernetes clusters), use WebSockets, which provide the following benefits:

  • They rely on TLS to secure connections, enhancing overall security.

  • They increase productivity by simplifying a previously complex administrative task.

  • They provide easier and more comprehensive support for any controller on any cloud platform.

Agent Pod Templates

Resource Management: Do Not Use CPU Limits for Ephemeral Pods

For agent resource management, the following best practices apply at the container level (see the sketch after this list):

  • For memory, set request and limit to the same value.

  • For CPU, use a request but no limit to allow usage of a potentially idle CPU. Note that availability of requested CPU is still guaranteed per container. More details: CPUThrottlingHigh (Prometheus Alert).
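A sketch of an agent container spec following both rules; the image and sizes are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: build-agent
spec:
  containers:
    - name: maven
      image: maven:3.9-eclipse-temurin-17   # illustrative build image
      resources:
        requests:
          cpu: "1"          # guaranteed CPU share; no CPU limit is set,
                            # so the container may use otherwise idle CPU
          memory: 2Gi
        limits:
          memory: 2Gi       # memory limit equals the memory request
```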

Share Workspaces across agents via PVC

By default, ephemeral agents destroy their workspace volume after finishing their builds. To share workspace content across agent pods for use cases like caching dependencies (e.g. Maven artifacts), attach an existing PVC to the agent pod definition.
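A sketch of an agent pod template mounting a pre-created PVC as a Maven cache. The claim name is hypothetical, and the PVC should use a storage class supporting ReadWriteMany if several agents run concurrently:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: build-agent
spec:
  containers:
    - name: maven
      image: maven:3.9-eclipse-temurin-17   # illustrative build image
      volumeMounts:
        - name: maven-cache
          mountPath: /root/.m2/repository   # Maven's local repository path
  volumes:
    - name: maven-cache
      persistentVolumeClaim:
        claimName: maven-repo-cache         # hypothetical pre-created PVC
```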

Keep container images concise

Use smaller container images, as they help you create faster builds. As a best practice, you should:

  • Use the most minimal base image that still suits your needs

  • Add necessary libraries and packages as required for your build

  • Where possible, use multi-stage builds to only keep the necessary software (leaving behind build dependencies)

Note that smaller images are also less susceptible to attack vectors due to a reduced attack surface.

Docker in Docker is no longer available inside the Cluster

Kubernetes deprecated support for Docker as a container runtime (the dockershim) in version 1.20, and removed it entirely in version 1.24.

Since then, Docker-in-Docker is no longer an option for building container images; moreover, it required privileged mode to function, which is a significant security concern.

Use Kaniko (or one of the many other options, such as Buildah or img).
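A minimal sketch of a Kaniko build pod; the Git repository, image destination, and registry-credentials secret name are hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: kaniko-build
spec:
  restartPolicy: Never
  containers:
    - name: kaniko
      image: gcr.io/kaniko-project/executor:latest
      args:
        - --context=git://github.com/example/app.git     # hypothetical repository
        - --destination=registry.example.com/app:latest  # hypothetical registry/image
      volumeMounts:
        - name: docker-config
          mountPath: /kaniko/.docker      # registry credentials for pushing
  volumes:
    - name: docker-config
      secret:
        secretName: regcred               # hypothetical dockerconfigjson secret
        items:
          - key: .dockerconfigjson
            path: config.json
```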