Cluster operations

Cluster operations is a facility for performing maintenance operations on various items in Operations Center, such as Client Masters and update centers. Different operations apply to different item types: for example, backing up or restarting Client Masters, or installing and upgrading plugins in update centers.

These operations can be run in two main ways: via a custom project type, or via preset operations embedded at various locations in the Jenkins UI.

Cluster operation projects

You create a Cluster Operation Project in the same way as you would any other project in Jenkins: select New Item in the view where you want to create it, give it a name, and select Cluster Operation as the item type.

Concepts

A Cluster Operation Project can contain one or more Operations that are executed in sequence one after the other when the project runs.

An Operation has:

  • A type of item that it operates on, for example, Client Master or Update Center.

  • A set of target items to operate on, obtained from a selected Source and reduced by a set of configured Filters. The target items are operated on in parallel; the maximum number of parallel items can be configured in the Advanced Operation Options section.

  • A list of Steps to perform in sequence on each target item.

The available sources, filters and steps depend on the target type that the Operation supports.

Tutorial

On the root level or within a folder of Operations Center:

  1. Select New Item.

  2. Specify a name for the cluster operation, for example "Quiet down all masters".

  3. Select Cluster Operation as the item type.

new item
Figure 1. Creating a new cluster operation

You will then be directed to the configuration screen for the newly created project.

configure
Figure 2. Creating a new cluster operation

Click on Add Operation and select Masters to add a new operation with the Client Master target type.

add operation
Figure 3. Creating a new cluster operation

Select From Operations Center Root as the source and add the filter called Is Online. This will select all Client Masters in Operations Center that are online when the operation is run.

We have now specified what to run the operation on; next, we will specify what to run by adding two steps.

Click on Add Step, select Execute Groovy Script on Master and enter the code:

System.out.println("==QUIET-DOWN@" + new Date());

This prints the text and the current date and time to the log on each Jenkins master, which can be handy for auditing later on.
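Since the same script runs on every targeted master, it can be handy to record which master produced each line. A minimal variation of the script, assuming the standard root Groovy context where the Jenkins singleton is available:

import jenkins.model.Jenkins

// Print a marker, the timestamp, and this master's configured root URL
// (getRootUrl() may return null if the Jenkins URL has not been set).
System.out.println("==QUIET-DOWN@" + new Date() + " on " + Jenkins.getInstance().getRootUrl());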

Add a new step called Prepare master for shutdown.

This step performs a function similar to clicking Prepare for Shutdown on the Manage Jenkins page of each master.

Your configuration should look something like the following when you’re done:

operation configured
Figure 4. Creating a new cluster operation

Save the project and then run it like any normal project. When the operation starts, it runs on each Client Master in parallel. Afterwards, each Client Master displays the standard notice Jenkins is going to shut down.

The user that runs the operation needs the RUN_SCRIPT permission on each Client Master for the Groovy step to work, as well as ADMINISTER for the prepare-for-shutdown step; otherwise the operation run will fail.

Controlling how to fail

Sometimes it’s desirable to modify how a failure affects the rest of the operation flow. On the configuration screen for the cluster operation project, each Operation section has an Advanced button. Selecting it reveals advanced controls such as the maximum number of masters to run in parallel, as well as Failure Mode and Fail As.

  • Fail As is a way to set the Jenkins build result that the run will get: Failure, Abort, Unstable, and so on.

  • Failure Mode controls what happens to the rest of the run if an operation step on an item fails.

    • Fail Immediately: abort anything in progress and fail immediately.

    • Fail Tidy: wait for anything currently running to finish, then fail. (All operations in the queue are cancelled.)

    • Fail At The End: let everything run to the end, and then fail.

Ad-hoc manual cluster operations

Operations Center comes with a couple of preset cluster operations that can be run on selected Client Masters directly from the side panel of a list view or a Client Master page. The list of preset cluster operations is stored under Manage Jenkins/Cluster Operations.

Running from a list view

Cluster operations provides a new list view column type called ClusterOp Item Selector, which appears by default as the rightmost column on new list views and on the All view.

all view
Figure 5. Ad-hoc cluster operation

For list views that existed before cluster operations was installed, you need to add the column by editing the view. As with all list views (except the All view), you can change the order of the columns to your liking.

Mark the Client Masters that you want to run an operation on by ticking the appropriate checkbox in the Op column; the selection on each view is remembered throughout your session.

In the left panel, there is a menu task named Cluster Operations. Hovering the mouse cursor over that task will reveal a clickable context menu marker. Click on it to open the context menu that contains the available operations for the view.

select run adhoc
Figure 6. Ad-hoc cluster operation

Clicking the Cluster Operations link in the side panel will take you to a separate page with the same operations list.

run adhoc listpage
Figure 7. Ad-hoc cluster operation

On the separate list page you can get to the project page of the preset operation (if you are an administrator) by clicking the gear icon next to the operation name.

Clicking the operation’s name, either in the context menu or on the separate list page, takes you to the run page, where you are asked to confirm running the operation; if the operation requires parameters, they are presented there. The run page also lists the selected Client Masters. Those not applicable for this run are shown with a strike-through and a brief explanation of why: either the master is the wrong type for the operation, or a configured filter removed it from the resource pool. Some operations, for example, are designed to run only on online masters, so any offline masters are filtered out.

ad hoc run now
Figure 8. Ad-hoc cluster operation
The list of masters, and whether the operation will run on them, is a preliminary display; the list is recalculated when the operation actually runs. Client Masters might therefore come back online, or go offline, between the time the list is displayed and the time you actually run the operation.

Running from a Client Master manage page

The same procedure can be applied to a single Client Master from its manage page. The steps are the same, except that because you are operating on a single Client Master, no selection in a list view is involved.

ad hoc from client master
Figure 9. Ad-hoc cluster operation

Operation run results and logs

Each run of a cluster operation project is accessible from the project page in the left panel, like any normal Jenkins project. On the run page you can see the operations that were executed, the items (Client Masters or update centers) they ran on, the result in the form of a colored ball (success/failure), and a link to the log files for each run.

run page
Figure 10. Cluster operation run page

Console Output in the left panel shows the overall console log for all operations. To see the individual console output of each operation on a Client Master or update center, use the (log) link next to each item on the run page, or the corresponding link in the overall console output.

Changing Domain Name and Enabling SSL

Initial Domain Names or IPs After Installation

After the CloudBees CI cluster is running, take note of the generated domain names or IP addresses; which you get depends on the infrastructure the cluster is running on.

AWS

On Amazon AWS it will look like this:

Controllers: ec2-52-91-242-61.compute-1.amazonaws.com,ec2-52-90-204-3.compute-1.amazonaws.com,ec2-54-173-172-182.compute-1.amazonaws.com
Workers    : ec2-52-90-206-206.compute-1.amazonaws.com,ec2-54-88-252-108.compute-1.amazonaws.com

Operations Center: http://core-controller-1009100199.us-east-1.elb.amazonaws.com.elb.example.net/cjoc/
Mesos: http://mesos.core-controller-1009100199.us-east-1.elb.amazonaws.com.elb.example.net
Marathon: http://marathon.core-controller-1009100199.us-east-1.elb.amazonaws.com.elb.example.net
...

The name of the ELB is the host portion of the Operations Center URL, between http:// and /cjoc (in this example, core-controller-1009100199.us-east-1.elb.amazonaws.com.elb.example.net). That is the name the DNS records have to point to.

OpenStack

In OpenStack, you can take the IP addresses for the controllers and create one or more A records for them instead of CNAME records.

Controllers: 192.0.2.109,192.0.2.110,192.0.2.111
Workers    : 192.168.2.44,192.168.2.42

CJOC    : http://192.0.2.109.nip.io/cjoc/
Mesos   : http://mesos.192.0.2.109.nip.io
Marathon: http://marathon.192.0.2.109.nip.io

The controller IPs in this example would be 192.0.2.109, 192.0.2.110, and 192.0.2.111.

Change Domain Name and/or Enable SSL

When the domain-name-change operation is carried out, it is assumed that the new domain name is operational and served by the local DNS server. Work with your operations department to create the new domain.

If enabling SSL, it is assumed that the certificates are valid for the machine running the installation.

The DNS domain can be changed with the domain-name-change operation:

$ cje prepare domain-name-change
domain-name-change is staged - review domain-name-change.config and edit as needed - then run 'cje apply' to perform the operation.

Edit the domain-name-change.config file to specify the new domain name and related parameters. For more information about the configuration parameters, see the Domain Options and SSL Configuration sections.
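As a sketch, a staged file could be edited along these lines. The parameter names below (domain_name, domain_separator, protocol) follow the conventions shown in the Domain options guide and SSL Configuration sections; treat the staged domain-name-change.config as the authoritative list.

domain-name-change.config excerpt (illustrative)
[tiger]
## New domain name for the cluster
domain_name = cje.example.com

## Separator between service names and the domain name
domain_separator = .

## Set to https when enabling SSL (see SSL Configuration)
protocol = https

Then apply the operation with cje apply.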

The DNS records need to be set up before executing cje apply.
This operation requires downtime because several applications will be reconfigured and restarted.
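You can verify that the new records resolve before applying, for example with the host(1) utility:

$ host cje.example.com
cje.example.com is an alias for <name of the ELB>.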

Domain options guide

You want to run your CloudBees CI service under your own domain. CloudBees CI uses URLs for the user-facing Jenkins instances and Operations Center, as well as for internal use.

To do this, you need to have access to your own domain name and access to the DNS settings.

The following options allow you to customize the URLs that will be used to access your cluster services. Depending on what your network administrator lets you do on your network, you may be in one of the following situations.

The target for the DNS records can be:

  • a domain name, in providers such as AWS where an Elastic Load Balancer (ELB) is created, for example, cje-controller-1009100199.us-east-1.elb.amazonaws.com. In this case a CNAME record needs to be created.

  • one or more IPs, in other providers such as OpenStack, for example, 192.0.2.109,192.0.2.110,192.0.2.111. In this case one or more A records need to be created.

You can create subdomains beyond 1 level

All the cluster services will be exposed as subdomains of the domain name you provided.

These snippets need to be tuned to fit your own domain name.

To use the domain cje.example.com, you would use:

cluster-init.config excerpt
[tiger]
...
domain_name = cje.example.com
domain_separator = .
DNS records

AWS

cje           IN CNAME  <name of the ELB>.
mesos.cje     IN CNAME  <name of the ELB>.
marathon.cje  IN CNAME  <name of the ELB>.

OpenStack

mesos.cje         IN A <IP of the lbaas instance>.
marathon.cje      IN A <IP of the lbaas instance>.
cje               IN A <IP of the lbaas instance>.

Then your cluster will be available at the following URLs:

  • http://cje.example.com/cjoc

  • http://marathon.cje.example.com

  • http://mesos.cje.example.com

A master named master-1 will be available as http://cje.example.com/master-1.

You cannot create subdomains beyond 1 level

Infrastructure services will be registered using the provided domain as a suffix.

These snippets need to be tuned to fit your own domain name.

For the domain cje.example.com, you would use:

cluster-init.config excerpt
[tiger]
...
domain_name = cje.example.com
domain_separator = -
DNS records

AWS

  mesos-cje         IN CNAME  <name of the ELB>.
  marathon-cje      IN CNAME  <name of the ELB>.
  cje               IN CNAME  <name of the ELB>.

OpenStack

  mesos-cje         IN A <IP of the lbaas instance>.
  marathon-cje      IN A <IP of the lbaas instance>.
  cje               IN A <IP of the lbaas instance>.

Then your cluster will be available at the following URLs:

  • http://cje.example.com/cjoc

  • http://marathon-cje.example.com

  • http://mesos-cje.example.com

A master named master-1 will be available as http://cje.example.com/master-1.

SSL Configuration

SSL termination can be configured at the controller level, by configuring the NGINX proxy server with the SSL certificates, or, on AWS, at the ELB level.

It is configured by setting protocol = https and one of the following options.

Controller Termination

Set router_ssl = yes and provide key and certificate files as nginx.key and nginx.cert respectively in the project directory.
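For example, assuming you already have a key and certificate pair for your domain (the source paths below are illustrative), copy them into the project directory under the expected names before applying the operation:

$ cp /etc/ssl/private/cje.example.com.key nginx.key
$ cp /etc/ssl/certs/cje.example.com.crt nginx.cert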

AWS ELB Termination

SSL certificates need to be configured in EC2 and provided via ssl_certificate_id using Amazon Resource Name (ARN) syntax.

CloudBees CI requires a certificate with multiple names: cje.example.com, mesos.cje.example.com, and marathon.cje.example.com for the base domain cje.example.com.

AWS IAM certificate example
ssl_certificate_id = arn:aws:iam::123456789012:certificate/some-certificate-name
AWS ACM certificate example
ssl_certificate_id = arn:aws:acm:us-east-1:123456789012:certificate/12345678-aaaa-bbbb-cccc-012345678901

If it is not possible to provide a certificate with multiple names, multiple certificates can be provided instead. The additional certificates can be set up via the ssl_certificate_id_mesos and ssl_certificate_id_marathon options, using the AWS ARN syntax.

SSL certificate example (three certificates with one name each)
ssl_certificate_id = arn:aws:acm:us-east-1:123456789012:certificate/12345678-aaaa-bbbb-cccc-012345678901
ssl_certificate_id_mesos = arn:aws:acm:us-east-1:123456789012:certificate/12345678-aaaa-bbbb-cccc-012345678902
ssl_certificate_id_marathon = arn:aws:acm:us-east-1:123456789012:certificate/12345678-aaaa-bbbb-cccc-012345678903
Restart controller

In the event of a controller failure, the controller can be replaced with the controller-restart operation.

$ cje prepare controller-restart
controller-restart is staged - review controller-restart.config and edit as needed - then run 'cje apply' to perform the operation.

Edit the controller-restart.config file and enter the controller name as the [server] name.
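A hypothetical controller-restart.config excerpt (controller-2 is a placeholder; use the name of the failed controller in your cluster):

[server]
## Name of the controller to terminate and restart
name = controller-2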

Then carry out the operation with cje apply.

This operation will terminate the specified controller and restart a new one. To avoid loss of data, perform this operation only on a multi-controller setup.

Restore a cluster

If an entire cluster fails or you need to re-create a destroyed cluster, you can use the cluster-recover operation to recover the cluster, as long as you still have the PROJECT directory.

If you do NOT have the PROJECT directory anymore, you will have to create a new cluster following the standard initialization steps. If using EBS storage on AWS, use the same cluster_name to recover CJOC and masters data.

$ cje prepare cluster-recover
cluster-recover is staged - review cluster-recover.config and edit as needed - then run 'cje apply' to perform the operation.

Edit the cluster-recover.config file before executing the operation to specify the configuration directory path to recover.

[pse]

## Cluster configuration directory path to recover
# path relative to the PROJECT directory
dna_path=.dna
...
  • In the case of a cluster failure, the configuration directory path (dna_path) will be the default .dna hidden path.

  • In the case of a recovery of a destroyed cluster, specify the destroyed dna path. The destroyed path is usually a hidden path like .dna-destroyed-DATESTAMP. You can list hidden paths with the ls -alt command.

  • If the recovery is done from a different machine or bastion host, access to the administrative ports needs to be updated via the tiger_admin_port_access parameter.

Then apply the operation with the cje apply command.

Migrate an entire EC2 cluster

When using AWS/EC2, you should use the EBS service for persistence:

[castle]
storage_server = ebs://

This means that all long-term state information is stored in EBS volumes and snapshots. When you run cje destroy, these volumes and snapshots are left in place.

You can use the EBS persistence feature to migrate between Amazon regions. Run cje destroy in one region, then tell Amazon (through the console or CLI) to copy all snapshots to your new region. You can then run the cje cluster-init operation with cluster-init.config changed to reflect the new region you want to run your cluster in (note that you will have to follow the initialization steps to set up DNS records).
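For example, a single snapshot can be copied with the AWS CLI (the snapshot ID and regions below are placeholders); repeat for each snapshot belonging to the cluster:

$ aws ec2 copy-snapshot --region us-west-2 \
    --source-region us-east-1 \
    --source-snapshot-id snap-0123456789abcdef0 \
    --description "CJOC volume snapshot"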

Updating access from selected IPs

Both the admin_port_access parameter, which controls admin access from selected IPs, and the user_port_access parameter, which controls user access from selected IPs, can be updated with the access-port-update operation.

$ cje prepare access-port-update
access-port-update is staged - review access-port-update.config and edit as needed - then run 'cje apply' to perform the operation.

Both access port parameters can contain one or more IP address ranges in CIDR notation separated by commas (for example, 192.0.2.0/24,198.51.100.1/32).

Use 0.0.0.0/0 to allow connections from any IP, or a /32 mask (for example, 198.51.100.1/32) to allow access from a single IP. Other CIDR network masks can be used to control wider ranges of IPs.
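A hypothetical access-port-update.config excerpt (the section name follows the [tiger] convention used elsewhere in this guide; verify the exact layout against the staged file):

[tiger]
## Admin access from these CIDR ranges
admin_port_access = 192.0.2.0/24,198.51.100.1/32

## User access from these CIDR ranges
user_port_access = 0.0.0.0/0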

Then carry out the operation with cje apply.

This operation on access port parameters only applies if your product is not already using network information that you supplied during initial configuration.

Updating Operations Center parameters

To update Operations Center parameters, use the cjoc-update operation.

$ cje prepare cjoc-update
cjoc-update is staged - review cjoc-update.config and edit as needed - then run 'cje apply' to perform the operation.

Edit the cjoc-update.config file and uncomment only the parameters you want to change; a sketch follows the list below.

This operation allows you to (see the cjoc-update.config file for a complete list of parameters):

  • Enable/disable the evaluation mode

  • Set the Operations Center container memory

  • Set the Operations Center application JVM options

  • Set the Operations Center workspace disk size

  • Set a custom Operations Center Docker image
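Purely as an illustration of the workflow, a staged file might be edited along these lines; the parameter names shown are hypothetical, and the real ones are listed in the generated cjoc-update.config:

cjoc-update.config excerpt (hypothetical parameter names)
[cjoc]
## Only uncomment the parameters you want to change.
# evaluation = false
# memory = 2048
jvm_options = -Xms512m -Xmx2048m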

Enabling/updating EC2 Container Registry (ECR) configuration

To update the ECR configuration, use the ecr-update operation.

$ cje prepare ecr-update
ecr-update is staged - review ecr-update.config and edit as needed - then run 'cje apply' to perform the operation.

Edit the ecr-update.config file to define the parameters you want to change.

This operation allows you to (see the ecr-update.config file for a complete list of parameters):

  • Enable usage of the default AWS EC2 Container Registry

  • Enable AWS EC2 Container Registry for specific accounts

Scripting cluster operations

CloudBees CI "operations" commands have a lifecycle. To execute an operation, you first stage (prepare) it. Staging lays down a configuration file that contains the input parameters of the operation.

By default, the config file is then edited by hand by the admin user. To facilitate scripting of CLI operations, there are two ways to specify operation values directly on the cje prepare command:

  1. Config/secrets file arguments

  2. Operation parameter arguments

Note that both types of arguments can be used together if necessary; in that case, the parameter arguments overwrite the values specified in the config file. Only config file parameters can be specified as arguments. If the operation requires secrets, a secrets file can be specified with the --secrets-file SECRETS-FILE argument.
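For example, to stage worker-add from a prepared config file while overriding the worker count on the command line (my-worker-add.config is a hypothetical file name; the two argument styles are described below):

$ cje prepare --config-file my-worker-add.config worker-add --worker.count 3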

Config/Secrets file arguments

To use a specific config file and/or secrets file for the operation, use the following options of the cje prepare command. See cje prepare -h for all options.

  • --config-file CONFIG_FILE

    • cje prepare --config-file CONFIG_FILE OPERATION

  • --secrets-file SECRETS-FILE for operations that require secrets inputs

    • cje prepare --secrets-file SECRETS-FILE OPERATION

Operation parameter arguments

Operation parameters can also be specified as arguments to the cje prepare command. Use cje prepare OPERATION -h to get the list of parameters available for the specified OPERATION.

For example, for the worker-add operation (add worker(s) to the cluster), the available arguments are:

$ cje prepare worker-add -h
usage: tiger prepare [-h] [-p DIR] [--config-file CONFIG-FILE]
                     [--secrets-file SECRETS-FILE]
                     [--aws.worker_instance_type VAL]
                     [--aws.worker_volume_size VAL]
                     [--worker.count VAL]
                     worker-add

Prepares an operation.

Prepared operations must be configured and then applied using the apply command.

  worker-add            Prepare worker-add.

optional arguments:
  -h, --help            show this help message and exit
  -p DIR, --project DIR
                        Directory containing the project files
  --config-file CONFIG-FILE
                        Use the specified config file
  --secrets-file SECRETS-FILE
                        Use the specified secrets file
  --aws.worker_instance_type VAL
                        The instance type of the worker to create.
  --aws.worker_volume_size VAL
                        The instance root volume size
  --worker.count VAL    Number of workers to add

For example, to add 2 workers using the m4.xlarge instance type on AWS, specify the worker count and the instance type as arguments:

$ cje prepare worker-add --worker.count 2 --aws.worker_instance_type m4.xlarge
worker-add is staged - review worker-add.config and edit as needed - then run 'cje apply' to perform the operation.
If cje apply fails, ensure that the machine has valid name servers configured. Running bees-pse apply without a valid name server configuration, or without the host(1) utility installed, will cause the script to fail when resolving addresses.

The worker-add.config file would be pre-populated with the values specified as arguments and ready for cje apply.

worker-add.config file content after the prepare command with arguments:

[worker]

## Number of workers to add
count = 2

[aws]

## The instance type of the worker to create.
# Leave empty to use default value
#
worker_instance_type = m4.xlarge

## The instance root volume size
# worker_volume_size =