Configuring Disaster Recovery and Recovering from a Disaster

This topic describes the steps to configure a Disaster Recovery (DR) setup and explains what to expect after a successful failover.

Disaster Recovery Environment Setup

As part of a DR environment setup, you need to set up a secondary CloudBees Flow site. This site must include a complete setup of all CloudBees Flow components including:

  • Database (Oracle, SQL Server, or MySQL)

  • CloudBees Flow server

  • Web server

  • DevOps Insight Server

  • Repository server

  • Zookeeper

  • Gateway agent

  • Agent

CloudBees Flow components that can be load balanced include the CloudBees Flow server, web server, DevOps Insight server, repository server, and gateway agent. Go to Installing and Configuring a Load Balancer for details.

Along with replicating the component setup, the following data stores also need to be replicated and made available for CloudBees Flow to operate:

  • Database (using the vendor-recommended replication)

  • DevOps Insight server (by restoring snapshots of Elasticsearch indices)

  • Repository Server Data Store (replication of shared file locations where artifacts are stored)

  • Plugins (typically copied to a shared file location)

  • Certificates signed by a CloudBees Flow CA (Certificate Authority)

Configurations and Settings

Follow these guidelines on the primary and secondary sites:

  • Both the primary and secondary sites should be running the same version of CloudBees Flow.

  • Under normal operation, it should not be possible to do active transactions directly on the secondary (or replicated) database. That is, the secondary database should be in replication mode.

  • The CloudBees Flow server in the secondary site must be configured and must point to the secondary (or replicated) database.

  • The following servers must not automatically restart after a reboot. That is, all CloudBees Flow servers in the secondary site should be set to start in Manual mode (an example of disabling automatic startup appears after this list). This prevents inadvertent write operations into the replicated database. Details on the recommended steps for setting up a secondary site appear below.

    • CloudBees Flow server

    • Web server

    • DevOps Insight server services (Elasticsearch service and Logstash service)

  • We recommend using DNS Failover to minimize the downtime when moving from the primary site to the secondary site. This allows end users accessing the web servers to continue using the same URL. Agents that are running jobs can send their finish job notification to the secondary server, allowing the jobs to succeed.
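
The following commands illustrate one way to disable automatic startup for the secondary site's services, as recommended above. The service names shown are assumptions and may differ in your installation; adjust them to match the services registered on your hosts.

Linux (systemd, or chkconfig on SysV-init systems):

    # Prevent the CloudBees Flow services from starting automatically at boot (service names are assumptions)
    sudo systemctl disable commanderServer
    sudo systemctl disable commanderApache

    # On SysV-init systems, use chkconfig instead
    sudo chkconfig commanderServer off
    sudo chkconfig commanderApache off

Windows (run from an elevated command prompt):

    rem Set the services to start manually instead of automatically (service names are assumptions)
    sc config CommanderServer start= demand
    sc config CommanderApache start= demand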

To configure Disaster Recovery, follow these steps:

  1. Add the following lines to the wrapper.conf file for the server nodes:

    wrapper.java.additional.1600=-DCOMMANDER_IGNORE_SERVER_MISMATCH=1
    wrapper.java.additional.1601=-DCOMMANDER_PRESERVE_SESSIONS=1

    This avoids the following errors during the failover to the secondary site:

    20160815T22:10:52.406 | 10.0.2.206 | DEBUG | bootstrap | | schemaMaintenance | OperationInvoker | Exception: InvalidServer: The ZooKeeper/Exhibitor setting of the last cluster ('ip100179155:8080,ip100145251:8080,ip100133239:8080') to connect to the database is different from the ZooKeeper/Exhibitor setting of the current cluster ('ip1008088:8080,ip100196103:8080,ip100515:8080'). Check that the cluster is configured for the correct database. To allow this server cluster to become the new owner for this data, update the database configuration with the ignoreServerMismatch flag set.
    20160815T22:10:52.406 | 10.0.2.206 | WARN | bootstrap | | | | ServerStatus | InvalidServer: The ZooKeeper/Exhibitor setting of the last cluster ('ip100179155:8080,ip100145251:8080,ip100133239:8080') to connect to the database is different from the ZooKeeper/Exhibitor setting of the current cluster ('ip1008088:8080,ip100196103:8080,ip100515:8080'). Check that the cluster is configured for the correct database. To allow this server cluster to become the new owner for this data, update the database configuration with the ignoreServerMismatch flag set.
  2. Make sure that the following recommended standard setup steps are performed:

    1. commander.properties in ZooKeeper should have the COMMANDER_SERVER_NAME set to the FQDN (Fully Qualified Domain Name) of the load balancer of the CloudBees Flow server cluster. This can be set using the following command:

      ecconfigure --serverName <FLOW_SERVER_LOAD_BALANCER_FQDN>
    2. Both the primary and secondary sites should have the same files (including content):

      • keystoreFile

      • passkeyFile

      • commander.properties

        In the cluster setup, these files are stored in their respective ZooKeeper instances (the primary and secondary instances).

        The ZooKeeper connection is configured using:

          ecconfigure --serverZooKeeperConnection <ZooKeeper_servers_comma_separated_list>

        For example:

          ecconfigure --serverZooKeeperConnection ip1008088:2181,ip100196103:2181,ip100515:2181
    3. Run the following command on the web servers to ensure that the COMMANDER_SERVER property in the httpd.conf file is set to CloudBees Flow server’s load balancer FQDN:

      ecconfigure --webTargetHostName <FLOW_SERVER_LOAD_BALANCER_FQDN>
      The httpd.conf file is usually in apache/conf on a Linux machine and ProgramData\Electric Cloud\ElectricCommander\apache\conf on a Windows machine.

The --webTargetHostName argument modifies the CloudBees Flow web server configuration and therefore also attempts to restart the CloudBees Flow web server. If you used the ecconfigure command without sudo as recommended, the commanderApache service does not restart and an error is produced; you must restart the service manually afterward using sudo. You can also use the --skipServiceRestart argument to avoid the ecconfigure command's restart attempt and the error message.

    4. Similarly, on each repository server, run the following command to set COMMANDER_HOST in the server.properties file:

      ecconfigure --repositoryTargetHostName <FLOW_SERVER_LOAD_BALANCER_FQDN>

      NOTE: These are the default locations for the server.properties file:

      • In Linux, /opt/electriccloud/electriccommander/conf/repository/server.properties

      • In Windows, C:\ProgramData\Electric Cloud\ElectricCommander\conf\repository\server.properties
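
      For illustration only, after the command above runs, the resulting entry in the repository server's server.properties should resemble the following (the host name shown is a placeholder for your load balancer FQDN):

        # server.properties on the repository server (value shown is a placeholder)
        COMMANDER_HOST=flow-server-lb.example.com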

  1. Set the "Server IP address" CloudBees Flow server property to the CloudBees Flow server load balancer FQDN (the same value as <FLOW_SERVER_LOAD_BALANCER_FQDN>).

  2. Set the “Stomp Client URI” server property to stomp+ssl://<FLOW_SERVER_LOAD_BALANCER_FQDN>:61613.

    If the load balancer does SSL termination, uncheck the Use SSL for Stomp option. All the CloudBees Flow server nodes must be restarted for this change to take effect, because the check box is selected by default.
  3. On the Cloud > Resource page, register all the CloudBees Flow local agents (the ones that run on the same machine as the CloudBees Flow server) from both the primary and secondary sites. Make sure that for the cluster setup, you do not have local resources.

  4. Create resource pools named "local" and "default". Add the CloudBees Flow local agents to both pools.

    Both the local and default pools are used by the CloudBees Flow standard job processing. For example, sentry jobs run on local pool resources.

  5. (Optional) If trusted agents are used, then in addition to the keystore file, the conf/security folder should be copied to the secondary site's ZooKeeper. This folder stores the CloudBees Flow Certificate Authority information along with the certificates that are signed by CloudBees Flow.

    Perform the following steps to copy this folder from the primary to the secondary site:

    1. Log into the primary site's CloudBees Flow server, and run the following command to get the conf/security folder from the primary ZooKeeper into the local /tmp/CloudBees Flow Automation Platform/conf/security folder:

    • Linux:

      COMMANDER_ZK_CONNECTION=<ZooKeeper_Primary_Server_IP>:2181 <install_dir>/jre/bin/java -cp <install_dir>/server/bin/zk-config-tool-jar-with-dependencies.jar com.electriccloud.commander.zkconfig.ZKConfigTool --readFolder /commander/conf/security /tmp/CloudBees Flow Automation Platform/conf/security

    • Windows:

      "C:\Program Files\Electric Cloud\ElectricCommander\jre\bin\java.exe" -DCOMMANDER_ZK_CONNECTION=<ZooKeeper_Primary_Server_IP>:2181 -jar "C:\Program Files\Electric Cloud\ElectricCommander\server\bin\zk-config-tool-jar-with-dependencies.jar" com.electriccloud.commander.cluster.ZKConfigTool --readFolder /commander/conf/security c:\<path>\CloudBees Flow Automation Platform\conf\security …​ Log into the secondary site’s CloudBees Flow server, and run the following command to upload the conf/security folder from the local folder to ZooKeeper:

    • Linux:

      COMMANDER_ZK_CONNECTION=<ZooKeeper_Secondary_Server_IP>:2181 <install_dir>/jre/bin/java -cp <install_dir>/server/bin/zk-config-tool-jar-with-dependencies.jar com.electriccloud.commander.zkconfig.ZKConfigTool --writeFolder /commander/conf/security /tmp/CloudBees Flow Automation Platform/conf/security

    • Windows:

      "C:\Program Files\Electric Cloud\ElectricCommander\jre\bin\java.exe" -DCOMMANDER_ZK_CONNECTION=<ZooKeeper_Secondary_Server_IP>:2181 -jar "C:\Program Files\Electric Cloud\ElectricCommander\server\bin\zk-config-tool-jar-with-dependencies.jar" com.electriccloud.commander.cluster.ZKConfigTool --writeFolder /commander/conf/security c:\<path>\CloudBees Flow Automation Platform\conf\security . Set the repository data store on the primary and secondary sites.

      Repository servers set up on the primary and secondary sites can share the same repository data store.

      Alternatively, the repository data can be replicated, and the repository servers can point to their respective data store locations. For each repository server, set REPOSITORY_BACKING_STORE in the server.properties file to a UNC path on a network share on the file server. For example:

      REPOSITORY_BACKING_STORE=//10.0.109.72/repo_data/repositorydata
      1. Register the repository server in the CloudBees Flow UI. If a repository server cluster is set up, use the load balancer URL (for example, https://<REPOSITORY_SERVER_LOAD_BALANCER_FQDN>:8200).

        During the failover to the secondary site, the FQDN should point to the repository servers in the secondary site.

      2. It is recommended that the plugins folder on the network share be accessible from the remote web servers, as described in Universal Access to the Plugins Directory.

      3. For the initial installation and setup of the secondary site, perform the following recommended steps:

  6. Set up the secondary database in normal or nonreplicated mode.

  7. Follow the instructions described in this installation guide to set up all the servers, including the CloudBees Flow, web, repository, and ZooKeeper servers.

  8. Make sure that the database.properties file is set up to point to the correct secondary site’s database server. This file will be stored in ZooKeeper when the cluster setup is used. For the primary site, it should point to the primary database. For the secondary site, it should point to the secondary database.

  9. Ensure that all the servers are running properly, including the connection to the database. At this time, the secondary site is not set up for replication, and operates as a separate installation of CloudBees Flow.

  10. Before setting up the secondary site’s database in replication mode, shut down all secondary CloudBees Flow servers, web servers, repository servers, and others. After these steps are performed, there should not be any write transactions to the secondary database.

  11. The first time that the secondary database is set up, a schema for the database tables is created. Before proceeding to set up this database for replication, this schema may have to be deleted. This avoids a schema name conflict when replication is enabled.

  12. Based on the disaster recovery option chosen for the database, set up the secondary database in replication mode.

Disaster Recovery Environment Setup for DevOps Insight Server

Follow the steps below to include the DevOps Insight server in your disaster recovery environment setup.

  1. Set up identical DevOps Insight server installations on the primary and secondary sites.

  2. Ensure that the CloudBees Flow server on the primary site is configured to point to the FQDN (Fully Qualified Domain Name) of the load balancer of the DevOps Insight server cluster in the primary site. Similarly, the CloudBees Flow server on the secondary site must be configured to point to the FQDN of the load balancer of the DevOps Insight server cluster in the secondary site.

  3. Use snapshots to create backups of indices from the primary site at regular intervals, for example, daily. The backups can be stored in a shared file system or on AWS S3 storage. See the section DevOps Insight Server Elasticsearch Data Backups for details on creating snapshots; a brief command sketch appears after this list.

  4. Restore the snapshots to the DevOps Insight server cluster running on the secondary site at regular intervals, for example, daily or weekly. You need to start the Elasticsearch service for the DevOps Insight server on the secondary site to restore the snapshots. You may choose to move the snapshot files to a different location for backup and archiving purposes once they have been restored on the secondary site.
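
For illustration, the following commands sketch one way to create and restore snapshots with the standard Elasticsearch snapshot API (steps 3 and 4 above). The repository name (flow_dr_backup), the shared file system path, the host names, the port, and any authentication options are assumptions; adjust them to your environment and see DevOps Insight Server Elasticsearch Data Backups for the supported procedure.

    # On the primary site: register a shared-file-system snapshot repository
    # (the location is an assumption and must be listed in path.repo in elasticsearch.yml)
    curl -X PUT "https://<PRIMARY_INSIGHT_FQDN>:9200/_snapshot/flow_dr_backup" \
         -H "Content-Type: application/json" \
         -d '{"type": "fs", "settings": {"location": "/mnt/es_backups/flow_dr"}}'

    # On the primary site: take a snapshot of all indices (run daily, for example)
    curl -X PUT "https://<PRIMARY_INSIGHT_FQDN>:9200/_snapshot/flow_dr_backup/snapshot_$(date +%Y%m%d)?wait_for_completion=true"

    # On the secondary site (with its Elasticsearch service running): restore a snapshot;
    # indices that already exist with the same names must be closed or deleted first
    curl -X POST "https://<SECONDARY_INSIGHT_FQDN>:9200/_snapshot/flow_dr_backup/<SNAPSHOT_NAME>/_restore"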

Steps to Perform During a Disaster Recovery Failover

When a disaster event happens that interrupts the operations on the primary site, follow these steps to move the operations to the secondary site.

  1. Shut down any services that might still be running on the primary site. With the exception of the DevOps Insight server, all other components, including the database, CloudBees Flow server, and repository server, can be shut down. Doing this ensures that no more transactions happen on the primary site. For the DevOps Insight server, see Disaster Recovery Failover Steps for DevOps Insight Server below.

  2. Begin switching operations to the secondary site by restoring and updating the secondary site’s database with the latest data. The steps to do this may vary based on the disaster recovery method used for database.

  3. Delete the brokerdata folder on the CloudBees Flow server nodes. For example, in Windows, delete the C:\ProgramData\Electric Cloud\ElectricCommander\brokerdata folder.

  4. Follow the DNS failover procedure, and update the DNS entries to point to servers in the secondary site. This includes updating the entries for servers including web server, CloudBees Flow server, and repository server. There may also be other servers based on configuration.

  5. Bring up all the servers and infrastructure in the secondary site, with the exception of the DevOps Insight server. Start the CloudBees Flow services on the different machines, including the CloudBees Flow server, web server, repository server, and gateway agents. For the DevOps Insight server, see Disaster Recovery Failover Steps for DevOps Insight Server below.

  6. Based on the nature of the disaster event, certain active operations running on the primary site may be interrupted, and need to be restarted. For example, a build or deploy application process may fail and error out. Use the CloudBees Flow UI to review such failures, and take the appropriate corrective actions, usually by executing those failed processes again.

  7. The secondary site’s database now acts as the master. We recommend that you set up a database that will act as a slave.

Disaster Recovery Failover Steps for DevOps Insight Server

  • If the Elasticsearch service for the DevOps Insight server is still running on the primary site, then create a final snapshot before shutting down the service.

  • Restore any snapshots created since the last scheduled restoration on the secondary site.

The secondary site’s DevOps Insight server now acts as the primary cluster. We recommend that you set up a schedule for creating snapshots from this cluster and restoring them into another cluster (see the example below).
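
As a sketch, snapshot creation can be scheduled with cron on the new primary site. The host name and repository name below are placeholders, and the same snapshot repository as in the earlier example is assumed:

    # Hypothetical crontab entry: snapshot the Insight indices every day at 01:00
    # (% characters must be escaped in crontab entries)
    0 1 * * * curl -s -X PUT "https://<INSIGHT_SERVER_FQDN>:9200/_snapshot/flow_dr_backup/snapshot_$(date +\%Y\%m\%d)?wait_for_completion=true"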

Server Maintenance

See Maintenance for details on CloudBees Flow server maintenance.