KBEC-00484 - CloudBees CD System Outage Triage Guide

Article ID:360057301572

Summary

You are experiencing a system outage and need immediate assistance. This guide describes the information and logs you should provide to CloudBees Support when you first report the issue.

By following this guide, you will reduce communication cycles and make the overall debugging process easier.

General Questions

  • What changed in the environment that led to this system outage?

  • What is the business impact of this issue?

  • What version of CloudBees CD are you using?

  • What OS and Kernel version are the affected machines using?

  • Does this happen everywhere or just in one instance? Which instance(s)?

  • Is the problem intermittent or reproducible?

  • What are the reproduction steps (if applicable)?
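The OS and kernel details asked for above can be gathered with a short script run on each affected machine. This is a minimal sketch using Python's standard platform module; the function name describe_host is illustrative only and is not part of any CloudBees tooling.

```python
import platform


def describe_host() -> dict:
    """Return basic OS and kernel details for the machine running this script."""
    return {
        "system": platform.system(),    # OS family, e.g. "Linux" or "Windows"
        "release": platform.release(),  # kernel release on Linux
        "version": platform.version(),  # OS build or kernel version string
        "machine": platform.machine(),  # CPU architecture
    }


if __name__ == "__main__":
    for key, value in describe_host().items():
        print(f"{key}: {value}")
```

Paste the printed output into your support ticket alongside the CloudBees CD version.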

Log File Locations

It’s very likely that log analysis will be required to troubleshoot any system outage. Refer to the CloudBees KB article on default log file locations to find where each log is stored.

Installation or Upgrade Problems

You should send Support the following logs:

  • install.log or uninstall.log

    • These files may be located in /tmp; if so, the installer output lists the file path

  • commander.log if the installation is for the Server

  • agent.log and jagent.log if the installation is for the agent

Other areas to explore:

  • Might there be a disk space shortage on the Server or DB Server?

  • Check if there are special considerations for upgrading to the version in question that may have been missed.

  • Are you upgrading from a very old version to a newer version? If you are unsure, please detail both the starting version and final version (post-upgrade).

  • Do you have a database dump from before the upgrade that you can provide? This is not always required but sometimes CloudBees support will try to test this upgrade in-house.

  • Is the upgrade taking longer than expected?

    • Are you upgrading from a very old version to a newer version?

      • If so, there will be numerous schema updates being handled during this upgrade attempt

      • Check the commander.log files to confirm that progress is still being made. If it is, allow the upgrade more time to finish

      • Generally, it is a good practice to test the upgrade in a test environment first to estimate the upgrade duration

    • Did someone re-activate Change Tracking for a very large project?

  • Are you seeing constraint violation errors reported in the commander.log files?

  • Uninstall failing?

    • Typically a command prompt or file browser window has been left open in a directory under the install path. Close those windows and then repeat the uninstall.
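To judge whether a long-running upgrade is still making progress, and whether constraint violations are being logged, you can check when commander.log was last written and search it for a phrase. The sketch below assumes the log is a plain-text file; the function name and the default search phrase are hypothetical, and the exact error wording in your log may differ.

```python
import time
from pathlib import Path


def log_activity(log_path: str, phrase: str = "constraint violation") -> dict:
    """Report how recently a log was written and which lines mention a phrase."""
    path = Path(log_path)
    # Seconds since the file was last modified; a small value suggests
    # the server is still writing to the log.
    seconds_since_write = time.time() - path.stat().st_mtime
    matches = []
    with path.open("r", encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            # Case-insensitive match (assumed phrase, adjust as needed).
            if phrase.lower() in line.lower():
                matches.append((number, line.rstrip()))
    return {"seconds_since_write": seconds_since_write, "matches": matches}
```

Call it with the path to your own commander.log. If the file has not been written for a long time, or matches are returned, include that detail when you contact Support.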

Database Configuration

  • You will likely need to work with your DBA to help troubleshoot this issue

  • MySQL: Is the JDBC driver installed?

  • Is the database using UTF8 encoding?

  • Is the database using the right locale? Typically the ENGLISH language setting is needed for the database to work properly.

Plugin Issues

  • "My upgrade installed a newer version of plugin XX that’s not working anymore. How can I get the old version back?"

    • Only one version of a plugin is promoted at a time, so you should be able to demote the newest version and promote the older version again

    • If the older plugin version is not available, please inform CloudBees support

License Issues

  • Are you using a production license on a system that uses the built-in MariaDB database?

    • A production license requires a production DB

    • The built-in MariaDB database is not officially supported for production use with CloudBees CD

    • You may need to ask Support for a temporary "eval" license as a workaround until you are able to set up an external database

  • Migrating to a registered host license from a concurrent host license

    • All resources will be converted to "registered", which may be slow if there are a large number of resources in the system (>200).

    • You need to complete a pingAll operation against all endpoints after this license update, which can take time

  • Banner warning after applying license file

    • After applying the license, you may see a banner warning that the number of hosts exceeds what you are licensed for. All hosts will still work, but you cannot add any new hosts until you get below your registered host count

  • For any license issue not described above, please send your license and details of the issue to CloudBees Support to see if the issue can be reproduced

Certificate Setup Issues

  • There are a number of KB articles for various issues related to applying certificates that can be referenced based on the type of problem being encountered

  • If you are unable to find an article covering the specific certificate issue you’re facing, please inform CloudBees Support

Login Issues

  • Check if commanderServer service is running

    • Linux: Run /etc/init.d/commanderServer status

    • Windows: Check Services window

  • Check if LDAP/AD service is having any internal issues

  • Collect commander.log files covering a time when a login was attempted

  • If this is tied to a recent upgrade or system move, also describe how your passkey file was transferred and provide the version you moved from and the version you moved to
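On Linux, the service status check above can be scripted. The sketch below runs a status command and treats exit code 0 as "running"; that interpretation is an assumption based on common init-script conventions, so confirm it against the command's printed output on your system. The function name is illustrative only.

```python
import subprocess


def service_is_running(status_command: list) -> bool:
    """Run a status command and treat exit code 0 as 'running' (assumed convention)."""
    result = subprocess.run(status_command, capture_output=True, text=True)
    return result.returncode == 0


if __name__ == "__main__":
    # Command taken from this guide; run it on the CloudBees CD server host.
    print(service_is_running(["/etc/init.d/commanderServer", "status"]))
```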

Disk Space

  • Check whether any of the suggestions in KBEC-00104 apply

  • Check whether the system has enough disk space

  • Depending on the error type, this kind of problem could be a space issue on the CloudBees CD Server or a workspace issue on the agents involved. In either case, clearing out older log files may help stabilize things

  • Server

    • CloudBees CD users can sometimes underestimate the amount of disk space the commander.log files will require

    • Default collection is 30 days, which can result in many GB if the server has started being used more heavily

    • Our Disk Usage documentation describes how you can configure log rotation to limit the amount of disk space used for logging.

  • Agent

    • Workspaces are typically pointing to some designated filer space

    • Some users may let this grow for 1-2 years before seeing a problem, with no effort to clean it up

    • Deleting older log files is usually safe

    • Inform CloudBees support what your cleanup policies are, and what you are seeing, to receive some suggestions on how to control your disk space demands
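A quick way to quantify the disk situation on a server or agent is to compare free space on the volume with the space taken by log files in a directory. This is a minimal sketch; the function name and the default "*.log*" file pattern are assumptions, so adjust the pattern to match how your log files are actually named.

```python
import shutil
from pathlib import Path


def disk_report(directory: str, pattern: str = "*.log*") -> dict:
    """Summarize free space on the volume and bytes used by matching log files."""
    usage = shutil.disk_usage(directory)  # total, used, free in bytes
    log_bytes = sum(
        p.stat().st_size
        for p in Path(directory).glob(pattern)  # top-level files only
        if p.is_file()
    )
    return {
        "free_bytes": usage.free,
        "percent_used": 100 * usage.used / usage.total,
        "log_bytes": log_bytes,
    }
```

Run it against the directory that holds your commander.log files or against an agent workspace, and share the numbers with CloudBees Support together with your cleanup policy.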

Agents Unreachable

  • Was the server recently restarted?

  • Have you tried to use the ping operation from the UI or API?

  • Can you telnet from the server to the agent machine directly? If not, this may indicate a network issue

  • Has a separate server been recently activated that is using the same agents?
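If telnet is not installed on the server, an equivalent TCP reachability check can be done with Python's standard socket module. This is a sketch only; the function name is illustrative, and the host name and port in the usage example are placeholders that you should replace with your agent's actual values.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Placeholder host and port (assumptions); use your agent's real values.
    print(can_connect("agent.example.com", 7800))
```

A False result from the server host, combined with a True result from another machine, points toward a firewall or routing problem rather than an agent fault.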

Performance Problems

  • Check whether any of the suggestions in KBEC-00480 apply

  • Provide the following logs:

    • commander.log

    • commander-service.log

    • agent.log and jagent.log if slowness is on the agent side
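To send the logs listed above in a single upload, they can be bundled into one compressed archive. The sketch below uses Python's standard tarfile module; the function name is hypothetical, and you supply the log paths yourself because they depend on your installation.

```python
import tarfile
from pathlib import Path


def bundle_logs(log_paths: list, archive_path: str) -> list:
    """Add each existing log to a .tar.gz archive; return the paths that were missing."""
    missing = []
    with tarfile.open(archive_path, "w:gz") as archive:
        for raw in log_paths:
            path = Path(raw)
            if path.is_file():
                # Store under the bare file name so the archive has no deep paths.
                archive.add(path, arcname=path.name)
            else:
                missing.append(str(path))
    return missing
```

Because files are stored under their bare names, logs with the same name from different machines should be bundled into separate archives. Mention any paths returned as missing when you open the support ticket.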