KBEC-00508 - An overview of the getServerStatus output

Issue

The ectool getServerStatus --diagnostics 1 output contains lots of information - how can I use this?

Environment

CloudBees CD (CloudBees Flow) versions 10.3 and below.

Solution

The API for getServerStatus has 2 purposes:

A) Simple Server Status Check:

If you run getServerStatus directly from a machine that has the CD server software installed, and you have not logged in using ectool login <username> <password>, you will receive a simple status output with a value of either 1 (up and running) or 0 (not up).

B) Deliver Diagnostic information:

Here is a summary of the sections of output provided and how to read this information. All data presented here is one of three kinds: fixed data, a snapshot of the current point in time, or cumulative data collected since this CD server was last restarted.

1) ACL Statistics

A brief set of data concerning the caching of ACLs
This is not likely to be relevant in most cases

2) Agents

This section provides information about recently used agents (cachedAgents) and their current status
This section may be useful in rare cases when trying to understand the current status of an agent inside the system

3) apiMonitor

This section delivers information about the longest Call, active API calls, and recent API calls
This is not likely to be relevant in most cases, but could be used when hunting down jobs that appear to be stuck

4) clusterManager

This section provides data on ActiveMQ content
This is not likely to be relevant in most cases

5) clusterSingletonServices

This information identifies which server nodes are running which of the singleton services employed by the CD service. On a single node system, these will all be on the one node, but in an HA or multi-node CD system, these services are distributed across the running nodes to help balance system workload.
This information can be useful for a user in identifying which node is running a service, like the backgroundDeleter or the step scheduler.

6) entityCounts

This section lists each table used in the CD database schema and the number of elements stored therein.

eg: This block of data shows that there are over 0.5M jobs in the current database:

          <entityCount>
            <name>job</name>
            <value>523819</value>
          </entityCount>

These metrics can be useful to users in understanding the overall scope of content being stored in the system, or, when compared with past collections, in understanding the extent to which areas of the database may be growing. Such questions may come into play when performance changes are being observed over time. Reviewing this data can help to recognize how trimming certain data types may provide some performance improvements.
When viewing this data, relative sizing can be more important than the sheer number.
eg: 1M properties may not be a concern, as each property is essentially a leaf node, whereas 10K pipelines might have some performance impact when viewing pipelines in some environments, since a pipeline contains numerous child objects (stages, tasks, properties, parameters, etc…).

Some of the more common elements to take note of would be:

1) entityChange - the # of items stored for the Change Tracking feature
2) flowRuntime - the # of pipeline runs (as opposed to "pipeline", which shows the # of pipeline definitions)
3) job - the # of jobs
4) jobStep - the # of jobSteps across all jobs
5) property - the # of properties stored
6) propertySheet - the # of propertySheets, or folders that collect other child properties. Comparing the ratio against the property count can be informative in terms of how teams are storing data
7) resource - the # of defined resources currently in the database
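
As an illustration of the comparisons described above, here is a minimal Python sketch that pulls the entityCount entries out of saved diagnostics output, computes the property to propertySheet ratio, and compares two snapshots. It assumes the output has been saved as XML text. The enclosing entityCounts element and every count other than the job value are hypothetical, and the helper names entity_counts and growth are introduced here purely for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt: only the job count is taken from the example above;
# the wrapper element and the other two values are assumptions.
SAMPLE = """
<entityCounts>
  <entityCount><name>job</name><value>523819</value></entityCount>
  <entityCount><name>property</name><value>1200000</value></entityCount>
  <entityCount><name>propertySheet</name><value>300000</value></entityCount>
</entityCounts>
"""


def entity_counts(xml_text):
    """Return {entity name: count} for every <entityCount> element found."""
    root = ET.fromstring(xml_text)
    counts = {}
    # iter() searches at any depth, so the parent element name does not matter.
    for node in root.iter("entityCount"):
        name = node.findtext("name")
        value = node.findtext("value")
        if name is not None and value is not None:
            counts[name.strip()] = int(value)
    return counts


def growth(old, new):
    """Return {name: new - old} for entities present in both snapshots."""
    return {name: new[name] - old[name] for name in new if name in old}


counts = entity_counts(SAMPLE)
ratio = counts["property"] / counts["propertySheet"]
print(f"jobs: {counts['job']}, properties per propertySheet: {ratio:.1f}")
```

Running entity_counts on an older and a newer collection and passing both results to growth gives a quick view of which tables are growing fastest.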

7) entityLocks

This data is typically more of interest to the CloudBees engineering team

8) environment

This section lists relevant system environment variables picked up at the time the server was started

9) lazywiredPostProcessor

This data is typically more of interest to the CloudBees engineering team

10) license

This lists some of the key items recorded inside the license file and current usage

11) memory

This lists some data useful in summing up garbage collection events since the server was last started

12) memoryPools

This data maps to the same data shown in the memoryGuard output in the typical log output (every 15 minutes by default)

13) messageServiceStats

Some statistics more relevant to the CloudBees Engineering team

14) propertyTracker

This section identifies any properties that are currently being monitored
These are most typically associated with logic used in a precondition, or a retry setup

15) quartzScheduler

Lists schedules that are being monitored for use with Sentry

16) queue

Various queues used in the system and their default sizes

17) serverInfo

A snapshot of some system settings

18) services

Lists the status of the numerous services that sit "under the covers" for the CD system

19) sessionManagerStatistics

Data more relevant to the CloudBees Engineering team

20) settings

Lists various system setting values
This same list can be found on the Server Administration Settings of the UI

21) statistics

This section can be beneficial when exploring general performance concerns
This section contains 2 sub-sections:

a) All timers by name:

Sample:

Name                                          Count       Mean     StdDev        Min        Max          Sum
---------------------------------------- ---------- ---------- ---------- ---------- ---------- ------------
setProperty.queue                              6692     86.454    399.556      0.000   4292.000   578551.000
setProperty.perform                            7715     79.359    124.705      3.000   4869.000   612251.000
setProperty.post                               7715      0.147      9.171      0.000    630.000     1132.000
setProperty.last                               7715      0.000      0.000      0.000      0.000        0.000

The count represents the number of times this operation occurred. The other columns are the times involved, represented in milliseconds.

The columns of most interest tend to be a combination of the Count and Mean entries.

Values over 1000 (1s) are typically a starting point for potential further exploration. However, if the Count for such items is very small, these may be startup operations that are not a concern, or operations so rarely executed that they would not be considered a major drag on a running system.

eg: Commands like setProperty taking >1s on average would point to some issue with how the DB is used, or perhaps that the underlying network environment is performing poorly.

From the sample above, the .perform number is typically the one that is most telling of system performance.

If the mean times of numerous operations seem large, this may suggest an overall DB issue or system load issue, which may require further analysis to uncover ways to mitigate the problem.
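
The screening rule described above (a high Mean combined with a non-trivial Count) can be sketched in Python as follows. This is a minimal sketch that assumes the "All timers by name" table has been copied into a string with whitespace-separated columns, as in the sample. The function names parse_timers and slow_timers and the min_count cut-off of 10 are hypothetical choices, not values defined by the product.

```python
SAMPLE = """\
Name                                          Count       Mean     StdDev        Min        Max          Sum
---------------------------------------- ---------- ---------- ---------- ---------- ---------- ------------
setProperty.queue                              6692     86.454    399.556      0.000   4292.000   578551.000
setProperty.perform                            7715     79.359    124.705      3.000   4869.000   612251.000
setProperty.post                               7715      0.147      9.171      0.000    630.000     1132.000
setProperty.last                               7715      0.000      0.000      0.000      0.000        0.000
"""


def parse_timers(text):
    """Parse timer rows into dicts; header and separator lines are skipped."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 7:
            continue
        try:
            count = int(parts[1])
            mean, stddev, low, high, total = (float(p) for p in parts[2:])
        except ValueError:
            continue  # not a data row (for example the header or the dashes)
        rows.append({"name": parts[0], "count": count, "mean": mean,
                     "stddev": stddev, "min": low, "max": high, "sum": total})
    return rows


def slow_timers(rows, mean_ms=1000.0, min_count=10):
    """Return rows whose mean exceeds mean_ms and that ran at least min_count times.

    min_count is an arbitrary illustrative cut-off for ignoring rare operations.
    """
    return [r for r in rows if r["mean"] > mean_ms and r["count"] >= min_count]


rows = parse_timers(SAMPLE)
for row in slow_timers(rows):
    print(f"{row['name']}: mean {row['mean']:.0f} ms over {row['count']} calls")
```

With the sample above, no timer has a mean over 1000 ms, so nothing is flagged. Lowering mean_ms is a simple way to list the next tier of candidates.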

b) Timers by roots:

This is similar to the section above, but it provides a more detailed breakdown of sub-elements involved with the operation.
This extra level of detail can be useful in identifying whether the CD software or some 3rd party element may be running imperfectly, or hampered in some way:

eg: Below we can see that the getGroup operation is taking on average close to 15s to complete. However, by looking into the breakdown, we can also identify that the bulk of that average time is being spent waiting for the company LDAP service to return with GroupMembers data:

| | | getGroup.perform                                               92  14884.630  61249.004      0.000 349717.000  1369386.000
| | | | SessionAspect                                                92      0.000      0.000      0.000      0.000        0.000
| | | | PermissionAspect.validatePermission                          84     36.536    212.185      3.000   1489.000     3069.000
| | | | | PermissionAspect.findPermissionAnnotations                 84      0.000      0.000      0.000      0.000        0.000
| | | | | PermissionAspect.evaluate                                  83      0.000      0.000      0.000      0.000        0.000
| | | | | provider.companyLDAP.findFilteredGroups                       2   1374.500    146.371   1271.000   1478.000     2749.000
| | | | LicenseAspect                                                84      0.000      0.000      0.000      0.000        0.000
| | | | provider.companyLDAP.findFilteredGroups                        18   1454.889     98.847   1195.000   1573.000    26188.000
| | | | PermissionAspect.evaluate                                    83      0.000      0.000      0.000      0.000        0.000
| | | | provider.companyLDAP.findGroupMembers                          83  16098.325  64135.254   3426.000 348223.000  1336161.000
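
To make the reading of this breakdown concrete, the following Python sketch computes what share of the root operation's Sum each direct child accounts for. It is a sketch under assumptions: it treats the number of leading "|" markers as the nesting depth and assumes the columns are whitespace-separated as in the sample. The function names parse_tree and child_shares are hypothetical.

```python
SAMPLE = """\
| | | getGroup.perform                                               92  14884.630  61249.004      0.000 349717.000  1369386.000
| | | | SessionAspect                                                92      0.000      0.000      0.000      0.000        0.000
| | | | PermissionAspect.validatePermission                          84     36.536    212.185      3.000   1489.000     3069.000
| | | | | PermissionAspect.findPermissionAnnotations                 84      0.000      0.000      0.000      0.000        0.000
| | | | | PermissionAspect.evaluate                                  83      0.000      0.000      0.000      0.000        0.000
| | | | | provider.companyLDAP.findFilteredGroups                       2   1374.500    146.371   1271.000   1478.000     2749.000
| | | | LicenseAspect                                                84      0.000      0.000      0.000      0.000        0.000
| | | | provider.companyLDAP.findFilteredGroups                        18   1454.889     98.847   1195.000   1573.000    26188.000
| | | | PermissionAspect.evaluate                                    83      0.000      0.000      0.000      0.000        0.000
| | | | provider.companyLDAP.findGroupMembers                          83  16098.325  64135.254   3426.000 348223.000  1336161.000
"""


def parse_tree(text):
    """Parse 'Timers by roots' lines into dicts with depth, name, count, mean, sum."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        depth = 0
        while depth < len(parts) and parts[depth] == "|":
            depth += 1  # each leading "|" is taken as one nesting level
        fields = parts[depth:]
        if len(fields) != 7:
            continue
        try:
            row = {"depth": depth, "name": fields[0], "count": int(fields[1]),
                   "mean": float(fields[2]), "sum": float(fields[6])}
        except ValueError:
            continue  # not a data row
        rows.append(row)
    return rows


def child_shares(rows, root_name):
    """Return {child name: fraction of the root's Sum} for direct children only."""
    shares = {}
    root = None
    for row in rows:
        if root is None:
            if row["name"] == root_name:
                root = row
            continue
        if row["depth"] <= root["depth"]:
            break  # left the root's subtree
        if row["depth"] == root["depth"] + 1 and root["sum"] > 0:
            shares[row["name"]] = shares.get(row["name"], 0.0) + row["sum"] / root["sum"]
    return shares


rows = parse_tree(SAMPLE)
shares = child_shares(rows, "getGroup.perform")
top = max(shares, key=shares.get)
print(f"{top} accounts for {shares[top]:.1%} of getGroup.perform time")
```

Comparing Sum rather than Mean keeps the shares consistent when children are called a different number of times than the root.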

22) stringExpanderStatistics

Typically useful for the CloudBees Engineering team
Identifies the number of times a string is being expanded leveraging a cache

23) systemProperties

These items map to data picked up at system boot time from configuration files or configured at the time of installation

24) threadDump

In cases where a hanging event is seen inside the system, this data may be useful to the CloudBees engineering team

25) transactionRetries

Data on the number of transactionRetries in the database

26) transactionState

Data on the number of transactions performed by the database

27) versionHistory

The list of historical versions that were used on this system prior to the current version in use
This can help expose the upgrade path taken with this particular CD server machine

28) messages

Typically useful for the CloudBees Engineering team

29) serverState

Typically just "running" when viewing this full output

30) serverVersion

Details on the current version in use