KBEC-00451 - TCP port connectivity errors affecting clustered servers

Issue

This issue can present itself in different ways, for example:

When you start up a third node of a cluster, the other two nodes restart unexpectedly.
When you start up a specific combination of two nodes, they fail to form a cluster, and act as standalone instances.

To confirm this issue, check the commander.log file (the log paths are documented here) for an error message like the following:

2020-06-13T12:46:34.854 | cmdr2 | ERROR | .ActiveMQServerImpl$3@79d6051) | | | | client | AMQ214016: Failed to create netty connection io.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: /${IP_ADDRESS}:5445

Note that the port number at the end of that log line (:5445 in this example) may instead be any of the other ports used for Zookeeper/Hornetq/ActiveMQ/JGroups communication, which are ports 5445 through 5449 as documented in Port usage.
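As a rough sketch of how to find which hosts and ports appear in these errors, the helper below filters a log file for AMQ214016 lines and prints each distinct host:port target. The function name unreachable_targets is hypothetical, and the log file path is supplied by the caller, since it depends on your installation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the distinct host:port targets that appear in
# AMQ214016 "No route to host" errors in the given log file.
unreachable_targets() {
  local log_file="$1"
  grep 'AMQ214016' "$log_file" \
    | grep -o 'No route to host: /[^ ]*:[0-9]*' \
    | sed 's|No route to host: /||' \
    | sort -u
}
```

Call it as `unreachable_targets /path/to/commander.log`, replacing the path with the location of commander.log on your server.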

Environment

CloudBees CD (CloudBees Flow) running in clustered mode (more than one server) on Traditional Platforms

Resolution

Port numbers 5445 through 5449 are used for Zookeeper/Hornetq/ActiveMQ/JGroups communication as documented in Port usage, and there is most likely a networking issue between the servers on those ports.

To confirm which port is blocked between which servers:

Log in to each of the servers in the cluster (e.g., ssh into each server).
Run the following command to test connectivity from each server to all other servers in the cluster on these ports:

telnet $OTHER_SERVER_HOSTNAME_OR_IP 5445
telnet $OTHER_SERVER_HOSTNAME_OR_IP 5446
telnet $OTHER_SERVER_HOSTNAME_OR_IP 5447
telnet $OTHER_SERVER_HOSTNAME_OR_IP 5448
telnet $OTHER_SERVER_HOSTNAME_OR_IP 5449

If any of the above commands returns the following output (or a similar networking error), there is a networking issue that needs to be resolved:

telnet: Unable to connect to remote host: No route to host

If any of those ports are blocked (for example by the OS, a firewall, or some other active networking element), the blocking component will need to be reconfigured to allow connectivity between all the servers on these ports. Once network connectivity is restored, this issue should no longer occur.
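The manual telnet checks can also be scripted. The following is a minimal sketch, not an official CloudBees tool: it assumes bash (built with /dev/tcp support) and the coreutils timeout command are available, and the function names check_port and check_cluster_ports are hypothetical. Run it on each server, passing the hostnames or IP addresses of the other servers in the cluster.

```shell
#!/usr/bin/env bash
# check_port HOST PORT: succeeds if a TCP connection to HOST:PORT can be opened.
check_port() {
  local host="$1" port="$2"
  # bash treats /dev/tcp/HOST/PORT redirections as TCP connections;
  # timeout caps each attempt at 5 seconds (an arbitrary choice).
  timeout 5 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# check_cluster_ports HOST...: print one result line per host and port,
# covering ports 5445 through 5449 from this article.
check_cluster_ports() {
  local host port
  for host in "$@"; do
    for port in 5445 5446 5447 5448 5449; do
      if check_port "$host" "$port"; then
        echo "OK   ${host}:${port}"
      else
        echo "FAIL ${host}:${port}"
      fi
    done
  done
}
```

For example, `check_cluster_ports $OTHER_SERVER_HOSTNAME_OR_IP` tests all five ports on one peer. A FAIL line only means the connection could not be opened; it does not distinguish "No route to host" from a refused connection or a timeout, so re-run the matching telnet command to see the exact error.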

Tested product version

CloudBees Flow v8.0.5