Issue
This issue can present itself in different ways, for example:
- When you start up a third node of a cluster, the other two nodes restart unexpectedly.
- When you start up a specific combination of two nodes, they fail to form a cluster and act as standalone instances.
The error message to look for in the commander.log file (the log paths are documented here) that indicates this issue is:
2020-06-13T12:46:34.854 | cmdr2 | ERROR | .ActiveMQServerImpl$3@79d6051) | | | | client | AMQ214016: Failed to create netty connection io.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: /${IP_ADDRESS}:5445
Note that the port number :5445 at the end of that log line could instead be any of the other ports used for Zookeeper/Hornetq/ActiveMQ/JGroups communication, as documented in Port usage. These are ports 5445 through 5449.
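To see which endpoints the server has failed to reach, the log can be searched for the error code. The following is a minimal sketch, not a CloudBees-provided tool: the function name find_unreachable_endpoints is hypothetical, and the log path is passed in as an argument because it depends on your installation. It assumes the log lines have the same shape as the example above.

```shell
# Hypothetical helper: print each unique IP:port that appears in an
# AMQ214016 "No route to host" line of the given log file.
# Usage: find_unreachable_endpoints /path/to/commander.log
find_unreachable_endpoints() {
  grep 'AMQ214016' "$1" \
    | grep -o 'No route to host: /[^ ]*' \
    | sed 's|.*/||' \
    | sort -u
}
```

Each line of output is one IP:port pair that the server reported as unreachable, which narrows down which server and port to test first.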
Environment
- CloudBees CD (CloudBees Flow) running in clustered mode (more than one server) on Traditional Platforms
Resolution
Ports 5445 through 5449 are used for Zookeeper/Hornetq/ActiveMQ/JGroups communication, as documented in Port usage. The error above most likely means there is a networking issue between the servers on one or more of those ports.
To confirm which port is blocked between which servers:
- Log in to each of the servers in the cluster (e.g. ssh into each server).
- Run the following commands to test connectivity from each server to all other servers in the cluster on these ports:
telnet $OTHER_SERVER_HOSTNAME_OR_IP 5445
telnet $OTHER_SERVER_HOSTNAME_OR_IP 5446
telnet $OTHER_SERVER_HOSTNAME_OR_IP 5447
telnet $OTHER_SERVER_HOSTNAME_OR_IP 5448
telnet $OTHER_SERVER_HOSTNAME_OR_IP 5449
If any of the above commands returns the following output (or a similar networking error), there is a networking issue that needs to be resolved:
telnet: Unable to connect to remote host: No route to host
If any of those ports is blocked (for example by the OS, a firewall, or some other active networking element), that element will need to be reconfigured to allow connectivity between all the servers on these ports. Once network connectivity is restored, this issue should no longer occur.
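The telnet checks above can also be scripted, which is convenient for re-testing every server and port after a firewall change. This is a minimal sketch, not an official tool: the function names check_port and check_cluster_ports are hypothetical, and it assumes that bash and the coreutils timeout command are available on the server.

```shell
# Hypothetical helper: succeed if a TCP connection to HOST:PORT can be opened.
# Uses bash's /dev/tcp pseudo-device; timeout limits the wait to 3 seconds.
check_port() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Hypothetical helper: test ports 5445 through 5449 on every host given
# as an argument and print one result line per host/port pair.
check_cluster_ports() {
  for host in "$@"; do
    for port in 5445 5446 5447 5448 5449; do
      if check_port "$host" "$port"; then
        echo "OK     ${host}:${port}"
      else
        echo "FAILED ${host}:${port}"
      fi
    done
  done
}
```

After loading these functions on one server, run check_cluster_ports with the hostnames or IP addresses of the other cluster members as arguments, then repeat on each of the other servers. A FAILED line only shows that the connection could not be opened. It does not distinguish "No route to host" from a refused connection (for example, when the CloudBees CD server on the target node is stopped), so run telnet against any failing pair to see the exact error.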