HA Cluster doesn’t start when messaging database is corrupted

Article ID: 227852187

Issue

  • In an HA configuration, you cannot access the UI, which shows this error instead:

Error

Jenkins detected that you appear running more than one instance of Jenkins that share the same home directory '/path/to/jenkins_home'. This is greatly confuses Jenkins and you will likely experience strange behaviors, so please correct the situation.

This Jenkins: 40843708317 contextPath:"/jenkins" 32323@hostname_node1
Other Jenkins: 138947981375 contextPath:"/jenkins" 4223@hostname_node2

Ignore this problem and keep using Jenkins anyway
  • The logs also show this stack trace during startup:

2016-08-24 11:06:12.763-0400 [id=53]	SEVERE	jenkins.InitReactorRunner$1#onTaskFailed: Failed Messaging.afterExtensionsAugmented
java.lang.Error: java.lang.reflect.InvocationTargetException
	at hudson.init.TaskMethodFinder.invoke(TaskMethodFinder.java:110)
	at hudson.init.TaskMethodFinder$TaskImpl.run(TaskMethodFinder.java:176)
	at org.jvnet.hudson.reactor.Reactor.runTask(Reactor.java:282)
	at jenkins.model.Jenkins$8.runTask(Jenkins.java:926)
	at org.jvnet.hudson.reactor.Reactor$2.run(Reactor.java:210)
	at org.jvnet.hudson.reactor.Reactor$Node.run(Reactor.java:117)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at hudson.init.TaskMethodFinder.invoke(TaskMethodFinder.java:106)
	... 8 more
Caused by: java.io.IOError: java.io.EOFException
	at org.mapdb.Volume$FileChannelVol.getDataInput(Volume.java:1011)
	at org.mapdb.Volume$FileChannelVol.getDataInput(Volume.java:781)
	at org.mapdb.StoreDirect.get2(StoreDirect.java:469)
	at org.mapdb.StoreWAL.get2(StoreWAL.java:336)
	at org.mapdb.StoreWAL.get(StoreWAL.java:320)
	at org.mapdb.Caches$HashTable.get(Caches.java:246)
	at org.mapdb.EngineWrapper.get(EngineWrapper.java:58)
	at org.mapdb.BTreeMap.<init>(BTreeMap.java:541)
	at org.mapdb.DB.getTreeMap(DB.java:805)
	at com.cloudbees.opscenter.context.Messaging$Local.open(Messaging.java:611)
	at com.cloudbees.opscenter.context.Messaging$Local.access$400(Messaging.java:541)
	at com.cloudbees.opscenter.context.Messaging.open(Messaging.java:484)
	at com.cloudbees.opscenter.context.Messaging.afterExtensionsAugmented(Messaging.java:59)
	... 13 more
Caused by: java.io.EOFException
	at org.mapdb.Volume$FileChannelVol.readFully(Volume.java:947)
	at org.mapdb.Volume$FileChannelVol.getDataInput(Volume.java:1008)
	... 25 more

Resolution

The messaging database files are corrupted. This prevents the cluster from starting, and the files are not recreated automatically while the corrupted copies are in place. To recover:

  1. Stop all HA nodes.

  2. Back up and then remove the files $JENKINS_HOME/messaging, $JENKINS_HOME/messaging.p and $JENKINS_HOME/messaging.t.

  3. Start the cluster.
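
Step 2 above can be sketched as a small standalone Groovy script. This is only an illustration, not part of the product: the helper name backupAndRemoveMessagingDb and the example paths are hypothetical, and it assumes a Groovy interpreter is available on the machine that hosts $JENKINS_HOME. Because all HA nodes are stopped at this point, it cannot be run from the Script Console. Moving each file into a backup directory performs the backup and the removal in one operation.

```groovy
import java.nio.file.Files
import java.nio.file.Path
import java.nio.file.Paths
import java.nio.file.StandardCopyOption

// Hypothetical helper: moves the three messaging database files out of
// jenkinsHome into backupDir and returns the names of the files it moved.
// Run it only while every HA node is stopped.
List<String> backupAndRemoveMessagingDb(Path jenkinsHome, Path backupDir) {
    Files.createDirectories(backupDir)
    def moved = []
    for (name in ['messaging', 'messaging.p', 'messaging.t']) {
        Path source = jenkinsHome.resolve(name)
        if (Files.exists(source)) {
            // Moving the file is both the backup and the removal.
            Files.move(source, backupDir.resolve(name), StandardCopyOption.REPLACE_EXISTING)
            moved << name
        }
    }
    return moved
}

// Example call (both paths are placeholders, adjust them before use):
// println backupAndRemoveMessagingDb(Paths.get('/path/to/jenkins_home'), Paths.get('/path/to/backup'))
```

Files that do not exist are skipped, so the returned list shows which of the three files were actually backed up.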

Resulting Issue that needs to be fixed

Because this procedure deletes the messaging database, messaging between Operations Center and the controller will be out of sync. To correct that:

  1. Run this script in Manage Jenkins > Script Console on Operations Center (CJOC). The script prints a list of connected controllers and their respective instance IDs.

  2. Find the instance ID of the controller whose database files you just removed, then look for that instance ID in the section beginning with `maxPulls: `.

  3. Record the number shown on that line: `- $INSTANCE_ID: $NUMBER `

  4. Run this script in Manage Jenkins > Script Console on the controller:

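import com.cloudbees.opscenter.context.Messaging;

// Print the current outbox sequence ID of this controller.
println Messaging.getInstance().local.outboxSequenceId
// Set it to one more than the number recorded in step 3.
Messaging.getInstance().local.outboxSequenceId.set($NUMBER+1);
// Print the updated value to confirm the change.
println Messaging.getInstance().local.outboxSequenceId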

Replace $NUMBER with the value recorded from the Operations Center (CJOC) script output above.

This restores the controller's outboxSequenceId to the value it had before the database files were removed.
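
Steps 2 and 3 can also be done programmatically when many controllers are listed. The sketch below is a hypothetical helper, not part of Operations Center. It assumes the script output has been saved as text, that it contains a line starting with `maxPulls:`, and that the entries after that line have exactly the form `- $INSTANCE_ID: $NUMBER`. If the real output is laid out differently, the pattern needs adjusting. The helper returns $NUMBER + 1, the value to pass to outboxSequenceId.set(...).

```groovy
import java.util.regex.Pattern

// Hypothetical helper. Assumed input format: a line starting with "maxPulls:"
// followed by entries of the form "- <instanceId>: <number>".
// Returns number + 1 for the given instance ID, or null if it is not found.
Long nextOutboxSequenceId(String scriptOutput, String instanceId) {
    def entry = Pattern.compile('^-\\s*' + Pattern.quote(instanceId) + ':\\s*(\\d+)\\s*$')
    boolean inMaxPulls = false
    for (String line : scriptOutput.readLines()) {
        String trimmed = line.trim()
        if (trimmed.startsWith('maxPulls:')) {
            inMaxPulls = true
            continue
        }
        if (!inMaxPulls) continue
        // A non-empty line that is not a "- ..." entry ends the maxPulls section.
        if (trimmed && !trimmed.startsWith('-')) break
        def m = entry.matcher(trimmed)
        if (m.matches()) {
            return Long.parseLong(m.group(1)) + 1
        }
    }
    return null
}
```

A null result means the instance ID was not found in the `maxPulls:` section, in which case check the output by hand rather than guessing a value.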