HA (active/passive) Cluster doesn’t start when the messaging database is corrupted

Issue

In an HA configuration, the following stack trace appears in the logs during startup:

2016-08-24 11:06:12.763-0400 [id=53]	SEVERE	jenkins.InitReactorRunner$1#onTaskFailed: Failed Messaging.afterExtensionsAugmented
java.lang.Error: java.lang.reflect.InvocationTargetException
	at hudson.init.TaskMethodFinder.invoke(TaskMethodFinder.java:110)
	at hudson.init.TaskMethodFinder$TaskImpl.run(TaskMethodFinder.java:176)
	at org.jvnet.hudson.reactor.Reactor.runTask(Reactor.java:282)
	at jenkins.model.Jenkins$8.runTask(Jenkins.java:926)
	at org.jvnet.hudson.reactor.Reactor$2.run(Reactor.java:210)
	at org.jvnet.hudson.reactor.Reactor$Node.run(Reactor.java:117)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at hudson.init.TaskMethodFinder.invoke(TaskMethodFinder.java:106)
	... 8 more
Caused by: java.io.IOError: java.io.EOFException
	at org.mapdb.Volume$FileChannelVol.getDataInput(Volume.java:1011)
	at org.mapdb.Volume$FileChannelVol.getDataInput(Volume.java:781)
	at org.mapdb.StoreDirect.get2(StoreDirect.java:469)
	at org.mapdb.StoreWAL.get2(StoreWAL.java:336)
	at org.mapdb.StoreWAL.get(StoreWAL.java:320)
	at org.mapdb.Caches$HashTable.get(Caches.java:246)
	at org.mapdb.EngineWrapper.get(EngineWrapper.java:58)
	at org.mapdb.BTreeMap.<init>(BTreeMap.java:541)
	at org.mapdb.DB.getTreeMap(DB.java:805)
	at com.cloudbees.opscenter.context.Messaging$Local.open(Messaging.java:611)
	at com.cloudbees.opscenter.context.Messaging$Local.access$400(Messaging.java:541)
	at com.cloudbees.opscenter.context.Messaging.open(Messaging.java:484)
	at com.cloudbees.opscenter.context.Messaging.afterExtensionsAugmented(Messaging.java:59)
	... 13 more
Caused by: java.io.EOFException
	at org.mapdb.Volume$FileChannelVol.readFully(Volume.java:947)
	at org.mapdb.Volume$FileChannelVol.getDataInput(Volume.java:1008)
	... 25 more

Environment

Resolution

The messaging database is corrupted, which prevents the cluster from starting, and it cannot be recreated automatically.

Stop all HA nodes.
Back up and remove the files $JENKINS_HOME/messaging, $JENKINS_HOME/messaging.p and $JENKINS_HOME/messaging.t.
Start the cluster.
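
The backup-and-remove step can be sketched as a small shell function. This is a minimal sketch, not an official procedure: the function name backup_messaging_db and the backup directory are hypothetical, and it assumes all HA nodes are already stopped. Moving the files both backs them up and removes them from $JENKINS_HOME.

```shell
#!/bin/sh
# Sketch only: move the messaging database files out of a Jenkins home
# directory into a backup directory. Run it only after all HA nodes are stopped.
# The function name and the backup location are assumptions, not product tooling.
backup_messaging_db() {
    jenkins_home=$1
    backup_dir=$2
    mkdir -p "$backup_dir" || return 1
    for name in messaging messaging.p messaging.t; do
        # Skip entries that do not exist so a partial set does not abort the run.
        if [ -e "$jenkins_home/$name" ]; then
            mv "$jenkins_home/$name" "$backup_dir/" || return 1
        fi
    done
}
```

A possible invocation, with a backup path chosen purely for illustration, is: backup_messaging_db "$JENKINS_HOME" "/var/backups/jenkins-messaging"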

Resulting Issue that needs to be fixed

Because the messaging database was deleted, the messaging state of the Operations Center and the controller will be out of sync. This must be corrected with the following steps.

Run this script in the Operations Center under Manage Jenkins > Script Console. It prints a list of connected controllers and their respective instance IDs.
Find the instance ID of the controller whose database files you just removed, then look for that instance ID in the section beginning with `maxPulls: `.
Record the number shown there: `- $INSTANCE_ID: $NUMBER `
Run this script in the Manage Jenkins > Script Console of the controller:

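import com.cloudbees.opscenter.context.Messaging

// Print the current outbox sequence ID for reference.
println Messaging.getInstance().local.outboxSequenceId
// Set it to one more than the recorded value ($NUMBER).
Messaging.getInstance().local.outboxSequenceId.set($NUMBER + 1)
// Print the updated value to confirm the change.
println Messaging.getInstance().local.outboxSequenceId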

Replace $NUMBER with the value recorded from the Operations Center script output above.

This restores the controller's outboxSequenceId to its value from before the database files were removed.