When server sends disconnect to the client, the ClientSession schedules
a close task on it's ordered executor. Once the close method starts
it's waits to check to see if all jobs in it's executor has completed.
To do this it adds a job to it's ordered executor, once it is run it
knows there is nothing more to do and thus is ready to close. However,
this causes a deadlock as both jobs are running in the ordered executor
and thus are both waiting on each other. The close eventually timesout
which is why we see the logs as reported in the JIRA.
This commit runs the close method in it's own ordered executor, thus
preventing the two jobs blocking each other.
The failback process needs to be deterministic rather than relying on various
incarnations of Thread.sleep() at crucial points. Important aspects of this
change include:
1) Make the initial replication synchronization process block at the very
last step and wait for a response from the replica to ensure the replica has
as the necessary data. This is a critical piece of knowledge during the
failback process because it allows the soon-to-become-backup server to know
for sure when it can shut itself down and allow the soon-to-become-live
server to take over. Also, introduce a new configuration element called
"initial-replication-sync-timeout" to conrol how long this blocking will occur.
2) Set the state of the server as 'LIVE' only after the server is fully
started. This is necessary because once the soon-to-be-backup server shuts
down it needs to know that the soon-to-be-live server has started fully before
it restarts itself as the new backup. If the soon-to-be-backup server restarts
before the soon-to-be-live is fully started then it won't actually become a
backup server but instead will become a live server which will break the
failback process.
3) Wait to receive the announcement of a backup server before failing-back.