ef5a9809f2
The failback process needs to be deterministic rather than relying on various incarnations of Thread.sleep() at crucial points. Important aspects of this change include:

1) Make the initial replication synchronization process block at the very last step and wait for a response from the replica to ensure the replica has all the necessary data. This is a critical piece of knowledge during the failback process because it lets the soon-to-become-backup server know for sure when it can shut itself down and allow the soon-to-become-live server to take over. Also, introduce a new configuration element called "initial-replication-sync-timeout" to control how long this blocking will last.

2) Set the state of the server to 'LIVE' only after the server is fully started. This is necessary because once the soon-to-be-backup server shuts down, it needs to know that the soon-to-be-live server has started fully before it restarts itself as the new backup. If the soon-to-be-backup server restarts before the soon-to-be-live server is fully started, it won't actually become a backup server but will instead become a live server, which breaks the failback process.

3) Wait to receive the announcement of a backup server before failing back.
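
As a rough illustration of point 1, the sketch below shows one way to replace a Thread.sleep() with a bounded, blocking wait for the replica's response. It is not the code from this change: the class name ReplicationSyncGate and the methods onReplicaAck() and awaitReplicaAck() are hypothetical, and the timeout argument only stands in for the value that "initial-replication-sync-timeout" would supply.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of the actual change: the live server blocks
// on this gate at the last step of the initial synchronization.
final class ReplicationSyncGate {
    // Released exactly once, when the replica confirms it has all the data.
    private final CountDownLatch replicaAck = new CountDownLatch(1);

    // Assumed to be called by whatever handles the replica's response.
    void onReplicaAck() {
        replicaAck.countDown();
    }

    // Blocks until the replica responds or the timeout elapses.
    // timeoutMillis stands in for "initial-replication-sync-timeout".
    // Returns true only if the response arrived in time.
    boolean awaitReplicaAck(long timeoutMillis) {
        try {
            return replicaAck.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            // Preserve the interrupt and report the sync as unconfirmed.
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

With a gate like this, the soon-to-become-backup server would shut itself down only when awaitReplicaAck() returns true. A false result means the replica never confirmed within the timeout, so it is not safe to assume the replica has the necessary data.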