Implemented automatic node-failover
We have implemented a fully automated system on our backup server that will start our backup node once the system determines that the main node is no longer reachable.
Since March 2019 we have had a backup server ready, but whenever the main node failed (mainly due to internet or power failures), we had to start the backup node manually. This of course introduces a delay between discovering the failure and having the backup node up and running.
We thought there must be a better way, and coded a system to fully automate the checking, starting and stopping of the node.
Our main LTO node is checked every 2 minutes to see if it is still available. If the system determines that it is not available, a final check is performed 1 minute later. If the main node is still not available, the backup node is started. Starting the backup node and making it available on the LTO network takes less than a minute*).
While the backup node is running, the system keeps checking every 2 minutes whether our main node is available again. If it is, the backup node is stopped, so that our main node is once again the only node mining.
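The check, start and stop logic described above can be sketched as a small state machine. This is only an illustration under assumptions, not our production code: the names FailoverState and step and the action strings are hypothetical, and the way the main node is probed and the backup node is started or stopped is left out on purpose. Only the 2-minute and 1-minute timings come from the description above.

```python
from dataclasses import dataclass

CHECK_INTERVAL_S = 120  # regular check every 2 minutes (from the text)
CONFIRM_DELAY_S = 60    # final check 1 minute after a failed check (from the text)


@dataclass(frozen=True)
class FailoverState:
    """Hypothetical state holder; the names are illustrative only."""
    backup_running: bool = False
    awaiting_confirmation: bool = False


def step(state: FailoverState, main_reachable: bool):
    """Process one check result.

    Returns (new_state, action, seconds_until_next_check), where action
    is one of "none", "start_backup" or "stop_backup".
    """
    if state.backup_running:
        if main_reachable:
            # Main node is back: stop the backup so only one node mines.
            return FailoverState(), "stop_backup", CHECK_INTERVAL_S
        return state, "none", CHECK_INTERVAL_S

    if main_reachable:
        # Healthy, or a transient failure that cleared before confirmation.
        return FailoverState(), "none", CHECK_INTERVAL_S

    if state.awaiting_confirmation:
        # Second consecutive failed check: fail over to the backup node.
        return FailoverState(backup_running=True), "start_backup", CHECK_INTERVAL_S

    # First failed check: schedule the final check one minute later.
    return FailoverState(awaiting_confirmation=True), "none", CONFIRM_DELAY_S
```

A scheduler would call step after every probe, carry out the returned action and sleep for the returned number of seconds. Keeping the decision logic free of network and process calls makes it easy to test without touching either server.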
Our monitoring service cannot be switched over automatically from the main node to the backup node, so we had to implement a workaround. After some testing, it looks like we have come up with a solution. It does mean we had to start a new monitor on Jun-06-2020.
As you can see from the implementation described above, when our main server is unreachable, the maximum downtime is approximately 4 minutes.
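The 4-minute figure follows from adding up the worst case of each stage: the main node fails just after a successful check, so up to 2 minutes pass before the next check notices, 1 more minute passes until the final check, and the backup node then needs less than a minute to start. A short sketch of that arithmetic, where the constant and function names are illustrative and the start time is taken at its upper bound of 60 seconds:

```python
CHECK_INTERVAL_S = 120    # worst case: failure right after a successful check
CONFIRM_DELAY_S = 60      # wait before the final check
BACKUP_START_MAX_S = 60   # upper bound; the actual start takes less than a minute


def worst_case_downtime_s() -> int:
    """Upper bound on downtime in seconds for the timings above."""
    return CHECK_INTERVAL_S + CONFIRM_DELAY_S + BACKUP_START_MAX_S


print(worst_case_downtime_s() / 60)  # 4.0 (minutes)
```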
*) We keep a synchronized copy of the LTO blockchain on both of our servers.