Our OCF script for failovers at Hetzner worked flawlessly for the past month. Last week, however, a problem arose that we did not anticipate. The webservice returns an HTTP status code (as is expected from a webserver), and our script did not handle HTTP error codes.
An HTTP response in the 4XX or 5XX range would kill the Python interpreter with a traceback from urllib2 and an exit code of 1. That exit code told the OCF script to return $OCF_NOT_RUNNING, which caused a failover. This wouldn’t be a problem in a normal operating environment.
Unfortunately, we noticed that the Hetzner failover webservice isn’t totally stable. When it returns an error, both hosts in the failover setup see it, both try to fail over, and havoc ensues. Fortunately, OCF has an error code that means a soft fail ($OCF_ERR_GENERIC). We can use this code to tell heartbeat that a temporary failure has occurred and that it should not fail over.
The parse-hetzner-json.py script now has a try-except construction for the HTTP requests and has 3 exit codes:
- 0: Everything OK, I have the failover-IP
- 1: Unknown Error, can’t get status of the failover-IP
- 2: Everything OK, I do not have the failover-IP
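As a rough illustration of the three exit codes, here is a minimal sketch in Python 3 (the original script uses urllib2 on Python 2). It is not the actual parse-hetzner-json.py: the function name `status_exit_code`, the `fetch` callable, and the JSON layout with a `failover` object holding `active_server_ip` are assumptions made for this example.

```python
import json
import urllib.error

EXIT_HAVE_IP = 0      # everything OK, this host has the failover-IP
EXIT_UNKNOWN = 1      # status could not be determined
EXIT_NOT_HAVE_IP = 2  # everything OK, another host has the failover-IP


def status_exit_code(fetch, my_ip):
    """Map the webservice answer to one of the three exit codes.

    `fetch` is a hypothetical callable that returns the raw JSON body
    as a string and may raise on HTTP or network errors.
    """
    try:
        data = json.loads(fetch())
        # Assumed response layout, not taken from the original script.
        active_ip = data["failover"]["active_server_ip"]
    except (urllib.error.URLError, OSError, ValueError, KeyError, TypeError):
        # HTTP 4XX/5XX, unreachable service, or unparsable JSON:
        # report "unknown" instead of dying with a traceback.
        return EXIT_UNKNOWN
    return EXIT_HAVE_IP if active_ip == my_ip else EXIT_NOT_HAVE_IP
```

A real script would pass the result to `sys.exit()`, with `fetch` wrapping the actual HTTP request.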
The exit codes are then processed by the hetzner-failover-ip OCF script as follows:
    ${OCF_RESKEY_script} -g -i ${OCF_RESKEY_ip}
    case $? in
        0)
            return $OCF_SUCCESS
            ;;
        2)
            return $OCF_NOT_RUNNING
            ;;
        *)
            sleep 30 # Do not DOS Hetzner
            return $OCF_ERR_GENERIC
            ;;
    esac
The sleep 30 is required because too many requests to the Hetzner failover webservice (which happen when $OCF_ERR_GENERIC is returned) get you banned for a couple of minutes with an HTTP 403 status.
Another advantage of the new exit codes (and of how they are processed): when the Python interpreter itself fails (exit code 1), $OCF_ERR_GENERIC is returned and no failover happens.
All of the above amounts to this: When the webservice is unreachable, the JSON is unparsable or something happens that isn’t meant to happen, heartbeat will soft-fail and not fail over.
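The mapping from exit code to OCF result can also be modelled in a few lines of Python, which is handy for checking the behaviour in a unit test. This is only a sketch: the function name `ocf_status` is made up for this example, and the 30-second sleep from the shell script is deliberately left out. The numeric values are the standard OCF return codes.

```python
# Standard OCF return codes.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7


def ocf_status(exit_code):
    """Translate the helper script's exit code into an OCF return code."""
    if exit_code == 0:
        return OCF_SUCCESS       # this host has the failover-IP
    if exit_code == 2:
        return OCF_NOT_RUNNING   # another host has the failover-IP
    # Exit code 1 (including an interpreter crash) and anything
    # unexpected is treated as a soft fail, so no failover happens.
    return OCF_ERR_GENERIC
```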
The files: