Posts Tagged ‘pacemaker’

Hetzner Failover IP OCF script part III: When HTTP attacks

Wednesday, March 16th, 2011

Our OCF script for failovers at Hetzner has worked flawlessly for the past month. Last week, however, a problem arose that we had not anticipated. The webservice returns an HTTP status code (as is expected from a webserver), and our script did not handle HTTP error codes.

An HTTP response in the 4XX or 5XX range would kill the Python interpreter with a traceback from urllib2 and an exit code of 1. That exit code told the OCF script to return $OCF_NOT_RUNNING, which caused a failover to occur. This wouldn’t be a problem in a normal operating environment.

Unfortunately, we noticed that the Hetzner failover webservice isn’t totally stable. When it returns errors, it does so for both hosts in the failover setup, which will then both try to fail over and cause havoc. Fortunately, OCF has an error code which means a soft fail ($OCF_ERR_GENERIC). We can use this code to tell heartbeat that a temporary failure has occurred and that it should not fail over.

The script now has a try-except construction for the HTTP requests and has 3 exit codes:

  • 0: Everything OK, I have the failover-IP
  • 1: Unknown Error, can’t get status of the failover-IP
  • 2: Everything OK, I do not have the failover-IP
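
A minimal sketch of such a try-except construction could look like the following. It is not the original script: it is written for Python 3 rather than urllib2, `fetch_status` is a hypothetical callable that returns the raw response body, and the `active_server_ip` field is an assumed name for the part of the JSON that says which server currently holds the failover-IP.

```python
import json

EXIT_HAVE_IP = 0      # everything OK, this host holds the failover-IP
EXIT_UNKNOWN = 1      # the status of the failover-IP could not be determined
EXIT_NOT_HAVE_IP = 2  # everything OK, another host holds the failover-IP


def status_exit_code(fetch_status, own_ip):
    """Map one status query to one of the three exit codes.

    fetch_status is a hypothetical callable returning the JSON body as text;
    "active_server_ip" is an assumed response field, not a confirmed one.
    """
    try:
        body = fetch_status()
        active_ip = json.loads(body)["active_server_ip"]
    except (OSError, ValueError, KeyError, TypeError):
        # OSError covers urllib.error.URLError and HTTPError (4XX/5XX) as
        # well as socket timeouts; ValueError covers unparsable JSON;
        # KeyError and TypeError cover JSON with an unexpected shape.
        return EXIT_UNKNOWN
    return EXIT_HAVE_IP if active_ip == own_ip else EXIT_NOT_HAVE_IP
```

A command line wrapper would then pass the result to `sys.exit()`, so that an HTTP error ends in exit code 1 through a deliberate code path instead of a traceback.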

The error-codes are then processed by the hetzner-failover-ip OCF script as follows:

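${OCF_RESKEY_script} -g -i ${OCF_RESKEY_ip}
case $? in
    0) return $OCF_SUCCESS ;;       # we have the failover-IP
    2) return $OCF_NOT_RUNNING ;;   # we do not have the failover-IP
    *) sleep 30 # Do not DOS Hetzner
       return $OCF_ERR_GENERIC ;;   # exit code 1 or anything unexpected
esac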

The sleep 30 is required because too many requests to the Hetzner failover webservice (which is what happens when $OCF_ERR_GENERIC is returned) will get you banned for a couple of minutes with an HTTP 403 status.

Another advantage of the new exit codes (and the way they are processed) is that when the Python interpreter itself fails (exit code 1), $OCF_ERR_GENERIC is returned and no failover will happen.

All of the above amounts to this: When the webservice is unreachable, the JSON is unparsable or something happens that isn’t meant to happen, heartbeat will soft-fail and not fail over.
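
To make that mapping explicit, here is a small sketch of the decision table in Python. The real logic lives in the shell OCF script; the function name is ours, and treating every exit code other than 0 and 2 as a soft fail is our reading of the behaviour described above. The numeric values are the standard OCF return codes.

```python
# Standard OCF return code values.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7


def ocf_status(helper_exit_code):
    """Translate the helper script's exit code into an OCF return code."""
    if helper_exit_code == 0:
        return OCF_SUCCESS       # this host has the failover-IP
    if helper_exit_code == 2:
        return OCF_NOT_RUNNING   # this host does not have the failover-IP
    # Exit code 1, an interpreter crash, or anything else unexpected:
    # report a soft fail so that no failover is triggered.
    return OCF_ERR_GENERIC
```

The important property is that $OCF_NOT_RUNNING can only be reached through an explicit exit code of 2, never by accident.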


Hetzner Failover IP OCF script part deux: local DNS resolving

Wednesday, February 23rd, 2011

Two weeks ago we published a script that allows one to update the failover address provided by Hetzner using an OCF script. This makes it possible to provide redundant services between two systems within the Hetzner network. Even though this script by itself seems to function properly, it does have one shortcoming.

Consider a setup where both systems provide a set of services that use the same data store (e.g. a MySQL database). Even though these database services are replicated, queries must always be processed by the master node. Naively, one could solve this by simply letting all these services use the failover address provided by Hetzner. This will not work, however: even though traffic from the outside will always be routed to exactly one of the two systems, both systems have the address defined locally, so a connection to the failover address never leaves the system that makes it. The only way to connect from one system to the other is by using the per-system (non-failover) addresses.

Hetzner Failover IP OCF script

Friday, February 11th, 2011

At Hetzner you can get very cheap servers. If your application stack can handle failovers and the like, it’s an inexpensive place to build a fairly large setup. One thing that’s a bit different from most other colocators I know is their network setup. They actually route all traffic via managed switches to your machine, so every machine is in its own network. That can be a problem if you want to do cool stuff like moving an IP address on the fly.

Luckily, they have provided “Failover IP” addresses, which you can allocate to a server and which you can switch to another server, but only via a web interface. The web interface also has an API, which makes things a bit easier. For one of our customers, we wrote an OCF script that can perform the failover, so we can use heartbeat and pacemaker over there.
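
To give an idea of what switching the address through an API involves, here is a rough sketch in Python 3 that builds such a request without sending it. This is not the code of our script. The base URL is a placeholder, and the `/failover/<ip>` path, the `active_server_ip` parameter and the use of HTTP basic authentication are assumptions for illustration that should be checked against Hetzner’s API documentation.

```python
import base64
import urllib.parse
import urllib.request


def build_switch_request(base_url, failover_ip, target_server_ip, user, password):
    """Build (but do not send) a request asking the API to reroute the failover IP.

    The URL layout and the parameter name are assumptions for illustration.
    """
    url = "{}/failover/{}".format(base_url.rstrip("/"), failover_ip)
    data = urllib.parse.urlencode({"active_server_ip": target_server_ip}).encode("ascii")
    request = urllib.request.Request(url, data=data, method="POST")
    # HTTP basic authentication header, built by hand to keep the sketch short.
    credentials = "{}:{}".format(user, password).encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    request.add_header("Authorization", "Basic " + token)
    return request
```

The resulting object can be passed to `urllib.request.urlopen()`. An OCF script would do this in its start action on the node that takes over.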