At Hetzner you can get very cheap servers. If your application stack can handle failovers and the like, it’s a cheap place to build a fairly large setup. One thing that’s a bit different from most other colocators I know is their network setup. They actually route all traffic via managed switches to your machine, so every machine is in its own network. That can be a problem if you want to do cool stuff like moving an IP address on the fly.
Luckily, they provide “Failover IP” addresses, which you can allocate to a server and later switch to another server. But only via a web interface. The web interface also has an API, which makes things a bit easier. For one of our customers, we wrote an OCF script that can perform the failover, so we can use heartbeat and pacemaker over there.
Due to the fact that pacemaker expects all variables to be the same on both machines, we need to use several data sources. We’ve created it as follows:
- An OCF script that calls a Python script for assigning the failover IP
- The aforementioned Python script, which reads some variables from a local file (defaults to /etc/hetzner.cfg) and which actually talks to the API to switch the IP address or check if the IP address is currently assigned to this host
- A local config file which is read by the Python script and contains the Hetzner API credentials and the local machine IP address.
The local IP address in the configuration file is needed because we run all important stuff in VMs and the API expects the IP address of the iron to which you want the failover IP to point. Usually, you do not have access to the local IP address, which is why we simply set it up in the configuration file. The Python script is fairly simple. You can run it with -h to see the possible commands you can give it. The config file probably requires some explanation:
[dummy]
user = #12345+RaNdM
pass = sEcReT
local_ip = 1.2.3.4
The user and pass can be generated from the Hetzner Robot interface. When you have selected the server to which the failover IP is assigned, select the Admin option and request new credentials. These are specific to that machine and all resources assigned to that machine. This is a safety measure. The local IP is the primary IP address of the local machine. So if you want to be able to switch the failover IP address to the machine with the local IP address of 2.3.4.5, that machine will have local_ip = 2.3.4.5 in its /etc/hetzner.cfg file. Are you still following this? Good!
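To make the config format concrete, here is a minimal sketch of how such a file could be read with Python's standard configparser module. This is not the original script: the function name load_hetzner_config and the returned dictionary layout are assumptions made for illustration, and the sample text simply mirrors the example above.

```python
import configparser

# Sample contents mirroring the example config above (placeholder values).
SAMPLE = """\
[dummy]
user = #12345+RaNdM
pass = sEcReT
local_ip = 1.2.3.4
"""


def load_hetzner_config(text, section="dummy"):
    """Parse config text and return the three settings as a dict.

    Hypothetical helper; the real script may read the file differently.
    """
    # interpolation=None keeps '%' characters in passwords literal.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    sec = parser[section]
    return {
        "user": sec["user"],
        "password": sec["pass"],
        "local_ip": sec["local_ip"],
    }


settings = load_hetzner_config(SAMPLE)
print(settings["local_ip"])
```

Note that a user name starting with # is still read as a value here, because configparser only treats # as a comment marker at the start of a line unless inline comments are explicitly enabled.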
Now, using the OCF script is simple. Add it to /usr/lib/ocf/resource.d/kumina/hetzner-failover-ip and set up your CRM configuration as follows:
primitive IP_mysql ocf:kumina:hetzner-failover-ip \
op start interval="0" timeout="300s" \
op monitor interval="60s" timeout="300s" \
params ip="1.1.1.1" script="/usr/local/sbin/parse-hetzner-json.py"
The 1.1.1.1 should be replaced with your failover IP, of course. The script parameter has to point at the Python script. If you want to use another configuration file, you can change it into /usr/local/sbin/parse-hetzner-json.py -c /etc/myconfig.hetz or something that suits your fancy. The long timeout is needed because the Hetzner API is a slow beast. (On a related note, I think it’s possible to change the OCF script to use this timeout as a default, but I couldn’t find out how quickly.)
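For readers curious what talking to the API roughly involves, below is a hedged sketch that only builds the HTTP requests and does not send them. The URL matches the one used with curl in the comments below; the active_server_ip form field and the function name build_failover_request are assumptions for illustration and may not match what the original script does.

```python
import base64
import urllib.parse
import urllib.request

# Endpoint as used with curl in the comment thread.
API_BASE = "https://robot-ws.your-server.de/failover/"


def build_failover_request(failover_ip, user, password, target_ip=None):
    """Build (but do not send) a request for a failover IP.

    Without target_ip this is a GET that queries the current routing.
    With target_ip it becomes a POST asking to route the failover IP
    to that server; the field name active_server_ip is an assumption.
    """
    data = None
    if target_ip is not None:
        data = urllib.parse.urlencode({"active_server_ip": target_ip}).encode("ascii")
    request = urllib.request.Request(API_BASE + failover_ip, data=data)
    # HTTP basic auth with the credentials from the config file.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request.add_header("Authorization", "Basic " + token)
    return request


check = build_failover_request("1.1.1.1", "login", "password")
switch = build_failover_request("1.1.1.1", "login", "password", target_ip="2.3.4.5")
print(check.get_method(), check.full_url)
print(switch.get_method(), switch.data)
```

Sending either request would be a matter of passing it to urllib.request.urlopen with a generous timeout, keeping in mind that every call counts towards the API rate limit discussed in the comments.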
Do let us know if you have questions or if this helped you!
The files:
Update: Add monitor statement to CRM configuration, to work with scenarios where failover addresses are modified manually.
Update 2: Kumina no longer uses the code above at this moment, therefore the code is no longer maintained by us.
Your work is very nice. To avoid checking the Hetzner failover IP with such a delay, I preferred an haproxy heartbeat on a local meshed VPN built with tincd. Since the failover IP is bound to the machine’s real IP, bridged to a VM with a tincd local VPN IP… I monitor failover “locally” in a more efficient way than you can with those delays, but still using your script.
I still have to switch variables once a failover transition occurs (and a new failoverip-realip association arises).
Doei! 😉
Many thanks, it’s very helpful and saved me time
Thanks a lot!
Hi Tim,
Is my Heartbeat config correct?
logfacility daemon
keepalive 2
deadtime 15
warntime 5
initdead 120
udpport 694
ucast eth0 IP_OF_THE_SECOND_SERVER
auto_failback on
node ha1
node ha2
use_logd yes
crm respawn
I ran the OCF script from a shell and I get:
/usr/lib/ocf/resource.d/company/hetzner-failover-ip start
/usr/lib/ocf/resource.d/company/hetzner-failover-ip: 165: -s: not found
hetzner-failover-ip[16916]: DEBUG: default start : 0
What I have changed in the script downloaded from your site is:
OCF_ROOT=/usr/lib/ocf
Do I have to make any more changes to it?
I have no idea what’s going wrong, then. I’d suggest trying to debug from within the OCF script, add a log file or something to catch the actual error message.
I run it manually like this:
/usr/local/sbin/parse-hetzner-json.py --ip=my_failover_IP -s -c /etc/hetzner.cfg
and it moves the IP to the local_ip specified in /etc/hetzner.cfg
It should probably work like this too:
/usr/local/sbin/parse-hetzner-json.py --ip=1.2.3.4 -s
This is the interesting line:
WARN: unpack_rsc_op: Processing failed op IP_mysql_monitor_0 on ha2: unknown exec error (-2)
Try running the script by hand and see what error you’re getting?
Hello Tim,
I’m getting this output in /var/log/daemon.log on the second node of the cluster when I
run /etc/init.d/networking stop on the first node:
http://pastebin.com/0Z1Gj7j0
Can you please help me understand what is wrong?
Thanks in advance.
Thanks for these scripts. Why do you say failover is slow? It should not depend on the monitor interval of the failover IP, should it?
According to the Hetzner docs, monitoring is limited to 100 requests per hour. For a 2-node setup (cloneset) that should allow a monitor interval of about 80 seconds.
Or am I missing something here?
80 if no failovers occur, because those are counted towards your maximum. And 80 if you only use 1 IP failover address. We use five addresses, currently, and the limit is global. It kind of adds up really quick. Also, 80 seconds is fairly long, in other setups we generally have a monitor interval of 10 seconds for plain failover IP addresses.
Thanks a lot!
It’s an inconvenience, indeed. We’ve run into that problem as well. The solution is to increase the monitoring interval to something like 600 seconds. This makes failover rather slow, however. The guys at Hetzner don’t seem to understand the concept of “failover” mechanism, I’m afraid.
We actually left them because of all these kinds of problems. If you’re interested in another hoster, you might want to check out our other project, https://www.twenty-five.nl. At least you get to talk to people there who know what they’re talking about 😉
Thanks for the answer.
I deleted the resource and added it again, and now it works.
But now I have another problem.
It seems Hetzner limits the number of connections per hour to its API:
In /var/log/messages
WARN: unpack_rsc_op: Processing failed op ClusterIP01_monitor_60000 on frontend02-nginx: unknown error (1)
and if I query the failover IP manually:
curl -u login:password https://robot-ws.your-server.de/failover/1.1.1.1
I get
{"error":{"status":403,"code":"RATE_LIMIT_EXCEEDED","max_requests":100,"interval":3600,"message":"rate
but if I want to route the failover IP to another server, everything works well.
Is it critical for pacemaker/heartbeat?
I have configured pacemaker as described here, but the resource does not work. It hangs in “Stopped” status:
IP_mysql (ocf::kumina:hetzner-failover-ip): Stopped
All scripts are executable.
My crm config:
node $id="d8e93a90-765c-4ba8-9ee3-d111adfacf9c" host02
node $id="e5c7ff03-b032-4052-964f-248c04aa031b" host01
primitive IP_mysql ocf:kumina:hetzner-failover-ip \
op start interval="0" timeout="300s" \
op monitor interval="60s" timeout="300s" \
params ip="1.1.1.1" script="/usr/local/sbin/parse-hetzner-json.py"
property $id="cib-bootstrap-options" \
dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
cluster-infrastructure="Heartbeat" \
no-quorum-policy="ignore" \
stonith-enabled="false" \
symmetric-cluster="false"
How can I investigate what’s wrong?
Did you check the output from pacemaker? Does it even try to run the script? Does that script run when you try it manually?
Hi guys
Great tutorial, but I am having problems setting up heartbeat. Could you please share what your ha.cf looks like?
Thanks,
A
So the solution must be DNAT. I was supposing that, but this is an additional confirmation.
Thanks a lot and keep up the good work!
Hi RaSca,
You can’t at Hetzner. You can only fail over to another physical machine at Hetzner. We use iptables on that host to forward the packets to the VM.
Great job guys!
Have you any suggestion on how to associate this failover IP to an internal virtual machine? I mean, what if I want to set a virtual gateway with a WAN and a LAN on which the WAN IS the failover IP?
Thanks a lot for this precious job!
RaSca