OSX Failover - Part 1
Volume Number: 23 (2007)
Issue Number: 03
Column Tag: Network Administration
OSX Failover - Part 1
A Beginner's Guide
By Ben Greisler
Introduction
OS X Server has the capability to provide IP failover, a high availability feature that allows a secondary backup server to take over for a failed primary server. It is a great feature and can be very handy keeping your services available, but it has its limitations and constraints. We will review the basics of IP failover in this article and then expand on the concept in later issues. This is aimed at getting the beginner up and running with a minimum of hassle.
IP Failover Concepts
There are two major parts to the failover process: The primary server sending out notification that it is up and running and the secondary server monitoring the signal from the primary server. Kind of like, "Can you hear me now?" but without the primary server repeating "Good" after each question. This process is done via two daemons, heartbeatd and failoverd. Both are available on OS X Server, but not on OS X client.
On the primary server, heartbeatd sends out a message every second via port 1694 on both of the network interfaces involved in the process. This is the signal to the other machine in the failover pair that the primary is still alive and well, or at least well enough to keep a heartbeat going.
On the secondary server, failoverd listens for the heartbeat message on port 1694 on both network interfaces. If it stops receiving the heartbeat message it will start the failover process.
Initial configuration of IP failover starts in /etc/hostconfig where you define what role each server will be. We'll get into the specifics in the next section. There is a startup item at /System/Library/StartupItems/IPFailover that checks for configuration specifications and starts either heartbeatd or failoverd located in /usr/sbin as appropriate.
When failoverd on the secondary server realizes that it isn't receiving a heartbeat message, it sets off a series of events based on scripts located in /usr/libexec. The script NotifyFailover grabs the email address of failover recipient from /etc/hostconfig and sends a message to that address. It then utilizes the ProcessFailover script which will make an IP alias on a network interface, allowing the secondary server to take the IP address of the primary server. Both of these scripts are available for examination and are pretty well commented.
Another purpose of the ProcessFailover script is to execute scripts located in the /Library/IPFailover/ folder. This folder does not exist in a standard install of OS X Server and has to be created if needed. Within that folder can be 4 subfolders: PreAcq, PostAcq, PreRel and PostRel. You can utilize these folders to perform certain actions. The names are self-explanatory and define when the content scripts will be used (i.e.: before IP acquisition or after the IP release, etc). This is where the power and flexibility of IP failover resides.
More information can be found in the High Availability Administration document http://images.apple.com/server/pdfs/High_Availability_Admin_v10.4.pdf , but it does have some incorrect information as referenced in this Apple tech article: http://docs.info.apple.com/article.html?artnum=305066
Setting up IP Failover
In this article, we will set up the most basic IP failover configuration to show that it works. In general, IP failover can be done in three easy steps:
1. Set up OSX Server on two machines with appropriate network configurations.
2. Add the appropriate entries to /etc/hostconfig on both machines.
3. Reboot each machine and have a working IP failover pair.
Easy, huh? Ok, now to the steps needed to accommodate the above.
It is best that the two machines in the failover pair be as identical as possible. You wouldn't want the machines to be on different OS versions, or have a secondary server that can't handle the load that the primary server normally handles. It is also tempting to give the secondary server other work to do while it is just sitting there listening to the heartbeat of the primary server, but refrain from that. Its job is to be a backup server, pure and simple.
We need to set up two networks for the IP failover pair to join. One will probably be your existing network that your other machines use to connect to your server. The other network will be a private network that the pair will communicate over. Typically this will be IP over Firewire. You don't have to do it this way, but it does preserve your secondary Ethernet port on machines that have one and allows a private network on machines that don't have a second Ethernet port (i.e.: MacMini).
Let's set up our networking like this:
Primary Server
192.168.254.165 on en0
255.255.255.0 Subnet Mask
192.168.254.1 Gateway
10.0.0.165 on fw0
255.255.0.0 Subnet Mask
Secondary Server
192.168.254.170 on en0
255.255.255.0 Subnet Mask
192.168.254.1 Gateway
10.0.0.170 on fw0
255.255.0.0 Subnet Mask
Make sure that you have good DNS entries for both machines and test them. Do not enter DNS servers or gateway information in the Firewire interface.
Now, let's edit /etc/hostconfig on each server (using your favorite editor via sudo). Add the following lines:
Primary Server
FAILOVER_BCAST_IPS="192.168.254.170 10.0.0.170"
FAILOVER_EMAIL_RECIPIENT=user@domain.com
Secondary Server
FAILOVER_PEER_IP_PAIRS="en0:192.168.254.165"
FAILOVER_PEER_IP="10.0.0.165"
FAILOVER_EMAIL_RECIPIENT=user@domain.com
So, what does all that mean?
FAILOVER_BCAST_IPS="192.168.254.170 10.0.0.170"-This identifies to the primary server the IP addresses of the network interfaces of the secondary server. You can either specify the IP's of the secondary server or use the broadcast addresses for the subnet (i.e.: 192.168.254.255, 10.0.0.255)
FAILOVER_PEER_IP_PAIRS="en0:192.168.254.165"-This identifies the primary interface IP of the primary server. Note the syntax of "en0:" when creating your configuration.
FAILOVER_PEER_IP="10.0.0.165"-This identifies the secondary interface on the primary server. In this case it is the Firewire port (fw0).
FAILOVER_EMAIL_RECIPIENT=user@domain.com-This is the email address of the person who needs to know about failover actions. Make sure that your machine is configured to be able to send mail. You may need to configure SMTP services.
Hook up the servers to the Ethernet network and connect a Firewire cable between the two machines. Check that you can ping each machine on each interface from each machine. Both machines need to be able to see one another. Now restart the primary machine and then the secondary. This is important because if you start the secondary machine before the primary, it won't hear the heartbeat message from the primary and will try to failover immediately.
Ok, now that each server is up and running let's test it out. On a third machine, ping the primary server's public IP address. You should get a good solid return. Now open up Console on each machine and view the System log. Using tail on /var/log/system.log so you can see what is going on with each machine, alternately pull the Firewire cable and then Ethernet cable on the primary machine. You will notice that you stop getting ping responses from the primary server. Wait a few seconds and you should see the pings start to return again. This is the secondary machine reacting to the loss of the heartbeat message from the primary machine and initiating the ProcessFailover script to allow the secondary machine to acquire the IP of the primary machine. You have just gotten IP failover to work!
To failback, I suggest not just plugging the cables back into the primary machine. In a production environment you may have to shutdown the secondary server in a controlled manner, bring the primary back on line and then bring up the secondary. This is inconvenient as it would be great if you could just have everything failback to its original state, but practice has shown that this doesn't happen exactly the way you would want it to in every case.
Conclusion
So, it's great that we can failover from one server to another, but what good does this really do us? In the next article we will start making IP failover do some tricks for us that will be useful. Stay tuned!
References:
http://images.apple.com/server/pdfs/High_Availability_Admin_v10.4.pdf
http://docs.info.apple.com/article.html?artnum=305066
man heartbeatd
man failoverd
Ben has worked Apple based technology integration projects from Maine to Japan while learning all the way. When not collecting frequent flyer miles he spends his favorite time with his wife and 2.5 year old daughter at their home outside of Philadelphia. He can be reached at
magikben@mac.com.