VRRP for redundant network services

DRAFT DRAFT DRAFT

Abstract

The Virtual Router Redundancy Protocol (VRRP) was developed to eliminate single points of failure in statically routed, default gateway environments. In this paper I make use of VRRP to provide redundant access to network services rather than routing paths. I use a VRRP ``virtual router'' as a method of providing cheap, reliable and standards-based redundancy for a group of hosts serving common internet network services. I provide configuration hints and highlight problems that may be encountered.

Abstract
Virtual Router Redundancy Protocol (VRRP)
Test environment
1. Operating system, hardware, network
2. VRRP software
Miscellaneous notes
Test results
Advanced techniques
Pitfalls
Security considerations
Conclusion
Notes

Virtual Router Redundancy Protocol (VRRP)

VRRP is described in RFC 2338 as:

VRRP specifies an election protocol that dynamically assigns responsibility for a virtual router to one of the VRRP routers on a LAN. The VRRP router controlling the IP address(es) associated with a virtual router is called the Master, and forwards packets sent to these IP addresses. The election process provides dynamic fail over in the forwarding responsibility should the Master become unavailable.

VRRP was conceived to solve the problem of end-host over-reliance on a statically configured default routing gateway. Should the gateway cease to function, the hosts it services become completely stranded. VRRP, similar to the proprietary Hot Standby Router and IP Standby protocols, allows an administrator to make an IP address into a virtual address, dynamically assigning it to any member of a VRRP group. Each group is denoted by a number, called a VRID. In this fashion, the responsibility for the virtual address can be moved from one host to another either by administrative intervention or automatic detection of a failure. A master periodically makes a multicast announcement of its health. An announcement of this type is called a heartbeat. Hosts which are not the master for a given VRID are referred to here as backups or slaves. Should the master's heartbeat not be received for a period of time, the slaves for a VRID will promote one of their kind to be the new master.

These characteristics are desirable for more than routers. VRRP is generic enough -- in essence it moves an IP address among hosts -- to be taken advantage of in end-host service provision.

In this paper I create a VRRP virtual group composed of servers which will provide redundant access to a set of common internet services. These hosts are not routing devices. The addresses publicised to clients for service provision are virtual addresses managed by VRRP.

Test environment

Operating system, hardware, network

I use three systems. The ``servers'' are a pair of Pentium class PCs running FreeBSD 4.7-RELEASE-p3. The network ``client'' is a Windows 2000 laptop. The three share a single ethernet on an eight port hub.

The preferred server, clarinet, is 10.10.10.6 and has a priority which will make it the master whenever it is available. The backup server is fridge and uses 10.10.10.2. It has a low priority which means it will become master only when clarinet becomes unavailable. Together they share one virtual group and a number of virtual service addresses. The client, jimbob, is 10.10.10.5.

VRRP network

The simple test network is illustrated above.

Simulation of failure and recovery

I simulate a system failure affecting clarinet by disconnecting its network cable. Recovery is then obvious: reconnect the cable.

VRRP software

freevrrpd

I use the freevrrpd daemon from http://www.bsdshell.net/. FreeBSD users can install it very simply, using the port skeleton in /usr/ports/net/freevrrpd. VERSION!!

freevrrpd is in early development but performed well in the tests. A very useful feature is the ability to execute a command when a host becomes a master and a command when it becomes a slave. This simplifies the task of notifying applications that the available interfaces on a host have changed. I use a shell script in each case so I can perform a number of tasks when there is a failover, and the tasks can be changed without needing to restart freevrrpd. I'll refer to these scripts as event scripts.

Configuration

I use an intentionally simple configuration file, needing only one VRID. I use password authentication to raise the bar for blind spoofing. See the Security considerations section below. The configuration file from fridge is shown below:

[VRID]
serverid = 1
priority = 100
addr = 10.10.10.53/32,10.10.10.80/32,10.10.10.25/32,10.10.10.22/32
masterscript = /usr/local/etc/vrrp/vrid1_master.sh
backupscript = /usr/local/etc/vrrp/vrid1_backup.sh
password = foobingbaz
interface = xl0

The configuration on clarinet differs only in its priority; clarinet uses 250. The host with the higher priority becomes the master.
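The effect of the two priorities can be sketched in a few lines of shell. This only illustrates the rule that the highest priority wins; it is not how freevrrpd implements the election. The function name elect_master and its input format (one "host priority" pair per line) are made up for the example, and it ignores the tie-break on IP address that RFC 2338 specifies for equal priorities.

```sh
# elect_master: read "host priority" pairs on stdin and print the host
# with the highest priority. Hypothetical helper, for illustration only.
elect_master() {
    sort -k2,2nr | head -n 1 | awk '{ print $1 }'
}

# Example: both servers are up, so the higher priority wins.
#   printf 'fridge 100\nclarinet 250\n' | elect_master
```

Feeding it only the fridge line, as if clarinet had dropped off the network, makes it print fridge instead.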

On the surface, the addr line may seem strange. I have always preferred, when possible, to use separate IP addresses for distinct services. It gives me more options to relocate services in the event of problems. freevrrpd supports more than one virtual IP address per VRID so I can combine VRRP with the flexibility of separate addresses. An individual service can be moved to another machine (or VRID) very easily.

I place the event scripts in their own directory. I recommend you give them names that clearly show the VRIDs that use them and whether they are a masterscript or backupscript.

Leg work

Startup scripts

For each application you should have a method of starting it if and only if it is not already running. The first invocation of the start method should do as requested. Subsequent requests should notice the application is already running and just exit successfully. This will greatly simplify your masterscript.
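One way to get that behaviour is a small wrapper around whatever start command the application provides. The sketch below assumes the application writes a pid file; the function name start_once is hypothetical, and the pid file path in the example comment is a guess that will differ between installations.

```sh
# start_once PIDFILE COMMAND [ARGS...]
# Run COMMAND only if PIDFILE does not name a live process.
start_once() {
    pidfile=$1
    shift
    if [ -f "$pidfile" ] && kill -0 "$(cat "$pidfile")" 2>/dev/null; then
        return 0    # already running: nothing to do, report success
    fi
    "$@"            # not running: start it and pass on its exit status
}

# Example (pid file location is an assumption):
#   start_once /var/run/httpd.pid apachectl start
```

Calling it a second time while the application is still running does nothing and still exits successfully, which is what a masterscript wants.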

INADDR_ANY

Generally speaking you're going to encounter problems with programs that want to bind a socket on each interface that is present on the machine. A problem arises on the slave hosts. An application started on a slave cannot bind a socket to a virtual address because the virtual interfaces are not present on the system until it becomes a master. How can you bind on an interface that doesn't yet exist? You can't. Where possible you want your applications to bind to the wildcard, INADDR_ANY (a.k.a 0.0.0.0, ``*'' in the output of netstat). The kernel will take care of getting packets to the application when it becomes a master.
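A quick way to confirm that a daemon really is bound to the wildcard is to look for it in netstat output. The sketch below assumes the BSD style of netstat -an output, where the local address is the fourth column and the wildcard is written as "*.PORT"; other systems format this differently. The function name is hypothetical.

```sh
# listens_on_wildcard PORT
# Read `netstat -an` output on stdin and succeed if a TCP socket is
# listening on the wildcard address for PORT (BSD output format assumed).
listens_on_wildcard() {
    awk -v want="*.$1" '
        $1 ~ /^tcp/ && $4 == want && $NF == "LISTEN" { found = 1 }
        END { exit !found }
    '
}

# Example:
#   netstat -an | listens_on_wildcard 80 && echo "port 80 is on the wildcard"
```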

Who's there?

An administrative requirement that is common to each case examined below is the need to be able to determine which host is home to a virtual address at any given moment. You may just be inquisitive or you may like your system monitoring to flag when a virtual address has moved onto a backup server, or vice versa.

For each of the services tested I've included a straightforward suggestion on how this could be achieved. It's not rocket science so I don't labour the point.

Data consistency

An obvious meta-requirement is keeping members of a virtual group in sync. You need to be sure, for example, that your web servers all have the same copy of your website. The mechanics of that are outside the scope of this paper.

Application installation

I'm assuming that the software used here is installed in a somewhat conventional manner, in line with its documented procedure. I expect that supporting programs such as apachectl, svc or tcpserver are installed and usable. I won't be using explicit paths for binaries. That's up to you and your administrative conventions.

Test results

World wide web

Apache

The Apache HTTP server works in the VRRP configuration, in both 1.3.x and 2.0.x guises, without any hoop jumping. Apache's default socket binding behaviour works perfectly. It will bind to INADDR_ANY by default. You should read up on Apache's Listen and BindAddress directives if you need to change that behaviour.

HTTPS/SSL servers need a little extra care. Your SSL certificates will be validated by clients based on the host name they use to access the site. You should make sure that the same certificate is on both servers and remember to update both when the expiry date comes around.

To determine the name of the master, put a file named whoami in the DocumentRoot of each host. Point a web browser at the virtual server and request /whoami.
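Creating the file is a one-liner on each host. A minimal sketch, with a hypothetical function name and a DocumentRoot path that is only an example:

```sh
# write_whoami DOCROOT
# Record this host's own name in DOCROOT/whoami.
write_whoami() {
    uname -n > "$1/whoami"
}

# Example (the DocumentRoot path is an assumption):
#   write_whoami /usr/local/www/data
```

Run it once on every member of the group. Because each host writes its own name, the file must be excluded from whatever mechanism keeps the web content in sync.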

publicfile

Electronic mail

sendmail

The default for sendmail is to bind a socket to the wildcard interface, which suits our needs perfectly. If for some reason your sendmail does not do this, you can use the DAEMON_OPTIONS statement to force it to. Remember that if you use an MSA you will need to specify this statement twice to apply it to the MTA and MSA listeners.

DAEMON_OPTIONS(`Addr=0.0.0.0')dnl

sendmail will use its idea of the system name in its default greeting:

220 fridge.example.org ESMTP Sendmail 8.12.6/8.12.6; Wed, 5 Mar 2003 19:41:45 GMT

If you dislike how much information it discloses, as I do, you can modify its notion of the system name -- though this affects many things -- or completely replace the 220 response text.

#
# Change the system name ($j macro). This will affect headers,
# masquerading, unqualified sender delivery and a raft of other things.
#
define(`confDOMAIN_NAME', `fridge.example.org')

#
# I prefer to leave that alone, and instead replace the 220 response
#
define(`confSMTP_LOGIN_MSG', `$j ESMTP No UCE')

Note that the first word of a 220 response must be a FQDN, and a sendmail server will parse the second word to determine whether the host talks SMTP or ESMTP.

qmail

tcpserver -vHPR 0 smtp /var/qmail/bin/qmail-smtpd
tcpserver -vHPR 0 pop3 /var/qmail/bin/qmail-pop3d

control/me, control/smtpgreeting

Name resolution

Common observations

Restarting your cache will lose its contents. The nameserver will operate at reduced performance until it again populates its cache with a reasonable working set of data.

A caching resolver needs to send queries to other nameservers in the process of answering a client query. Some people get worked up about whether these queries should have a source address showing the virtual address or the host's real address. Personally I don't think it matters all that much. It probably should be from the real address, but if it's not and a failover causes the new master to receive a reply to a query it didn't send, the world keeps turning.

env/IPSEND, query-source

ISC BIND

Versions 8 and 9 do not differ significantly in the areas that matter for this paper.

BIND will bind a separate socket to each interface on the system. This presents a significant problem. When a host becomes a master, named needs to be told to check for new interfaces so it can attach a socket to the virtual interface address. You have two ways to accomplish this, neither of which is pleasant; choose the lesser of the two evils.

Do not run named as an unprivileged user, use reload: named needs to attach a socket to port 53, which can only be done by UID 0. If you run named as an unprivileged user, it cannot attach to port 53 once it has dropped privileges, so it cannot pick up the virtual address when the master role moves to this host. If you run it as root, a reload will rescan the interfaces and pick up any changes, although BIND 8 will stop answering while it reloads. Alternatively, you could use the interface-interval statement to rescan automatically, but the shortest interval is one minute.
Full restart of named: When named starts up it will bind to each interface, including the now present virtual address. This can take time, especially if you are serving a few hundred zones. While BIND 8 is loading it will not answer any queries, which means downtime. BIND 9 will answer for the zones it has loaded so far. Also, if BIND is a caching resolver you will have lost the contents of your cache, and the nameserver will operate at reduced performance until it populates its cache with a reasonable working set of data. The advantage is that named can run as a non-root user.

Your masterscript should do the necessary.
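For the reload approach, the relevant part of a masterscript might look something like the sketch below. It assumes BIND 9's rndc is installed and configured; BIND 8 would use "ndc reload" instead. The function name and the NAMED_RELOAD variable are placeholders, not anything BIND or freevrrpd defines.

```sh
# Command that makes named reload and notice new interfaces.
# "rndc reload" is the BIND 9 spelling; BIND 8 uses "ndc reload".
NAMED_RELOAD=${NAMED_RELOAD:-"rndc reload"}

# named_pickup_vip: ask named to reload so it sees the virtual address.
# Returns non-zero if the reload command fails, so the caller can react.
named_pickup_vip() {
    $NAMED_RELOAD >/dev/null 2>&1
}

# Example, inside a masterscript:
#   named_pickup_vip || logger -t vrrp "named reload failed"
```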

If your named configuration file already uses a listen-on {}; statement, you should make sure to include the virtual addresses. XXX what errors does named say when specified addrs aren't here.

Server identification is done with a TXT resource record. Add a zone called who.ami to the named configuration file.

zone "who.ami" IN {
	type master;
	notify no;
	file "zones/who.ami";
};

and include in the zone file a TXT record giving the host's name:

$TTL 12H
@	IN SOA	vrrp-ns.example.org. dns.example.org. (
			2003030100	; serial
			3600		; refresh
			1800		; retry
			2592000		; expire
			3600 )		; nxdomain ttl
	IN NS	vrrp-ns.example.org.
	IN TXT	"fridge"

Then query for the TXT RR of who.ami using your favourite DNS query tool. The output from DiG is below.

djbdns

djbdns separates the functions of caching resolver (dnscache), UDP authoritative server (tinydns) and TCP authoritative server (the confusingly named axfrdns) into different programs. I will look at them in two sections.

dnscache

Trickier - use root/servers to force zone to auth. server with appropriate data, or tinydns on localhost (clash with INADDR_ANY?). Client differentiation, with root/servers/who.ami

tinydns / axfrdns

The tinydns data storage mechanism means it does not have to read in zone data when it starts up. It starts, and is ready to answer queries, immediately. Therefore automatically stopping and starting it is quite practical. To start tinydns when transitioning to a master, use:

svc -u /service/tinydns

in the masterscript and

svc -d /service/tinydns

in the backupscript to stop tinydns when becoming a slave.

I used the same identification method with tinydns as with BIND. Add a TXT resource record called who.ami to /service/tinydns/root/data:

.who.ami:10.10.10.53:vrrp-ns.example.org
'who.ami:fridge:300

Use your favourite DNS query tool to query the virtual IP address to see which system is the master:

bash-2.05a$ dig @10.10.10.53 who.ami txt +norec

; <<>> DiG 8.3 <<>> @10.10.10.53 who.ami txt +norec 
; (1 server found)
;; res options: init defnam dnsrch
;; got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 6024
;; flags: qr aa; QUERY: 1, ANSWER: 1, AUTHORITY: 1, ADDITIONAL: 1
;; QUERY SECTION:
;;      who.ami, type = TXT, class = IN

;; ANSWER SECTION:
who.ami.                5M IN TXT       "fridge"

;; AUTHORITY SECTION:
who.ami.                3D IN NS        vrrp-ns.example.org.

;; ADDITIONAL SECTION:
vrrp-ns.example.org.    3D IN A         10.10.10.53

;; Total query time: 2 msec
;; FROM: fridge.botanic.ave to SERVER: 10.10.10.53  10.10.10.53
;; WHEN: Sun Mar  2 15:07:09 2003
;; MSG SIZE  sent: 25  rcvd: 93

Remote access

It's not immediately obvious why anyone would want SSH shared between a number of different machines. Surely, you say, remote access is required to particular hosts, not a virtual host which might be any one of the VRRP group. That said, perhaps you want to SSH to 'the active web server' or similar. It's not my place to judge ... :)

OpenSSH

ListenAddress 0.0.0.0 (the default) will work. Your SSH client will (should) warn you about modified host keys if you connect after a failover has occurred. While this is expected, you should make sure this doesn't condition you out of paying attention to these warnings. Without them, SSH is useless.

The Banner option can make sshd print the contents of a file before prompting the user to log in. Mentioning the system's hostname easily identifies which host you are connecting to. Your sshd_config might contain:

ListenAddress 0.0.0.0
Banner /etc/issue.net

And /etc/issue.net might say:

fridge.example.org - No unauthorised access

Advanced techniques

Load balancing

A crude but effective (and cheap) method of load-balancing is to use multiple DNS A records, announced with a low time-to-live (TTL) owned by a generically named label, such as the following resource record set (RRset):

mail.example.org.	600	IN A	10.20.1.1
			600	IN A	10.20.1.2
			600	IN A	10.20.1.3
			600	IN A	10.20.1.3

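This relies on the common behaviour of DNS servers to either randomise or round-robin the order of the A records. Over time, in the above example, 10.20.1.3 would receive half of the incoming connections while the other hosts would receive a quarter each. Note that this weighting only works if your nameserver really does publish the duplicate record; RFC 2181 tells servers to suppress identical records within an RRset, so check what yours actually returns.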

A serious flaw of such setups is that unresponsive hosts cannot quickly be removed from the service pool. Any change (manual or automatic) to the RRset won't be fully effective until the time specified by the TTL period has elapsed. In the above example, if 10.20.1.3 were to fail, approximately half of new sessions would have to wait for the client to time out and try another A record. Particularly unlucky clients may end up getting the second 10.20.1.3 record and failing again. For this reason DNS round-robin based load-balancing is considered a poor choice, forcing administrators to choose more exotic, and expensive, solutions.

VRRP makes DNS load-balancing a feasible option. The cost of DNS load-balancing is the danger that one of the IP addresses published in the DNS may become unavailable, thereby causing service disruption until it can be removed from the RRset. Using VRRP virtual addresses in A records reduces that danger considerably. By using two VRIDs between two hosts -- each a master for one -- you can use DNS to split the load across both but with confidence that should either fail the other will step in and provide service for the other VRID.

[VRID]
serverid = 50
# master for VRID 50: 10.10.1.50
priority = 250
addr = 10.10.1.50/32
masterscript = /usr/local/etc/vrrp/vrid50_master.sh
backupscript = /usr/local/etc/vrrp/vrid50_backup.sh
password = crashbangbump
interface = xl0

[VRID]
serverid = 55
# slave for VRID 55: 10.10.1.55
priority = 100
addr = 10.10.1.55/32
masterscript = /usr/local/etc/vrrp/vrid55_master.sh
backupscript = /usr/local/etc/vrrp/vrid55_backup.sh
password = whizzflump
interface = xl1

With a little DNS, you can distribute load as one-third, two-thirds without fear of disruption should either server fail using:

mail.example.org.	600	IN A	10.10.1.50
			600	IN A	10.10.1.55
			600	IN A	10.10.1.55

Adding and removing VRRP servers

Introducing a new machine intended to be a master is seamless. Set its priority appropriately and let the VRRP election do the work.

Stopping a master is a little more involved. If you simply disconnect the master, the next preference slave will not promote itself until three times the heartbeat interval has passed. It's certainly not the end of the world, but if you have advance warning there's no excuse for not having a smooth transfer of service.
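To put a number on that delay: RFC 2338 defines the Master_Down_Interval as three advertisement intervals plus a skew time of (256 - priority) / 256 seconds. The helper below is a hypothetical one that just does the arithmetic, and it assumes freevrrpd follows the RFC's timers.

```sh
# master_down_interval ADVERT_SECS PRIORITY
# Print RFC 2338's Master_Down_Interval in seconds:
#   (3 * Advertisement_Interval) + Skew_Time
#   Skew_Time = (256 - Priority) / 256
master_down_interval() {
    awk -v adv="$1" -v prio="$2" \
        'BEGIN { printf "%.3f\n", 3 * adv + (256 - prio) / 256 }'
}
```

With the one-second advertisement interval that RFC 2338 gives as the default, master_down_interval 1 100 prints 3.609, so a slave with a priority of 100 would wait a little over three and a half seconds before taking over.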

Instead, modify the configuration of the slave you wish to become master and set its priority to higher than the current master. Then HUP freevrrpd on the slave. The reconfigured slave will become a master without any service interruption. Once that is completed the original master -- now an inactive slave -- can be brought down without impact.
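Editing the priority by hand is fine, but the change is simple enough to script. The sketch below assumes the configuration format shown earlier, with the serverid line appearing before the priority line in each [VRID] section. The function name is hypothetical, and it writes the modified configuration to standard output rather than touching the file.

```sh
# set_vrid_priority CONF VRID PRIORITY
# Print CONF with the priority of the section whose serverid is VRID
# replaced by PRIORITY. All other lines pass through unchanged.
set_vrid_priority() {
    awk -v vrid="$2" -v prio="$3" '
        /^\[VRID\]/        { current = "" }          # new section starts
        /^serverid[ \t]*=/ { split($0, kv, "=")
                             gsub(/[ \t]/, "", kv[2])
                             current = kv[2] }        # remember its VRID
        /^priority[ \t]*=/ && current == vrid {
                             print "priority = " prio
                             next }
        { print }
    ' "$1"
}
```

Redirect the output to a temporary file, move it into place over the real configuration, and then HUP freevrrpd as described above.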

Preventing failback after a failover

In some circumstances you may not want a master which experiences a problem, fails and subsequently recovers to resume responsibility as the VRID master. Perhaps you insist on a formal investigation of the cause before it resumes service, or you want to prevent flapping -- repetitive failures and recoveries in quick succession which would cause significant disruption to remote clients as the VRID moves to and fro.

Modify the freevrrpd.conf in a masterscript

Pitfalls

Be very, very wary of using the highest priority, 255. A host with a priority of 255 cannot be demoted to a slave. Such inflexibility works against the whole point of deploying a VRRP group. Refer to the Adding and removing VRRP servers section for why this is such a bad idea.

Any output from the event scripts is discarded. They cannot tell you what has gone wrong unless they write their own logs or submit messages to syslog, so make sure they do one or the other.
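One low-effort way to keep that output is to pipe every command in an event script through logger(1). A minimal sketch follows; the syslog tag and priority are placeholders, and the VRRP_LOGGER variable and run_logged function are hypothetical names.

```sh
# Where event script output goes. Tag and priority are placeholders.
VRRP_LOGGER=${VRRP_LOGGER:-"logger -t vrrp-event -p daemon.notice"}

# run_logged COMMAND [ARGS...]
# Run a command and send everything it prints, stdout and stderr alike,
# to syslog. The exit status is the logger's, not the command's, so a
# failing command shows up in the log rather than in $?.
run_logged() {
    "$@" 2>&1 | $VRRP_LOGGER
}

# Example, inside a masterscript:
#   run_logged apachectl start
```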

Use one VRID per physical interface. VRRP moves not only IP addresses but a computed ethernet MAC address among hosts. The computation is a function of the interface's VRID. freevrrpd will recompute the VRRP MAC as it reads additional VRIDs from its configuration, which will cause it to apply in rapid succession a number of different MAC addresses to an interface. Should a client send an ARP query for the MAC of a VRRP IP address while freevrrpd is setting up multiple VRIDs, the client may receive a reply including an intermediate MAC instead of the final MAC. That client will be unable to communicate with the virtual address until its ARP cache entry times out.
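The computation itself is simple. RFC 2338 fixes the virtual router MAC address as 00-00-5E-00-01-{VRID}, with the VRID as the final octet, so you can work out in advance which MAC a client should see for a given virtual address. The function name below is hypothetical.

```sh
# vrrp_mac VRID
# Print the RFC 2338 virtual router MAC address for a VRID (1-255).
vrrp_mac() {
    printf '00:00:5e:00:01:%02x\n' "$1"
}
```

For example, vrrp_mac 50 prints 00:00:5e:00:01:32 and vrrp_mac 55 prints 00:00:5e:00:01:37. Comparing these against a client's ARP table is a quick way to spot a stale or intermediate entry.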

Security considerations

Common sense dictates that you restrict physical access to the network segment you use for your server LAN. Someone who can place VRRP announcements onto your server segment could announce their system with a high enough priority to acquire master status for your VRIDs. You can use password authentication to raise the bar for blind spoofing. That said, anybody starting a rogue VRRP daemon on your ethernet can likely sniff traffic and will quickly pick up your password. Use a separate, dedicated ethernet segment for your important systems.

It should go without saying, but I'll say it anyway: none of this, in any way, removes the responsibility to keep your system and applications patched and current with respect to security vulnerabilities. This also applies to freevrrpd itself.

The freevrrpd daemon needs a bpf device. Without it, it cannot run. Personally I'm not concerned by the presence of bpf devices but some security recipes recommend removing them. If you have followed such instructions you will need to add a pseudo-device statement to your kernel configuration and rebuild. The GENERIC kernel already includes the following:

# The `bpf' pseudo-device enables the Berkeley Packet Filter.
# Be aware of the administrative consequences of enabling this!
pseudo-device	bpf		#Berkeley packet filter

which will work fine.

Conclusion

Yeah, a conclusion. Why not. Writing one sounds like a good idea.

Notes

RFC 2338: Virtual Router Redundancy Protocol
IETF VRRP Working Group