Enterprise Backup and Recovery
Volume Number: 21 (2005)
Issue Number: 9
Column Tag: Programming
Business Continuity: Enterprise Backup and Recovery
Practices for Streamlining Large-scale Backup and Recovery Implementations
by R. Carvel Baus
Introduction
When your boss asks you to sit down and says "We would like you to help us with something," many
questions go through your mind including how long it may take you to find another job. Generally
speaking, employees are not asked to do something; they are given a task and expected to see it
through. So when I was asked to take over the Enterprise Backup System for the administrative
network, it looked like a great opportunity to gain more experience in an already familiar area -
but a little voice in my head kept whispering, "why are they asking you?" Part of the reason they
asked was the nature of the work. It was an additional assignment on top of what I was already
doing, so they wanted to make sure I was up for the task, which I was, despite my little voice.
(Note: When one's little voice awakens, stop everything you are doing and read RFC 1925 - this is
key to one's survival as an administrator of network-attached gear - see
http://www.ietf.org/rfc/rfc1925.txt.) I had experience in data backup and recovery, including
writing a custom Linux-based backup solution for a small company, but nothing on this scale: 3,000
users across multiple sites, 40 servers, a 420-slot, six-drive tape library, and several vaults for
long-term storage. Having neglected to read RFC 1925, my innocent naivety turned to foolish
enthusiasm.
I was handed the torch and off I went running, with a backup system that was in what you might
call a "less than peak" operational state. It was like getting your first car - you think it is the
greatest thing in the world because it is your first real car (no more bikes or go-karts for you
because you have your license to drive.) Wisdom begins to set in over time and you realize that
multiple shades of rust and the smell of burning oil are not as cool as you once thought, and in
fact, are indicative of much larger yet to be seen issues. So here I am with my first, uh, car.
Backups would start in the afternoon, dragging well into the next day. The window would often surpass
15 hours ensuring a call from the network engineer, demanding to know why the backup server (just
another machine to him) was consuming so much bandwidth while users were on the network (think
burning oil.) It seems the admin before me was not attentive to the system's day-to-day needs. He
basically made sure there were tapes in the jukebox whenever it needed feeding and beyond that he
was conspicuously never around - perhaps to avoid those calls...and the smell of...uh...burning oil. When
clients were added to the schedule, they were just thrown into a group without thought to how much
data there was or where on the network the client resided. Hey, if three machines could be backed up
in the mid-afternoon, why not three more? But wait! There's more! To make sure we don't miss any
important data, let's tell the system to back up the whole machine: the OS, all directories, and for
good measure, any network-attached drives as well. My little voice was no longer whispering
- it was now having a good belly laugh on the floor.
Here I was with my heaping pile of smoking, rusty metal, limping its way along our private
information highway with no exit in sight. So what is one to do except pull over, break out a can of
WD-40, a pair of vise grips, and the all-important roll of duct tape. After several weeks of
examining the schedules, backup logs and data on the 40 servers, while dodging phone calls from the
network engineer, I was able to reduce the window to 6.5 hours. It now started at 10pm and ran until
4:30am during the workweek. We have people who come in at 'O-dark-thirty' (5:30am), so backups
running beyond 5am were not acceptable. I also wanted some time for the occasional unexpected
overrun. On the weekends our backups are considerably larger and run until they are done - usually
12-18 hours.
In this article I want to share the things I learned that helped cut the backup window and tape
consumption in half, making the enterprise backup and recovery system more efficient and
cost-effective. I will also discuss some aspects of how to be more organized as an administrator of
an enterprise-level system. The suggestions I provide certainly aren't the only solutions, but I
have found that they work well when done consistently. Some things will apply to smaller systems and
some won't. Either way they are good to think about as a system grows.
The first topic I go into (after a few terms) is the development of a Backup and Recovery System
from inception to a higher level of maturity. I give an overview of how one might perceive the
system as it grows and subsequently develop the wrong mindset towards backup as a result. It may
seem elementary at times but please read through it as the suggestions I make later on are built
from the thought process I explain in the beginning.
Terms and Conditions
Well, no conditions really, just terms. I want to go over a few of the words and acronyms I use
throughout the article so we are on the same page (no pun intended.)
Backup Server - I will often refer to this as simply "the Server." This is the machine
that contains the backup software and orchestrates the tasks necessary to copy the data from
machines on the network to a storage medium. For simplicity, assume that this includes any
additional secondary servers used to back up the network (a storage node in Legato Networker terms.)
For the purpose of this article all such machines are collectively referred to as a Backup Server.
Backup Strategy - The methods and practices employed by the system administrator to
ensure that the data can be recovered.
Backup Solution - Consists of two main parts: 1) The Backup Strategy, and 2) the
hardware needed to backup the data. The hardware consists of the backup server, network hardware,
clients, etc.
Client - In the context of backing up, it is a machine containing data that, in the
case of a loss, would need to be recovered. This could be an individual's personal machine or a
network server. As far as the Backup Server is concerned, any machine that it must back up is a
client.
BRS - Backup and Recovery System. The Backup Server and any direct or network attached
media (tape jukebox, disk array, JBOD, etc.) used to hold the data that has been backed up. This
does not include any client machines.
NIC - Network Interface Card (in case you weren't aware.) Typically Ethernet but may
also be fiber in higher-end machines.
Silent Failure - When there is no obvious indication that a client machine has not
successfully been backed up within the backup window. My personal favorite silent failures are the
ones that appear in an error log...somewhere...as a one-line unintelligible error not associated with
anything that actually failed.
Backup Window - The length of time available in one 24-hour period (or between the
close of one business day and the open of the next business day) to back up all clients on the network
without impacting users. Typically it will start late in the evening and end in the early morning
before users start another productive day of surfing. I will refer to this as simply "the window."
Saveset - The set of files for a given client that are copied during a backup session.
In some software, one can define multiple savesets per client for the session. Unless otherwise
stated we are talking about one saveset per client.
The Nature of the Beast
Backup and recovery are two interrelated symbiotic animals that are completely different. Chew on
that for a second - I had to. So what did I just say? Backups exist because of the need for recovery
and recovery cannot exist without the backup having been executed properly. One more time, please.
Backups exist only because we want to recover, not because we want to backup. Because of cost, no
one backs up the network just to back up the network - it is rooted in the fact that at some later
point in time, you may need to recover the data. So when the loss of data outweighs the expense of a
backup solution, the solution will be implemented. When I say backup solution, I am not referring to
just the machine that runs software to backup clients. I am also talking about a backup strategy
that is employed to ensure the ability to restore. An effective backup solution encompasses two main
parts: 1) the backup server, network hardware, and clients, and 2) the strategy developed by the
administrator to ensure that they can restore the moment they need to.
Figure 1: Simple Backup Solution
At first a backup solution is small (Fig. 1.) For one or two computers, it may be something as
simple as an external hard drive attached to each machine. The data gets copied either by the press
of a button or a scheduled task that runs once daily. Voila! You have backup. At that point it is
easy to relax and think, "I backup therefore I am safe." In a system like this one, it is a
reasonably valid assumption to make because there are few things that could go wrong and the backup
can be verified with little difficulty. In addition, the likelihood of losing your data
simultaneously on the computer and external drive is low. Your system is small. You have your backup
- you can restore. Life is good.
As time goes on two things happen. The first is that your data outgrows the capacity of the
backup server (an external hard drive in this example.) More machines get added, as well as the
network hardware to connect them. The business is more established and the data has greater value. A
single external drive per machine is no longer an adequate or efficient means to protect your data.
So now a backup server is added with either an external disk array or most likely a tape drive or
jukebox. You now have the capacity to back up the greater amount of data. The second thing that
happens, and what I believe is most overlooked, is that now, the backup strategy can no longer
handle the other half of the backup solution. What do I mean by this? As I mentioned, the solution
consists of two main parts, the network (all that hardware) and the strategy. The solution must not
only manage the amount of data we need to restore, but it must also be able to account for the
complexity of the network. Let me explain.
When there was just one machine and a single external drive, the network hardware was for all
practical purposes the USB cable connecting the two devices. The chances of it failing were slim to
none so we didn't specifically account for it in our backup solution - we just knew it would work.
When the number of clients increased there was more equipment added out of necessity: an Ethernet
backbone, a router or switch, a backup server, etc. Without thinking about it, at least from the
position of restoring the data, we significantly increased the number of places a backup failure
could happen. Having not thought about that, neither did we consider how to handle the additional
potential for failure in our backup strategy. To be fair, a backup solution is seldom going to
handle a major network failure autonomously. Something that large, though, isn't a concern because
human intervention is required regardless. What the solution needs to account for are little
failures that are never seen until you go to recover and can't. Call them gremlins, ghosts in the
machine, whatever. I call them silent failures and they are perhaps the greatest reason why recovery
fails.
Recovery is simple - find the tape (media) that has the data you want and restore it. Backup,
that other interrelated beast that restore depends upon, is far more complex in that it requires a
lot more time, energy, and thought to ensure that the one time you need to restore, you can do so
and do so quickly and reliably - this is the backup strategy. The nature of the beast is that while
the restore is the Holy Grail, the Quest for it resides in the backup strategy. Much time and energy
should be expended in your quest so that when you need to restore, you can.
Pre-empting Failures in Larger Networks
Whenever a network increases in size or there is a fundamental change in network structure, there
is also an increase in the possibility of a failure. A failure is anything that prevents the backup
of all clients from completing successfully during the backup window. If a backup extends outside of
the window and completes, it is still considered a failure even though you can recover the data.
Why? The whole goal is to get it all done inside the designated window - that too is part of a good
backup strategy. Here are some things to look for that will help prevent failures from occurring or
help you quickly identify and correct them when they do.
NIC and the Network
Whenever a client is added to the backup system, make sure the NIC and the network path can
handle the throughput of the backup. A good tape drive (SDLT, LTO2, LTO3) can clock in anywhere from
30-80MB/s (megabytes/second) and you want it running at its fastest speed - it means your system
will give you the smallest window it is capable of. Let's say you are running 100Base-TX (Fast
Ethernet.) You will have a theoretical maximum throughput over the network of 12.5MB/s. Let's be
optimistic and assume a 10MB/s constant rate (assuming no other bottlenecks in the network.) Even
then, for tape drives that run on the slower side, it would only be running at a third of its rated
speed. If you are using a slower tape drive, then 100Base-TX would be suitable, but for the faster
drives, the clients should be on a Gigabit Ethernet (1000Base-T) network or fiber. If the NIC or the
network cannot keep up with the BRS it can break your window. For a small saveset, it is not as big
an issue, but it could become one as the data grows. For backing up large storage arrays or JBODs
you will need to make sure your NIC and the network path from the client to the BRS have the
throughput to keep the tape drive running at its rated speed.
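To put some numbers on it, here is a rough sketch (in Perl, since that is what I script in) that compares how long a saveset takes at a sustained network rate versus the drive's rated speed. The saveset size and rates below are made-up values for illustration, not measurements from any particular system; plug in your own.

#!/usr/bin/perl
# Rough backup-window estimate: how long does a saveset take at a given
# sustained network rate versus the tape drive's rated speed? All values
# below are illustrative assumptions.
use strict;
use warnings;

my $saveset_gb   = 200;   # amount of data to back up, in GB (assumed)
my $network_mbps = 10;    # sustained network rate, MB/s (optimistic 100Base-TX)
my $tape_mbps    = 30;    # rated speed of a slower SDLT/LTO drive, MB/s

my $saveset_mb = $saveset_gb * 1024;

# The effective rate is whichever link is slower: network or drive.
my $effective = $network_mbps < $tape_mbps ? $network_mbps : $tape_mbps;

printf "Effective rate      : %.1f MB/s\n", $effective;
printf "Time over network   : %.1f hours\n", $saveset_mb / $network_mbps / 3600;
printf "Time at drive speed : %.1f hours\n", $saveset_mb / $tape_mbps / 3600;

Even this crude arithmetic makes the point: 200GB at 10MB/s is nearly six hours of window, while the same data at the drive's rated speed would take a third of that.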
Tape Drives
If the tape drive cannot keep up with the amount of data you are sending, you need to get a faster
tape drive. If your software can handle multiple tape drives, just add another one and configure the
software appropriately (some packages require additional licensing for additional tape drives.) If
your software can only write to one at a time then a faster one is going to be necessary or consider
a D2D (Disk To Disk) solution. So how do you tell if the tape drive is keeping up? In the case of
programs like Legato and NetBackup, the GUI shows how much is getting written by the drive in real
time. If your software doesn't have that capability then there is a simple way to figure it out.
When the network is idle, copy a large file (100+MB) from the client to the backup server using
either an ftp or scp (secure copy) client. When it is done, either will tell you how long it took
and give you a data rate. Do this for any client you choose. Compare your results against the
published speed of your tape drive and upgrade the drive if necessary.
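Here is a minimal sketch of that timing test, wrapping scp and reporting the effective rate. The client hostname and test file path are placeholders, not real systems; substitute a real client and a 100+MB file that lives on it.

#!/usr/bin/perl
# Time an scp copy of a large file from a client and report the effective
# transfer rate in MB/s. Host and path are hypothetical placeholders.
use strict;
use warnings;
use Time::HiRes qw(time);

my $client = 'client.example.com';      # hypothetical client hostname
my $file   = '/var/tmp/bigfile.dat';    # hypothetical 100+ MB test file

my $start = time();
system('scp', '-q', "$client:$file", '/tmp/') == 0
    or die "scp failed: $?";
my $elapsed = time() - $start;

my ($basename) = $file =~ m{([^/]+)$};
my $bytes      = -s "/tmp/$basename";

printf "Copied %.1f MB in %.1f s = %.1f MB/s\n",
    $bytes / 2**20, $elapsed, $bytes / 2**20 / $elapsed;

Run it when the network is idle, as described above, and compare the result against the drive's published speed.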
The Incredible Vanishing Client
On more than one occasion I have had a client, for various reasons, not finish its backup. If I
look at the overview window of the backup software everything looks good - I don't see any errors
and there are no savesets pending for any client. The system is idle as I would expect. The only
reason I know a client has hung is because I haven't received an email from the system stating that
all clients have finished. My system is configured to email me every morning when the backup is
complete (I'll go into this a little more later.) If I don't see that email, I know something is up.
This is what I call a silent failure. I wouldn't know if I didn't receive that email and I receive
that email because of a custom Perl script that interacts with my backup software. The greatest
concern here is that data that should have been backed up did not get backed up. The client usually needs a reboot and
everything is fine. (Note: In my experience, this mysterious type of failure has only occurred on
clients running the Windows OS. I have never seen this error on Unix, Linux, or BSD-based machines.)
I have had those machines fail on me too, but it is usually a hard, very apparent failure in that I can't
communicate with the client now when I could the day before.
Emergency Backups
From time to time I get a request for "emergency" or last minute backups that need to get added
to the backup session that night - these are for clients I already backup. I won't add a new machine
without knowing the impact and configuring its backup appropriately. While I am quick to
accommodate, the first time I did this I shattered my backup window and those pesky network
engineers let their fingers do the walking. Admittedly, it was my own fault. I point it out because
there are a few things to consider so you don't break the window. Know how much data will need to be
backed up and how fast you backup that particular client. You can probably get both by examining
your logs. If it is a new saveset, make sure you get an exact size as reported by the OS it resides
on. When I asked the first time, I was told it was a few files, not much, and I was OK with that, so I
added the client's saveset for that night and went home. What I wasn't told was that those few files
were database dumps from a rather large database. Ask specific questions, get specific answers and
configure the backup accordingly. If you absolutely can't do it without breaking the window then
break it. Just let the people who need to know know that the network will be slow the next morning until it
is done.
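A quick back-of-the-envelope check like the following sketch would have saved me that night. The size, the historical rate, and the slack left in the window are all assumptions here; pull the real numbers from your logs before saying yes.

#!/usr/bin/perl
# Sanity check before accepting an "emergency" saveset: estimate how long
# it will take at the rate this client has historically backed up at, and
# compare against the slack left in the window. All numbers are assumed.
use strict;
use warnings;

my $extra_gb    = 120;   # reported size of the emergency data (assumed)
my $client_mbps = 8;     # historical backup rate for this client, MB/s (from logs)
my $slack_hours = 1.5;   # unused time left in tonight's window (assumed)

my $est_hours = $extra_gb * 1024 / $client_mbps / 3600;

printf "Estimated extra time: %.1f hours (window slack: %.1f hours)\n",
    $est_hours, $slack_hours;
print $est_hours > $slack_hours
    ? "This will break the window - warn the people who need to know.\n"
    : "Should fit - add it to tonight's session.\n";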
The Network Pipe
I alluded to this particular problem in "NIC and the Network" but I wanted to go into a little
more detail here on the network itself. When backing up multiple clients at once, their data enters
the network where they connect and heads toward the destination, in this case the backup server
(Figure 2.) Along the way the data is going to converge at various points increasing the amount of
data downstream from the point of convergence.
Figure 2. Convergence of multiple-client data.
If at one of these points the downstream pipe cannot handle the amount of data coming together,
it is going to take longer to backup those clients than if the network pipe was big enough. The one
obvious solution is get a faster network, but this is seldom practical except during a major network
upgrade. The other way to handle this is to work with the backup schedule as a big puzzle and figure
out the best way the pieces fit together. This is something to do especially when adding new clients
to a backup rotation. Let's take a look.
If the backup server sits idle in the middle of the window, which can happen over time as
clients are added here and there, some rearranging can be done to squeeze more backup into your
current window. We want to look at the time between when one group of clients ends and another
begins. Looking at a trend across a couple of weeks is the best way to determine a consistent gap,
rather than one or two nights where the gap was larger than usual. The group that runs before
that gap has extra time to run longer. A client can be added here or the
following group can be started sooner and a client added there. It is also possible to move clients
around among groups to more effectively use the time. The specifics of doing this are dependent on
the system, but the principle is the same regardless of the implementation.
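As a simple illustration, a script along these lines can report the idle gaps once you have pulled each group's start and end times out of your logs. The group names and the schedule below are made up.

#!/usr/bin/perl
# Report the idle gaps between backup groups. Times are minutes after the
# window opens (10pm in my case); the schedule below is an assumed example.
use strict;
use warnings;

# [ group name, start minute, end minute ]
my @groups = (
    [ 'fileservers', 0,   150 ],
    [ 'databases',   180, 300 ],
    [ 'desktops',    330, 390 ],
);

for my $i (0 .. $#groups - 1) {
    my ($name, $end)   = @{ $groups[$i] }[0, 2];
    my ($next, $start) = @{ $groups[$i + 1] }[0, 1];
    my $gap = $start - $end;
    printf "%-12s ends at %3d min; %-12s starts at %3d min; gap %d min\n",
        $name, $end, $next, $start, $gap if $gap > 0;
}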
Logs, Logs, and more Logs...
Keeping a watchful eye on the backup system's logs is one of the best ways to ensure there are no
mysterious failures - it is also one of the most time consuming. Examining the logs every morning
will alert you to any problems that occurred the night before. There is an easier way but it will
take an investment of time up front. Create a script, using a language such as Perl, shell script,
or whatever you choose, to parse the backup logs every morning and notify you if there was a
problem. Log formats differ between programs, so giving any specific detail on how to do this would be
very difficult. It is a great exercise in improving scripting skills as well as making life as an
administrator much easier.
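As a starting point, here is a bare-bones sketch of that kind of morning report. The log path, the message formats it matches, and the client list are all assumptions - every package logs differently, so the patterns will need to be adapted to whatever your software writes.

#!/usr/bin/perl
# Morning report sketch: scan last night's backup log for errors and for
# clients that never reported a completed saveset, then mail a summary.
# Log location, message formats, and client names are hypothetical.
use strict;
use warnings;

my $log     = '/nsr/logs/daemon.log';          # hypothetical log file
my @clients = qw(fs01 db01 mail01 web01);      # clients expected every night
my $admin   = 'backup-admin@example.com';

my (%done, @errors);
open my $fh, '<', $log or die "Cannot open $log: $!";
while (<$fh>) {
    $done{$1} = 1    if /saveset for (\S+) completed/i;   # assumed format
    push @errors, $_ if /\b(error|failed)\b/i;
}
close $fh;

my @missing = grep { !$done{$_} } @clients;

my $body = (@missing || @errors)
    ? "Missing clients:\n" . join("\n", @missing)
      . "\n\nErrors:\n" . join('', @errors)
    : "All clients completed successfully.\n";

# Hand the report to the local mailer; sendmail's path may differ on your system.
open my $mail, '|-', '/usr/sbin/sendmail -t' or die "Cannot run sendmail: $!";
print $mail "To: $admin\nSubject: Nightly backup report\n\n$body";
close $mail;

The "missing clients" half is what catches the silent failures described earlier; the error scan catches the noisy ones.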
Processes for Managing Large Implementations
Power to the Masses
In smaller companies the network, servers, and backup are probably all managed by one group. In
larger companies the backup is most likely still centralized while servers can be in different
departments that are managed by the department administrator. This section is geared more towards
the latter but in some cases will work for smaller companies also.
It is a good practice to place as much responsibility on the client's administrator as possible.
The less time the backup administrator has to spend installing client software and restoring data
the more time they will have to ensure that the restores will be successful. Here are a few
suggestions to lighten the backup administrator's load:
When a new client needs to be added to the schedule, provide an image of the client software (via
the web), email the client's admin with the necessary configuration information, and then have them
call if they have any problems. The backup software on the server should be able to tell you if the
client was correctly configured. I find it helpful to keep a close eye on newly added clients for a
week or so, just to make sure there are no hiccups.
Configure the backup server so backups cannot be started from the client. A rogue administrator
can suddenly decide he needs to back up a newly installed database (think large files, lots of tape,
very bad) and initiate it from the client. (Shortly after I inherited my system, I had this happen -
it used up about $500 in tape.) Also, if an "unauthorized" backup is executed during the backup
window, there is a chance the nightly backup will extend into the next day. Maintaining control over
the backup process will prevent unnecessary problems.
When the time comes to restore something, provide a short tutorial in recovery (verbal or
written) and let them recover from the client. The tutorial should cover things such as finding the
correct point in time to recover from, overwriting versus renaming files, and anything else that may
be confusing or could cause more harm than good. I believe there is a sense of empowerment when
someone can restore what was deleted - especially if they deleted it. If a needed volume is not
presently online (i.e., in the jukebox or tape drive), you will need to help, of course. Barring
something significant, there should not be a need to get involved.
On one occasion, I had a rather new Unix admin accidentally overwrite the root ( / ) partition on
his Solaris server (NEVER copy anything to /* on a Unix box.) His machine subsequently ground to a
halt and another administrator rebuilt the partitions for him, re-installed the OS and installed the
client software. I received a call to restore a minimal, very specific set of files (accounts,
groups, passwd file, host file, etc.) He finished recovering the rest of the data himself.
Part of a good backup strategy is placing yourself in a position where you have time to study and
enhance your Backup and Recovery System. If it is not absolutely something you have to do, let the
client administrators do it. The more hands-off approach may not work for everyone, though,
especially in smaller companies.
Organizing Backups on Tape
Data can be organized just about any way imaginable. I have found two characteristics particularly
useful for organizing backup media, and choosing between them depends upon the needs of the company.
Those two characteristics are association and retention time. An association is a commonality by
which to organize the data such as the department it belongs to or the type of data it is
(financial, database, operating system, email, etc.) Retention time is simply how long the data
needs to be available for restoration.
If tape consumption is a concern (i.e. squeezing a tight budget) then organizing client data by
retention time may be the best choice. Let's say you have data with 3 different retention times: 30
days, 6 months, and 3 years. If data from all three times is placed on a single volume, then it will
be 3 years before that volume can be recycled even though after 6 months there is probably a lot of
free space on it. If however, you place data that is good for only 30 days on one volume, 6 months
on another volume, and 3 years on yet another, then after 30 days, at least one volume will be
available for reuse. What if there are 10 volumes worth of 30-day data? Then there will be 10
volumes per month that don't have to be bought - they can be reused for their effective life or
until it is deemed they are no longer trustworthy. Assuming 10 SDLT or LTO tapes each month at a
per-unit cost of $50, a $500/month savings is realized on media by changing the way data is
organized.
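A toy model makes the point: a volume can only be recycled once everything on it has expired, so mixing retention classes ties the volume up for the longest one present. The retention mix below is just the example from above.

#!/usr/bin/perl
# Toy model of when a volume becomes reusable under mixed versus dedicated
# retention classes. The retention periods are the illustrative ones above.
use strict;
use warnings;
use List::Util qw(max);

my @retentions_days = (30, 180, 1095);   # 30 days, 6 months, 3 years

# One volume holding all three retention classes:
printf "Mixed volume reusable after %d days\n", max(@retentions_days);

# One volume per retention class:
printf "Dedicated %4d-day volume reusable after %d days\n", $_, $_
    for @retentions_days;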
In some cases it is better to store data by association. For example, if a department needed to
keep its data offsite while another didn't, you would only want the data belonging to that
department on the volumes stored offsite. The fewer volumes you need to store offsite, the less
expensive it will be, so it makes sense in that respect.
Tools of the Trade
Enterprise Backup Software
There are many different programs to choose from to manage the network backup. I would like to
briefly discuss three of the larger ones, clearly aimed at and capable of Enterprise Data
Protection, and then talk about one aspect I think is very important in deciding.
Legato Networker, Veritas (Symantec) NetBackup, and NetVault by BakBone are three big players in
the area of backup and recovery. All three offer support for a majority of the client operating
systems currently in the market. All three offer similar features and abilities across the board.
The one major difference I found is that NetVault is the only one (as of this writing) that supports
OS X as a server and not just a client. If the network is 100% Mac, then of these three, NetVault is
the only viable Enterprise solution. If more options are necessary in an all-Mac company, then a
Linux server may be an option to consider.
The one aspect I think is worth considering is the ability to script custom solutions that work
with the enterprise software. I am most familiar with Legato since that is what I use. When certain
events happen, I have custom-written Perl scripts that get executed. Depending upon what event just
occurred the scripts do different things. One script in particular will examine the log files and
determine if any client failed during the window. I get an email with a list of clients that had a
problem or an email stating everything completed successfully.
Regardless of which package you choose, consider how scripting custom solutions can really tailor
the BRS for your company's needs.
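If your package can run an external command when an event occurs, as mine does, a small dispatcher keeps the per-event logic in one place. The event names, the handler actions, and the report-script path below are made up for illustration; wire them to whatever notification mechanism your software actually provides.

#!/usr/bin/perl
# Sketch of a dispatcher the backup software could invoke with an event
# name as its first argument. Event names and handlers are hypothetical.
use strict;
use warnings;

my %handlers = (
    'session-complete' => sub { system('/usr/local/sbin/backup-report.pl') },  # hypothetical script
    'tape-request'     => sub { print "Jukebox needs a writable volume\n" },
    'device-error'     => sub { print "Drive fault - page the admin\n" },
);

my $event = @ARGV ? $ARGV[0] : 'unknown';
($handlers{$event} || sub { warn "Unhandled event: $event\n" })->();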
Keeping Safe Backups
Even with data safely stored on another storage medium, there is still a risk that some event such
as a fire, storm, natural disaster, or industrial accident could destroy both copies of the data.
Here are some ideas on how to add another layer of protection.
Media Safes
Media safes are good protection from the damage of fire and humidity. Safes designed specifically
for media are more expensive than their equivalent document counterparts because media is more
susceptible to damage at lower temperatures. I would consider a media safe or equivalent storage a
bare minimum in protecting data. When choosing a safe, make sure it has the approval of Underwriters
Laboratories. UL Standard 72, "Tests for Fire Resistance of Record Protection Equipment," discusses the
scope of testing performed on these devices and what is required for them to receive the approval.
More information on UL standards can be found at www.ulstandardsinfonet.ul.com.
Off-site Storage
Another alternative is to contract a vendor that specializes in the storage of backup media. I
have no personal experience with this type of situation so I can't give specifics on the pros and
cons. Recent news coverage of a couple of large companies whose sensitive data was misplaced by the
courier all but mandates that, if you choose this option, you make sure your data is encrypted.
Mirrored Storage
Having backups mirrored at an alternative location is a very good way to ensure that you can
recover when you need to. Many storage companies have software that automates this process and makes
sure the two sites stay in sync. Apart from the expense of the secondary site, the connection
between the two also needs to be considered. It must be large enough to move the amount of data
necessary.
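The sizing arithmetic is simple enough to sketch: divide the nightly change by the time available to replicate it. The nightly delta and window below are assumptions for illustration only.

#!/usr/bin/perl
# Back-of-the-envelope link sizing for a mirrored site: how fast must the
# connection be to replicate the nightly change within the window? The
# delta and window are illustrative assumptions.
use strict;
use warnings;

my $nightly_gb   = 150;   # data changed per night, GB (assumed)
my $window_hours = 6.5;   # time available to sync (assumed)

my $mb_per_sec = $nightly_gb * 1024 / ($window_hours * 3600);
printf "Need roughly %.1f MB/s (about %.0f Mbit/s) of sustained throughput\n",
    $mb_per_sec, $mb_per_sec * 8;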
Summary
In the Quest for the Holy Grail of Restoration, time and energy are well spent in making sure the
backups are successful. With a good backup strategy and some customized automation, the time
required by the administrator can be well managed and failed restores can be a vision of things
past.
Carvel Baus has been an avid Mac enthusiast since his first Mac programming course in
the early '90s. He has spent his career programming and administering Unix and Macintosh systems. He
currently works for CSSI, Inc., a technical and engineering services company specializing in systems
analysis and engineering, airspace analysis and modeling, and information and project management.
You can reach him at cbaus@cssiinc.com.