Chapter 21

Disaster Planning for Networks

Disaster planning is often one of those things that gets talked about, planned, and then never accomplished. Try questioning a group of network administrators about disaster planning. Ask them, "How many of you have disaster plans in place?" and you'll see some hands up. Ask, "How many of you have written disaster plans in place?" and many of those hands will go down. Ask, "How many of you have actually tested those plans and revised the parts that didn't work?" and most hands will go down. It's also a good bet that at least one of the people whose hand is still up is lying.

No one likes to think about the possibility of the network or the file server going down in flames, but it's better to dwell on this depressing possibility beforehand than to explain afterward to the boss why you weren't prepared for data loss. Therefore, in this chapter, we'll discuss the important elements of any disaster plan, who should be involved in its preparation, and how you can prepare for various disasters. This chapter can't write your plan for you, but it should make your job easier.

The Nature of Disaster

"Disaster" is such a dramatic word that it might sound overdone. How likely is it that disaster could touch your company? If you think of disaster only in terms of natural disasters or massive hardware failures, then disaster might not seem very close at hand. If, however, you consider disaster in terms of anything that could stop the company from functioning for an undetermined length of time, then the notion of disaster is easier to accept.

Although this book is ostensibly about networking, the disasters that could slow or stop your company are not limited to network or data-related problems or even to events that happen at your site.

Broadly speaking, three categories of disaster exist: events, breakdowns, and behavior.

These categories are not mutually exclusive. A disaster that apparently fits into one category could be slotted into another category because of its cause. For example, the slant on a power outage changes based on whether the cause is sabotage at the power company, an electrical storm, or a failed switch at your building. For the purpose of recovering from a problem, however, it really doesn't matter what caused it.

Events

Downed power lines, broken water mains, fire... Natural disasters don't have to be as dramatic as earthquakes swallowing your company headquarters. Hurricanes threaten coastal regions of the southeastern United States almost every year. If your office shares a building or office park with others, a fire starting in a neighboring company can threaten your operation. Even a broken water main can render your office unusable and possibly destroy data.

Natural disasters don't always directly touch your office, but their effects may be felt there. An area power outage can render your office unusable, which is a nuisance even if every server is equipped with a UPS and can perform an automated, orderly shutdown. Even if you don't experience any data loss, how can your office work if a bolt of lightning blows the power out?

Breakdown

Technological failures are often the easiest disasters to anticipate and prepare for. The simplest plan usually is redundancy: backups, hot-start servers, and alternative office sites are fine examples of how redundancy can help you overcome some breakdowns.

Breakdowns aren't limited to network equipment. In a heat wave, you can't run computers without air conditioning. Breakdowns don't have to mean that the equipment's actually broken, either. If a virus renders your network server unusable because you're afraid of spreading the contamination, the server is effectively broken, even if it boots up fine.

Behavior

Human-related disasters are probably the hardest to prepare for. If you know your geographical region, you can prepare for weather. If you know your equipment, you can prepare for hardware failures. How do you prepare, though, for a bad outbreak of flu that keeps half the work force of your community (including temps) home for days on end?

Technological breakdowns can, directly or indirectly, be caused or exacerbated by humans. If the backup operator hasn't been doing his or her job, then a technological problem becomes a disaster when the server's hard disk fails and you need to reload the backups. Another behavior-caused technical disaster could be unlicensed software. If the Software Publishers Association catches your company with unlicensed software on company machines, they can temporarily inconvenience your operation and levy very steep fines. Unlicensed copies are illegal, but it's still odd to think that having too many copies of Microsoft Word 6 for Windows could destroy your company. If a disgruntled employee calls the SPA on you (the SPA gets most of its information from tips like that), then it's a definite possibility.

Key Elements of a Plan

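Five things are crucial to an effective disaster plan: support of upper management, clarity of purpose, no finger pointing, no single point of failure, and flexibility.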

These are fine words, but what exactly do they mean?

Support of Upper Management

Before anything else, you must have the full knowledge and support of upper management. Disaster planning is not something that you can prepare as a surprise for the boss. It requires too much in the way of funding and the type of insights that the boss is privy to. She may like your initiative in coming to her with the need for disaster planning, but don't even think about starting without the go-ahead from upper management.

Clarity of Purpose

Before beginning to write your disaster recovery plan, you must determine what you're trying to accomplish so that you can define what you must do and how much it will cost.

Before considering these questions, some sort of risk analysis is in order. What kinds of disasters are likely to strike your organization? After all, there's no point in preparing for something that isn't likely to happen. If your company is in the Midwest, landslides probably aren't much of a problem, but tornadoes could be.
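One informal way to start a risk analysis is to score each candidate disaster by how likely it is and how badly it would hurt, and then rank the candidates by the product of the two. The sketch below assumes a 1-to-5 scale for both numbers. The scale, the sample scores, and the names `Risk` and `rank_risks` are illustrative assumptions, not figures from this chapter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    name: str
    likelihood: int  # 1 (rare) to 5 (expected); assumed scale
    impact: int      # 1 (nuisance) to 5 (company-ending); assumed scale

    @property
    def score(self) -> int:
        # Simple likelihood-times-impact score.
        return self.likelihood * self.impact


def rank_risks(risks):
    """Return the risks sorted from highest score to lowest."""
    return sorted(risks, key=lambda r: r.score, reverse=True)


# Made-up scores for a hypothetical office in the Midwest.
example_risks = [
    Risk("landslide", likelihood=1, impact=4),
    Risk("tornado", likelihood=3, impact=5),
    Risk("server disk failure", likelihood=4, impact=3),
]

for risk in rank_risks(example_risks):
    print(f"{risk.score:>2}  {risk.name}")
```

A ranking like this is only a conversation starter. A low-scoring item still deserves some serious thought before you cross it off.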

Don't eliminate potential disasters out of hand without some serious thought. Things change, and events that were unthinkable ten years ago could happen today. For example, in times past, ambitious thieves mostly went after banks or drug companies. High-tech centers, such as those manufacturing or distributing memory chips, are among the new robbery targets. If you haven't considered how you'd manage a complete loss of your entire inventory, think about it.

***

Disasters don't always occur at night or on weekends, and some of them can be hazardous to people as well as to businesses. If at all possible, maintain some emergency supplies on-site, ranging from food and water to first-aid kits and common tools. A flashlight (with extra batteries) is always a good thing to have around.

What Should This Plan Accomplish?

What is the purpose of the plan? Is it to keep the network going under any circumstances or just to protect the company's informational assets from going up in smoke? This may seem to be a silly question, but the answer really drives the rest of the planning. If the goal of your disaster plan is to provide unlimited security for your company's operation, you should prepare yourself to spend unlimited funding. If the purpose of your disaster plan is only to maintain data integrity, the task becomes simpler and less expensive.

Think hard before settling on a definite purpose. A lot is involved in getting a company up and running, and reproducing it when the original plan doesn't work isn't an easy task. At best, a disaster plan ensures the continuity of your company's operations, but what exactly is involved?

Maintaining Continuity

The fundamental element of any disaster recovery plan is that it should be "business as usual" as soon after the disaster as possible. Continuity isn't just about doing backups or maintaining a spare server; it's about everything all the way up to a predetermined chain of command in case the CEO dies in a plane crash. Change is exciting, but unplanned change can be disastrous to a company's functioning.

All the information you need to keep the company going after disaster strikes should be part of your written disaster plan. This includes information like home telephone numbers for employees and contractors, vendor numbers in case you need to replace equipment, a chain of command, and so forth. At least one copy of this information should be maintained off-site, somewhere accessible.

Notification Procedures

If disaster strikes your office, who needs to know about it?

Everybody. Those charged with solving the problem need to get the word and fix it. Those employees who don't already know of the disaster need to know what to do about coming to work: should they take the day off, report elsewhere, or work at home if possible? If your office has a customer base, you may need to inform them if the disaster impinges on your ability to deliver products or services. (Of course, you'd probably rather not tell customers this stuff, but sometimes it's unavoidable.)

The point is not just to tell people that a disaster has occurred, but to inform them what's being done about it or what they should do. It's not nearly as bad to tell your customers, "The office has flooded, but we've got your proofs ready for pickup at another location," as it is to tell them, "The office flooded last week, so we lost your order and the work already done on it. Can you resubmit the order?"

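Communication is sometimes one of the first losses to disaster, just when it becomes most important. In a nutshell, the key people to notify are the people charged with fixing the problem, the employees, and any customers whose products or services are affected.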

Preserving Data

Backups are probably the first thing you think of when it comes to preparing for disaster. Have you given any thought, however, to the data your office stores that can't be backed up? Not all the important data in your office is kept on the server. Much of it is probably not even electronic. Your company might have any or all of the following documents only in hard copy:

How will you recover critical hard copy documents in case of disaster? It's better to make sure that you don't have to by using redundant off-site storage or a fireproof safe.

What's Critical?

Not every function is equally essential to keeping the company rolling, and the definition of what's essential may change from enterprise to enterprise, or over time. Given that you can't do everything at once, you need to prioritize. What needs to be restored first, and how fast must it be brought up to keep the company going?
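One way to turn that prioritizing into something concrete is to write down, for each business function, how long the company can do without it, and then restore in order of least tolerable downtime. The following is a minimal sketch under that assumption. The class `BusinessFunction`, the function `restoration_order`, and the sample hours are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BusinessFunction:
    name: str
    max_downtime_hours: float  # how long the company can do without it (assumed)


def restoration_order(functions):
    """Return the functions sorted so the least tolerable outage comes first."""
    return sorted(functions, key=lambda f: f.max_downtime_hours)


# Hypothetical figures; yours come from talking to the people who do the work.
example_functions = [
    BusinessFunction("archived project files", max_downtime_hours=168),
    BusinessFunction("order entry", max_downtime_hours=4),
    BusinessFunction("electronic mail", max_downtime_hours=24),
]

for position, function in enumerate(restoration_order(example_functions), start=1):
    print(position, function.name)
```

Because the definition of what's essential changes over time, a list like this needs revisiting whenever the plan is reviewed.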

What Resources Are at My Disposal?

Unfortunately, what you want the disaster plan to accomplish and the resources you have available might not always match. Although it would be nice to be able to maintain an entire network at a separate site so that, in case of fire, the entire company could just switch offices and life could go on as usual, the cost and complexity of such an arrangement render it no more than a dream for most enterprises. The U.S. government spent millions on a network (the Internet) that would stay running even if large portions of it got bombed, but most of us have business managers who hide the checkbook and snarl when they see us coming. Therefore, once you know what you want the disaster plan to do, you've got to talk to your business manager or CFO to find out how much you can spend to make it do that. Once you know what you want and know how much you've got to spend, you can either revise the plan accordingly or negotiate for more funding.

No Finger Pointing

Disaster is bad enough; ambiguous responsibilities make it worse. When creating your plan, decide ahead of time who's responsible for each recovery task and exactly what that responsibility entails. For example, say that your plan makes the administrative assistant responsible for backups. Does that mean that the administrative assistant performs backups, stores backups, verifies backups, or all of these? Is he responsible for all the backups on the network or just one server? Who's responsible for any files that users store at their workstations?

As you can see, once you start asking questions, defining the scope of responsibility gets increasingly difficult. Do it now, so that when the server dies and it's discovered that the backups were cooked on a shelf in the sun and are totally useless, you know who's responsible. Better yet, if you define the scope of the task ahead of time, then you can make sure that whoever's got the responsibility knows how to do the job properly. If you know who's storing the backups, you can train them before the backups get fried.
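If "verifies backups" is part of someone's job, it helps to say what verification means. One possible definition, assumed here for illustration, is that a file restored from the backup has the same checksum as the original. The sketch below uses Python's standard `hashlib` module; the function names and the sample file names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(original: Path, restored: Path) -> bool:
    """True if the restored copy is byte-for-byte identical to the original."""
    return sha256_of(original) == sha256_of(restored)


def demo():
    """Compare an original file with an intact copy and a damaged copy."""
    with tempfile.TemporaryDirectory() as tmp:
        original = Path(tmp) / "ledger.dat"
        good = Path(tmp) / "ledger.restored"
        bad = Path(tmp) / "ledger.damaged"
        original.write_bytes(b"accounts receivable\n" * 1000)
        good.write_bytes(original.read_bytes())
        bad.write_bytes(original.read_bytes()[:-1])  # simulate a damaged copy
        return backup_matches(original, good), backup_matches(original, bad)


print(demo())
```

A check like this only proves something if the restored copy really comes off the backup media, so it belongs in a periodic test restore rather than in the backup job itself.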

Determining responsibility is important not only before disaster strikes but during the recovery stage as well. If the first person in the office on Monday discovers that the server died over the weekend, what procedure should they follow? To maximize your chances of recovering from disaster, everyone in the office, not just those who created the disaster plan, needs to know what to do and whom to call when something goes wrong. This means written instructions and prior training for everybody.

No Single Point of Failure

Well-defined responsibility is important, but ensure that the success or failure of the entire plan doesn't rest on one person's shoulders. Redundancy is a good idea for all elements of a disaster plan, not just the hardware. Make sure that the plan won't fail if one thing goes wrong. This means the following:

You can probably add to this list yourself. The bottom line is to make sure that your disaster plan won't fail because the business manager is on vacation with the key to the backups.
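A quick way to look for this kind of weakness is to list who can perform each recovery task and flag any task that depends on fewer than two people. This is a minimal sketch; the task names, the roles, and the function `single_points_of_failure` are invented for illustration.

```python
def single_points_of_failure(assignments):
    """Return the tasks that have fewer than two distinct people assigned."""
    return sorted(
        task for task, people in assignments.items() if len(set(people)) < 2
    )


# Hypothetical assignments for a small office.
example_assignments = {
    "hold key to backup storage": ["business manager"],
    "run nightly backup": ["administrative assistant", "network administrator"],
    "call hardware vendors": [],
}

for task in single_points_of_failure(example_assignments):
    print("Needs a second person:", task)
```

The same test applies to things as well as people: a single copy of the backups or a single key is as much a single point of failure as a single trained employee.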

Flexibility

Finally, a good disaster plan is flexible and can adapt to change. This is important because, frankly, it will have to. Companies grow, personnel come and go, and hardware changes. If your disaster plan details an excellent recovery system that's too rigid, it won't remain useful for very long.

If you've paid attention to the preceding three key elements, this one shouldn't be difficult. Determine exactly what the plan is supposed to do, who's responsible for what (and what that responsibility entails), and make sure that you've got a fallback position if something goes wrong. Once you've done that, test the plan and update it annually or as needed. With this kind of preparation, very little can take you completely off-guard.

Creating the Disaster Planning Team

Thus far, we've been talking about a disaster plan as though it were the creation of one person. That's not accurate at all. For a plan to be really effective, it requires input from several sources: the personnel manager, the network administrator, the business manager, and the boss.

How many people crowd the table, and who they are, depends on the size of your enterprise. This list might mean the official CPO, CIO, CFO, and CEO, or it might mean the de facto holders of the same positions. One person in your organization might fill more than one of these jobs, but no matter how responsibilities are allocated in your enterprise you really can't create a useful plan without input from all these sources. Let's look at the role of each of these players in turn.

The Personnel Manager

The personnel manager, in this instance, means the person in charge of the staff. The disaster plan needs his perspective because he is the person who knows most about what people are doing with the network. He should know what applications the staff uses most often, what time everyone comes in (and therefore the time that it would be nice to have everything fixed by), and in general, what everyone in the office uses the network for. This should help you prioritize if time is short and you have to figure out what to fix first.

The personnel manager can help prepare the staff for disaster, too. Training is key to surviving a problem, and the personnel manager is the most likely candidate for making sure that this training is accomplished. The personnel manager should make the following contributions to ensuring that the staff are prepared:

The Network Administrator

The network administrator is the technical voice of the disaster planning staff. Of all the people creating the plan, she's the one most likely to know what recovery hardware and software are available, how much it all costs, and how to research other possibilities. Although the network administrator isn't likely to know everything about all company assets, she can help the business manager (whose role is described later) with details like preferred vendors and the computer needs for an alternative site.

Keeping tabs on the components of each machine on the network is another part of the network administrator's job, and this kind of information can prove valuable to the planning process. If, for example, the main server has an IDE controller, there's no point in buying a SCSI tape backup system for it unless you also purchase a host adapter. If this seems like an obvious point, think again. At one company I know of, the business manager purchased two new hard drives and controllers without first checking with the network administrator to see what slots were open in the servers those hard drives were for. When the day came to replace the hard drives, one of the servers had no VL-Bus slot available, rendering its new controller useless.
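The slot problem in that story is the sort of thing a simple per-machine record can catch before the purchase order goes out. The sketch below assumes the network administrator keeps a count of free expansion slots by bus type for each server. The server names, the slot counts, and the function `servers_lacking_slot` are hypothetical.

```python
def servers_lacking_slot(servers, required_bus):
    """Return the names of servers with no free slot of the required bus type."""
    return sorted(
        name
        for name, free_slots in servers.items()
        if free_slots.get(required_bus, 0) < 1
    )


# Hypothetical record of free expansion slots per server.
example_servers = {
    "server_a": {"ISA": 2, "VL-Bus": 1},
    "server_b": {"ISA": 1, "VL-Bus": 0},
}

# Check before buying a VL-Bus controller for each machine.
print(servers_lacking_slot(example_servers, "VL-Bus"))
```

The record is only as good as its last update, so it's worth refreshing whenever a card goes into or comes out of a machine.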

The network administrator must also work as the voice of reason. As the person who presumably has the most computing and networking knowledge, she has to explain to the others on the team why it's not practical to replicate all directories onto another server via a modem or what the current capacity limits for tape devices are. This kind of technical advice is vital to both the success of the plan and the network administrator's sanity. Much of the responsibility for developing and implementing the plan will likely rest on her shoulders.

The Business Manager

The business manager might have to act as another kind of voice of reason. As the person with the best idea of the company finances, the business manager has to take the pricing information that the network administrator provides and balance it against how much the company can spend on a disaster plan. Although it would be nice to believe that there's no limit to how much your company is prepared to spend on this worthy cause, that simply isn't practical. There's no point in the company running short on operating expenses in order to over-prepare for disaster, and it's the job of the business manager to make sure that this doesn't happen. The personnel manager wants to make sure the staff is able to work; the network administrator wants to keep the network going, but the business manager is responsible for making sure that the company doesn't go broke fulfilling these goals.

Part of not going broke in the face of disaster lies in making sure that appropriate insurance covers the possibility. The business manager should bring all relevant insurance information to the meeting, including coverage data for

The business manager can provide other information that's useful to the disaster plan. As the person in charge of purchasing, the business manager should have some kind of inventory of the company's hardware assets; even if the network administrator knows what's in each machine, she may not have a comprehensive list. Why does this matter to a disaster recovery plan? Two reasons, actually. First, a hardware inventory makes it easier to know how old all the hardware is and prepare accordingly. Second, a hardware inventory is useful for knowing what you've got around to fix the network or individual PCs. To keep the list accessible in case of network disaster, don't keep it in a database on the server (an erasable whiteboard or chalkboard works better), and a copy of the complete list should be stored off-site and updated periodically.
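Both uses of the inventory can be sketched in a few lines: picking out hardware past a chosen age, and writing the list out in a plain form that can be printed or carried off-site. The `Asset` fields, the sample entries, and the three-year threshold below are assumptions made for illustration only.

```python
import csv
import io
from dataclasses import astuple, dataclass, fields


@dataclass(frozen=True)
class Asset:
    tag: str
    description: str
    year_purchased: int


def due_for_replacement(assets, current_year, max_age_years):
    """Return the assets that are at least max_age_years old."""
    return [a for a in assets if current_year - a.year_purchased >= max_age_years]


def to_csv(assets):
    """Render the inventory as CSV text suitable for printing or off-site storage."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([f.name for f in fields(Asset)])
    writer.writerows(astuple(a) for a in assets)
    return buffer.getvalue()


# Hypothetical inventory entries.
example_inventory = [
    Asset("SRV-01", "file server", 1991),
    Asset("WS-07", "workstation", 1994),
]

print(due_for_replacement(example_inventory, current_year=1995, max_age_years=3))
print(to_csv(example_inventory))
```

Whatever form the list takes, the copy that matters in a disaster is the one that doesn't live on the server.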

The Boss

The boss, whether the company owner or just the general manager at your site, is ultimately responsible for what happens at your installation. He should have a good, broad view of the needs of network users, the money that's available for the project, and the capabilities of the network staff. The other members of the team can advise, but keep in mind that final decisions lie here, as this is the person who could go out of business or be first in line for firing if an unrecoverable disaster occurs at the site.

Enough about planning for disaster. What can you do about it?

Hot and Cold Sites

In a disaster, the biggest problem might not be how to fix the problem, but rather how to continue with business as usual while fixing the problem. Some organizations aren't dependent on a 9-to-5 schedule, and if a network disaster keeps them closed on Monday it might be possible to make up for it with a double shift on Tuesday. For others, especially service-providers or those who work on daily deadlines, shutting down for a day can be more than an inconvenience and a late night to catch up; it can cripple company performance. After all, if you're providing a service like a newspaper or mail order, then as soon as you can't provide the service, your customers will go elsewhere. They'll start subscribing to the cross-town paper, or order their products from a different company, and you'll have lost them.

To avoid this kind of mishap, some enterprises that cannot afford to shut down for any reason maintain a second site to move into if the original office isn't working. These second sites (obviously demonstrating the precognitive talents of their network administrators) fall into one of two classes: cold sites and hot sites.

Cold Sites

There isn't really a hard-and-fast definition for a cold site, but it's generally acknowledged to be a site that can be a functioning office with a little work, but isn't ready to go at a moment's notice.

Cold Site Considerations

Cold sites are the compromise between the expense and maintenance requirements of a hot site and a closed office when the network's down. As such, they've got several tradeoffs to consider.

First, maintaining a separate office isn't cheap. To get the most effective use of your investment, consider leasing a site in a different building from your main office. That way, if the problem is building-related, your cold site won't go down along with your main office. You might be able to reduce costs by subletting the site, but only if the persons leasing it understand that your office might kick them out at any time. You're not very likely to find tenants on these terms.

Second, perform regular maintenance checks on any site equipment. If you store extra machines there, make sure that they're still there and that environmental problems like leaks or fire haven't rendered your spare office useless.

Third, plan ahead. Think about what you'll need to bring with you if you must move the office operation to the new site. Most likely, you'll need more than computers. How about telephones? Paper files? Office supplies? Find out what people use every day and see if there's any way to take it along. (Of course, if the main office burns down, you won't be able to supply the second site from the first, but most disasters that could halt work for a day or two aren't quite that destructive.)

Cold Sites in a Branch Office

You might not have to rent another office for your cold site. If your company has a branch office in the area, you may be able to accommodate your people there, at least for a short while.

I know of one company with an arrangement like this. The main office and a smaller satellite office are in the same city, not too far from each other. The local power company once had a bad line day, and the three people in the satellite office were able to pick up their machines and move to the main office. When they arrived, they plugged into the network (a 10Base2 Ethernet, same as the one at the satellite office), restored the files that they needed to the local server from their own server backups, and were ready to go.

Of course, this only worked because of previous planning:

If you've got two offices in the same area, and enough space to squeeze in some people in a pinch, this kind of cold site might work for you. If nothing else, it keeps down the cost of an extra site. It might not be fun or convenient, but at least it keeps people on the job.

Hot Sites

Cold sites are good, but they take time to set up and get operational. If your organization must have nearly no down time in case of an emergency, you probably need a hot site, which is one that's ready for your office to move in and begin working immediately.

At best, a hot site should have computers, loaded software, telephones, file cabinets, and so on. It needs all the things that your office uses every day but doesn't necessarily think about. If at all possible, this equipment should be set up in the form of an office, so that almost all you need to do in order to get the office working again is to bring in the staff. Your network should be cabled, workstations arranged, backup tapes ready to restore to the server, and so on.

Hot sites are convenient, but for most of us their cost outweighs the advantages. First, the maintenance required on a hot site is more intense than that for a cold one, because you're replicating the entire network and therefore must make sure that each day's backups (if your organization requires a hot site, it's a pretty good bet that you're backing up every day) are ready to install on the server. This kind of preparation takes time.
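Part of that maintenance is confirming that the backup waiting at the hot site is recent enough to be worth restoring. A minimal sketch of such a check follows, assuming a daily backup schedule and therefore a 24-hour limit. The function name and the sample dates are hypothetical.

```python
from datetime import datetime, timedelta


def backup_is_fresh(last_backup, now, max_age=timedelta(hours=24)):
    """True if the most recent backup is no older than max_age."""
    return now - last_backup <= max_age


# Hypothetical timestamps: a backup from last night and one from two days ago.
morning = datetime(1995, 3, 7, 8, 0)
print(backup_is_fresh(datetime(1995, 3, 6, 23, 0), morning))
print(backup_is_fresh(datetime(1995, 3, 5, 7, 0), morning))
```

If your organization backs up more or less often than once a day, the `max_age` value should change to match.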

Second, the costs of maintaining an entire additional office are high. Your best bet might be to share a hot site with another office with similar needs, and hope that you both don't need it at the same time.

Planning a Site

Whether a hot or cold site best fits your organization's needs, you'll need to figure out what your office requires to function every day and how you can get it to them. If your company maintains an alternative site, that site must be able to fulfill the following functions:

For this to be possible, you need to make sure that the disaster recovery plan includes the following information about the site:

Preparing for Viruses

Since they started getting a lot of press about ten years ago, computer viruses have been a major concern for businesses. In 1991, I saw one office shut down its computer operations entirely for a day, because it was Michelangelo's "attack date," and they weren't taking any chances. (None of their machines had tested positive for the virus, incidentally, and all were stand-alones except for one on a line to their main office.)

The trouble is that this kind of reaction is exactly the one that virus authors are looking for. Imagine, if you were addicted to that kind of power trip, how good it would feel to know that you'd shut down a large portion of corporate America for a day, just because of a threat. I admire the tenacity and talent of some virus authors, but I think that writing virus programs is a totally asinine way to spend this tenacity and talent. The best way I can think of to encourage them to find another outlet for their skills is to spoil their fun. To that end, let's talk about some ways in which you can avoid virus attacks, or recover from them without much difficulty.

Understanding Viruses

Computer viruses are not supernatural. They possess no intelligence and can't possibly do you any harm unless you let them.

How would you let them? By letting an infected floppy (the most common means of infection) into your computer. A virus can only get to your system in one of two ways: during bootup or if you run the virus's program file. Essentially, your computer must access the virus program in some way in order for it to take effect. Whether this happens by booting from the floppy (even an aborted boot usually infects your system) or by running the virus program is irrelevant. (By the way, an infected floppy doesn't have to be a boot disk to infect your machine if you leave it in drive A while rebooting. Any infected disk might do the trick.)

Copying noninfected files from an infected floppy or doing a DIR on one does not spread the infection. Although the virus might show up in memory after you've done a DIR on an infected floppy, it's not written to disk until you put it there.

How Does a Machine Catch a Virus?

Not everyone understands how infection works, so it's worth explaining. Suppose, for example, that I go out to a client site to fix a network. While I'm there, I borrow a floppy disk to copy some code that I want to review at home. This floppy has a boot sector virus like B1 on it, but I don't know that. I bring the floppy home, and stick it in my floppy drive. I'm not infected yet. I do a DIR and pull off the files that I need, reading them on my machine. Still no infection. I write to the files on the floppy. Not a problem.

Then, I need to reboot for some reason. With the disk in drive A and the door closed, I press Ctrl+Alt+Del to restart the computer. When the computer halts during the reboot, I notice the familiar Non-system disk message on my screen. "You fool," I chastise myself and remove the disk to finish booting.

Now, I'm infected. During the boot attempt, my computer read the floppy's boot sector and ran the code it found there, looking for a system to boot from. Even though the disk wasn't bootable, the virus in its boot sector took hold.

The interesting thing is that, in most cases, I still don't know I'm infected. Not all viruses take the first opportunity to wipe out your MBR, format your drive, or display cutesy messages. Viruses can lie dormant for months or even years, waiting for their trigger: a particular date, a certain number of "copy" actions, a particular key sequence, or whatever. In this case, I might not find out that I had a virus until the next Friday the 13th arrived, or until I typed the word "Reagan."

I think that this dormant stage of viruses is what makes them so scary to many people. You trust a computer with your data, and the idea that it might suddenly turn on you and format your hard disk at an inappropriate time is more than a little unsettling. Let's talk about how to avoid this unsettling behavior.

Preventing Virus Attacks

The comparison between biological viruses and cybernetic ones has been made so many times that I refuse to make it again, but it's pretty accurate. If you're living in a germ-free environment, you won't get sick; if you're computing in a germ-free environment, your computers won't get sick. A moderately strict quarantine system helps you avoid most viral infections, without having to institute Draconian measures that your users want to circumvent. Let's take a look at some of the measures that real-world enterprises are using to prevent virus attacks.

Quarantine Servers

Today, when more than a few people work at home at least part of the time, it's impractical to forbid people to bring floppies to work, but you can insist on a quarantine period. Create a company policy requiring people to drop disks off for checking when they first bring them in and before sticking them into a floppy drive.

Then it's up to you to virus-scan the floppies as quickly as possible. Do it early in the day if you can so that people don't start ducking the floppy bin because of delays in getting to use their data. If you've got an extra stand-alone machine that you can use as the virus-checker, do so. If not, just run the virus checks on any machine that you know is clean. If the scanner detects a virus, clean the infected floppy.

If you follow this procedure for every floppy that comes in the door of your office, you should be able to prevent most virus attacks.

Educate Users

Educating your users about virus attacks goes a long way toward preventing virus attacks upon your network. Tell them what viruses are, where they can come from, what they can do, and how the measures you're instituting will prevent infections, but only with help from them. Instituting a mandatory disk-scanning program, for example, won't help at all if no one but you understands why you're doing it.

Users should know to do the following:

With rare exceptions, the users on your network are not out there to ruin your day. If you tell them how to avoid infecting the network's machines and you explain why it's important, chances are they'll cooperate. If you act as though they should already know the reasons, you're more likely to meet resistance.

Running Effective Virus Scans

You can perform a real wall-to-wall search for viruses on a machine if you do it right. First, before scanning a machine's hard disk, format a floppy to be a system disk and then copy your favorite virus scanner onto the floppy. Next, cold boot (turn the machine off and then back on) from the floppy to restart the machine. Then run the virus scanner. Why cold boot the machine first? Because some viruses can fake a reboot unless you've actually turned the machine off.

Second, make sure that you update your virus scanner regularly. Most makers of anti-virus software run BBSs of updated virus signatures. Download these on a regular basis. Remember, there's no generic "there's a virus here" signal; most scanners work by looking for signatures belonging to specific viruses. Even if it's a common virus, your scanner can't find it without knowing how to look for it.

Another reason to update your virus scanner's signatures regularly is to avoid false alarms. One government shop was preparing a number of black-and-white graphics for a presentation and had a virus checker running in the background. When the graphic artists began reading one file into memory, it set off the virus checker because the data pertaining to the large chunks of black contained long strings of 0s. The virus checker hadn't been updated for a while, and it identified the strings of 0s as a virus signature. The artists, a fairly computer-literate bunch themselves, had a heck of a time convincing the computer security types that the file itself might be setting off the virus scanner. It took several days to get the virus checker updated, but when the file was reread on the machine with the updated virus scanner, the alarm no longer went off.
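To make the idea of signature matching concrete, here is a minimal sketch in Python. It assumes a tiny, made-up signature table: the names EXAMPLE-A, EXAMPLE-B, and OVERBROAD and their byte patterns are invented for illustration and belong to no real virus or product. A real scanner relies on a far larger vendor-supplied list and smarter matching, so treat this only as an illustration of why the list has to be current and well chosen.

```python
# Hypothetical signature table: name -> byte pattern (all invented).
SIGNATURES = {
    "EXAMPLE-A": bytes.fromhex("deadbeef4d5a"),
    "EXAMPLE-B": b"\x90\x90\xeb\xfe\xcd\x21",
}

def scan_bytes(data, signatures=SIGNATURES):
    """Return the sorted names of every signature whose pattern occurs in data."""
    return sorted(name for name, pattern in signatures.items() if pattern in data)

# An infected sample is found only because its signature is in the table.
infected = b"header" + bytes.fromhex("deadbeef4d5a") + b"payload"
print(scan_bytes(infected))                   # ['EXAMPLE-A']
print(scan_bytes(infected, signatures={}))    # []

# An overly generic signature (a run of zero bytes) flags harmless data,
# much like the black-and-white graphics in the story above.
stale = dict(SIGNATURES, OVERBROAD=b"\x00" * 16)
mostly_black_image = b"\x00" * 4096
print(scan_bytes(mostly_black_image, stale))  # ['OVERBROAD']
print(scan_bytes(mostly_black_image))         # []
```

With an empty table the scanner reports nothing, even for the infected sample, and with a sloppy signature it raises a false alarm on clean data. Both failures come from the signature list rather than from the scanning code.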

Removing Viruses

This topic is covered in detail in the virus chapter, but there are a couple of points worth reiterating:

If you follow these rules, you'll detect more viruses and hurt yourself less trying to remove them.

Data Loss Planning

Preparing for data loss isn't very difficult and doesn't have to be expensive, but it's not always the top priority for either network administrators or their managers. There might be a feeling that, if the network is functioning properly, there's no reason to plan for data loss, but it doesn't take much imagination to think of plenty of situations in which you might need data backups even if the network is running fine:

Malice, mistake, or mischance can destroy your company's data, and that's without even taking routine hardware failures or overwritten files into account.

You can prepare for data loss on your main server using either (or both) of two approaches: dynamic data replication of some sort (such as disk mirroring/duplexing or directory replication) or a backup plan.

Dynamic Data Replication

As you have learned, duplicating your data as it's written doesn't involve much planning: you decide how much protection you need, and then balance that against how much you can afford and what kind of performance degradation you can live with. (Although hard disks are getting cheaper all the time, any kind of redundancy requires an additional investment, and not all operating systems support RAID without additional software.) Once you've chosen the method that's right for you, that's pretty much the end of the story.

Backing Up

Even if you're dynamically replicating your data, a backup plan isn't dispensable. Although backups are less critical for data integrity when your data is replicated, a backup plan can still be a useful archiving tool. Moreover, many companies can't afford the hardware investment inherent in RAID or the network degradation inherent in replication; for them, backups are the only option for preventing data loss.

Chapter 17, "Backup Technology: Programs and Data," discussed the various kinds of backup hardware and software at your disposal. Armed with that information, you know what combination best suits your needs and your budget. At this point, we can tackle the question of how to put this backup hardware and software to best use in a backup plan.

Creating a Backup Plan

When thinking through the backup plan for your network's server or servers, ask the following questions:

Your answers to these questions should help determine how stringent your backup plan needs to be, who's going to be involved, and how you're going to make sure the backups work down the line.

Which Data Matter Most?

Backup plans are about data, but not all data is equally crucial. You've got limited time and money to spend protecting your company's data, so make sure you're protecting the right stuff.

Look at the distribution of data in your office and consult with the CEO and others who use the data. Which data is vital to the survival of the company? Protect that first. Which data is important but can be re-created without too much difficulty or expense? Which data is important more for historical purposes (like files containing old memos) than for immediate use? Which files are easily recovered if lost? This is not to say that you shouldn't back up all data whenever possible, but given a choice between backing up the application server holding only a few programs or the server holding all your accounting files, the latter should be backed up first.

Are You Protecting Too Much?

If the concept of too much data protection sounds odd, think of this: what happens to network performance when the file server is busy doing something? It gets slower, right? Therefore, even knowing that if the server goes down at 3:00 PM you might lose all the data entered since yesterday's daily backup at 10:00 PM, you might not want to run continuous backups all day. For that matter, because a lot of backup software can't back up open files, you couldn't get a complete backup even if you ran backup after backup all day.

You can experience data loss even with the most complete backup system imaginable. The key here is to figure out how much you can afford to lose and then balance that against how much effort you're willing to put into data preservation. For most of us, a daily backup is good protection. If your enterprise's needs demand more frequent backups, consider another form of data protection, such as directory replication or disk mirroring.
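The "how much can you afford to lose" question comes down to simple arithmetic, sketched here in Python. The 10:00 PM backup and 3:00 PM failure come from the example above; the calendar dates and the function name exposure_hours are arbitrary choices made only for illustration.

```python
from datetime import datetime

def exposure_hours(last_backup, failure):
    """Hours of work at risk: everything written since the last good backup."""
    if failure < last_backup:
        raise ValueError("failure time precedes the last backup")
    return (failure - last_backup).total_seconds() / 3600

last_backup = datetime(1995, 7, 14, 22, 0)   # yesterday's backup at 10:00 PM
failure = datetime(1995, 7, 15, 15, 0)       # server goes down at 3:00 PM
print(exposure_hours(last_backup, failure))  # 17.0
```

If seventeen hours of re-entered data is more than your enterprise can tolerate, that's the signal to look at replication or mirroring rather than more frequent tape backups.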

Who Does the Backups?

Backups are too important to leave to, "Some time during the day, somebody stick a tape in the drive and turn it on." Backups done under such circumstances are apt to be done haphazardly or not at all. To avoid data loss, you need to make sure that you allocate the responsibility ahead of time and teach whoever is responsible how to do the backup and make sure that it's complete.

The person doing the backups may determine to some extent when they get done. Much backup software has a timing mechanism built into it so that you can delay backups to a time when network activity is at its lowest. If your package does not have such a mechanism or you're having trouble getting it to work properly, you may want to assign the task to someone who's there earlier or later than most people, so that backing up files doesn't slow down the network. In general, though, you should arrange for backups to be done when no one is accessing the server. As noted previously, much backup software can't copy a file and flip the file's archive bit when the file's in use.

Protecting the Protection: Storing and Verifying Backups

A damaged or corrupted backup is as bad as having no backup at all. Therefore, it's vital that you store the tapes, cartridges, or disks containing the archived files somewhere safe; also, test them periodically to make sure that you can restore them.

What's a good storage place? Well, the basics are simple: if you wouldn't like the temperature and humidity of the storage area, your backups probably won't, either. Electronic media do not like direct sunlight, extreme heat or cold (although cool temperatures are all right), or extreme dampness. If you keep your backups on-site, normal office conditions should be good for them, as long as you don't place them on sunny windowsills.

Your enterprise might have particular security needs relative to backups. If your data is particularly sensitive or liable to theft, then locking up the backups either on-site or at a secure location is a good idea. If you must lock up backups, keep the number of keys to a minimum, but make sure that a spare is available. The middle of a crisis is not the appropriate time to find out that a key fell off your keychain and that you have to bring in a locksmith on short notice.

Always label your backups with the date, source server (if you have more than one), and the type of backup. Further labeling is necessary if your server has more than one drive or if you're in transition between two types of backup software. For example, the backup made on July 15, 1995 from server ACCOUNTS might be labeled the following way:

07/15/95 Full Backup of ACCOUNTS (Backup Exec)

From this, you know how complete the information on the tape is, how recent it is, which server it belongs to, and which software must be installed to restore the backup.
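If you print labels rather than write them by hand, a small helper keeps them consistent. This is only a sketch in Python; the function name backup_label and its parameters are my own invention, and the layout simply mirrors the sample label above.

```python
from datetime import date

def backup_label(when, kind, server, software):
    """Build a label giving the date, backup type, source server, and software."""
    return f"{when:%m/%d/%y} {kind} Backup of {server} ({software})"

print(backup_label(date(1995, 7, 15), "Full", "ACCOUNTS", "Backup Exec"))
# 07/15/95 Full Backup of ACCOUNTS (Backup Exec)
```

If your server has more than one drive, add the drive as another parameter so that it appears on every label.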

Some organizations update the information on the backup label every time the media is rewritten, and some name media only once but keep a schedule of which tape is being used for a particular backup. Whichever method you use, make sure that you can quickly and accurately determine the tapes that you need to restore your data following a disaster.

Practice Restorations

It's a good idea to practice restoring data from the backups. Practice ensures that you know the restoration procedure and can do it with confidence when you really need to. More importantly, this practice confirms that the backup works. Although the "verify" feature that some backup software has is useful for telling you that what's on the tape after the backup matches what's on the hard disk, don't trust any verification totally until you've successfully restored files from the backup. After each backup is complete, choose a file that won't have changed in the interim, and see if you can restore it to the drive from the backup medium. If the test is successful, you can feel much more secure that the backup is good.
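One way to check a test restoration is to restore the file to a scratch location and compare it with the unchanged original. The Python sketch below does the comparison with a SHA-256 checksum. The function names and the idea of restoring to a separate path are assumptions of mine, not features of any particular backup package.

```python
import hashlib

def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def restore_matches(original_path, restored_path):
    """True if the restored copy is byte-for-byte identical to the original."""
    return file_digest(original_path) == file_digest(restored_path)
```

Call restore_matches with the path of the file on the server and the path of the copy you restored from tape. A result of False means either the file changed after the backup or the backup is bad, which is why the test file should be one that doesn't change.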

Archiving Old Backups

How many generations of backup tapes do you need? Depending on the kind of data that your enterprise produces and the likelihood that you'll need to restore old data, you may need to keep as few as the two previous backups or as many as six months' worth of backups. Organizations that use a backup system to archive seldom-accessed data for the purpose of saving hard disk space may need to keep even more.

Keeping lots of archives has a few disadvantages:

For some installations, however, the advantages are more important:

These advantages are more than hypothetical. Some information, like annual reports, might only be referred to once a year. If the file that these are in gets corrupted, will you be able to re-create the data? Or your art department might have created a series of drawings for a company publication; once the publication's out the door, everyone can forget about those drawings...until the next edition of the publication is due a year later and the art department finds that the drawings are gone. As you can see, archived backups can save your neck in some situations.

If you're going to keep many generations of backups, organization is crucial. You might consider keeping one series of tapes in the active backup file and another series in an archive file, so that you're not trying to manage tape rotation for fifty or a hundred tapes. If you've got more than one server's backups to maintain, organization and good labeling are even more important.

One final note on archiving: magnetic media doesn't keep its data forever. After a while, the data on the tapes or cartridges will fade. To keep your archives useful, recycle tapes when their archives expire, and reformat each tape before you reuse it.
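A retention rule like "keep six months' worth, but never fewer than the two most recent backups" is easy to get wrong by hand once the tape count grows. Here is a rough Python sketch of such a rule. The 180-day window, the minimum of two, and the function name are all assumptions for illustration; pick the numbers that fit your own archive needs.

```python
from datetime import date, timedelta

def tapes_to_recycle(backup_dates, today, keep_days=180, keep_minimum=2):
    """Return the backup dates old enough to recycle.

    The newest keep_minimum backups are always protected, however old.
    """
    ordered = sorted(backup_dates, reverse=True)
    protected = set(ordered[:keep_minimum])
    cutoff = today - timedelta(days=keep_days)
    return sorted(d for d in ordered if d < cutoff and d not in protected)

archive = [date(1995, 1, 1), date(1995, 3, 1), date(1995, 7, 1), date(1995, 7, 14)]
print(tapes_to_recycle(archive, today=date(1995, 7, 15)))
# [datetime.date(1995, 1, 1)]
```

Only the January tape falls outside the 180-day window here, so it's the only one returned for recycling.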

Making Backups Easy

No matter how good your backup plan is, it won't get done if you don't make it easy. That means plenty of easily-accessible tapes, good training for each person doing backups, and backup software with a good interface. Whoever does backups should have a work area close to the server, if possible. If your server is in a separate room, try to give the backup person a work area not far from it.

Timing also plays a part in a successful backup program. The shorter the interval between backups, the more likely it is that the person doing them will remember, especially at first. (There are limits to this, of course: once a day is probably the shortest desirable interval between backups.)

Don't Skimp on the Software

You'd be surprised how much difference your choice of backup software can make. If your backup software has useful error messages and is easy to manipulate when something isn't working right, then the backups will get done even if there's a problem because the problem will be easy to fix.

I know a network administrator who recently purchased a good backup package with a very confusing interface: it wasn't a GUI, was arranged differently from the backup software on the other servers, and was generally difficult to use and even more difficult to fix. One day, the person doing the backups noticed that the scheduling part of the program wasn't working right, and that it wouldn't accept the next tape on the schedule. She notified the network administrator, but a number of things kept him out of the office most of the time for several weeks, and he didn't have time to fix the problem. No one else in the office knew the software, so the problem remained. You can probably guess the ending. The hard disk on that server crashed one day, and the only reason that it wasn't a total loss was that some people working from that server noticed that the disk sounded funny and copied some working directories to their local drives fifteen minutes before the hard disk was silent forever. Some data was still lost, however. As you can see, there were two culprits here: a backup program with a lousy interface, and a single point of failure: only one person knew the software, and he didn't fix the backup problem in time.

Implementing and Following Up on the Plan

Once you've drawn up the basis of your backup plan, you get the ever-entertaining task of making sure that the plan is carried out. Whether you're the person doing backups and maintaining archives is irrelevant; you've got to keep an eye on the plan and make sure that it really protects your enterprise's data as well as you'd envisioned. Implementing and following up on your backup plan involves several different tasks.

Post a Schedule

Post a backup schedule: a written one, not a mental one. Every time that you (or whoever does the job) perform a backup of the server, write down the date, the tape used, whether the backup verified successfully (aside from normal anomalies, of course), and your initials. Keep this schedule somewhere highly visible, like on the wall above the servers or a shelf above the desk of the person responsible. The purpose here is twofold: first, it encourages the person doing the backups to be extra-careful about making sure that they're done, and second, it means that you always know when the last backup was done (not just when it was supposed to be done), as well as what type of backup it was, and whether it was good. Thus, when the server crashes, you can tell at a glance exactly where you stand.

Take time to design your backup schedule well. Make sure that, just from looking at it, you get an overview of the backup scheme (this saves you from continually reminding people) and can see exactly where you are in the backup process. Each backup should have its own entry, with a place for the backup operator to sign off and verify that the backup is ready for restoration.
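If you also keep the schedule electronically, the same sign-off information fits in a simple comma-separated log. The Python sketch below is one possible layout. The column names, tape names, and initials are made up, and an in-memory buffer stands in for a real log file.

```python
import csv
import io
from datetime import date

# Column names are an assumption; record whatever your posted schedule records.
FIELDS = ["date", "tape", "type", "verified", "initials"]

def log_backup(stream, when, tape, kind, verified, initials):
    """Append one sign-off line to the backup log."""
    csv.writer(stream).writerow(
        [when.isoformat(), tape, kind, "yes" if verified else "no", initials]
    )

def last_good_backup(stream):
    """Return the newest entry that verified successfully, or None."""
    rows = csv.DictReader(stream, fieldnames=FIELDS)
    good = [row for row in rows if row["verified"] == "yes"]
    return max(good, key=lambda row: row["date"]) if good else None

log = io.StringIO()  # stands in for a log file kept next to the server
log_backup(log, date(1995, 7, 14), "Tape 4", "Full", True, "CR")
log_backup(log, date(1995, 7, 15), "Tape 5", "Full", False, "CR")
log.seek(0)
print(last_good_backup(log)["tape"])  # Tape 4
```

The point is the same as with the paper schedule: when the server crashes, the log tells you which tape holds the last backup that was actually good, not just the last one attempted.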

Test the System

Make sure to do trial runs periodically. Restore unchanging files from the backup, to make sure that you know how to do it, and to check that the files restore okay.

Keep Backups Handy

Even if you keep some backups off-site for safekeeping, keep the most current one on-site (unless this is totally impossible for some reason, such as a security concern). Although you don't want every copy of your data to go up in smoke if the office burns down, you also don't want to make restoring backups any more difficult than necessary. Keep the most recent backup readily available, and the second-most-recent can stay in storage in case something destroys both the original data and the most recent backup.

Know Where Your Data Is

Keep aware of changes in the whereabouts of your organization's data. In January, the server by the printer might hold little important data, but by August, its data load could change enough to require daily backups. Admittedly, if you're the network administrator you aren't likely to be totally surprised by this kind of change, but you might not immediately think of it as a reason to change the backup plan. While you're thinking about which server's got the important data, don't forget to monitor disk size and adjust your backup system accordingly.

Keep Up with the Technology

Next, keep an eye on the available technology. Just because something works extremely well one year doesn't mean that a better solution couldn't come along later. For example, one company began its backup program in the late 1980s with a Bernoulli box. The box was marvelous in its day, but a couple of years later they had replaced their server's 325M (remember, this was a few years ago) hard drive with a 1.2G monster. 90M Bernoulli cartridges weren't going to be much help with something that big, and the cartridges were kind of unwieldy, so the company purchased a 2G tape drive. The new tape drive was the greatest thing since sliced bread for a couple of years, but recently the company replaced its 1.2G hard disk with a 4G disk. To keep backups on one tape, they've purchased a 4G DAT backup system.

The scariest part is that the 4G tape drive cost about as much as the original Bernoulli box, and the 4G DATs are much cheaper than the 90M cartridges were. As you can see, if you don't keep up with hardware and software trends, your system can suffer for it.
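The arithmetic behind that story is worth running whenever you resize a disk. This Python sketch counts how many pieces of media one full backup needs, under some loud assumptions: the disk is completely full, nothing is compressed, and 1G is treated as 1,000M. The function name media_needed is mine.

```python
import math

def media_needed(disk_mb, media_mb):
    """Number of cartridges or tapes needed for one full, uncompressed backup."""
    return math.ceil(disk_mb / media_mb)

print(media_needed(325, 90))     # 4   (325M disk on 90M cartridges)
print(media_needed(1200, 90))    # 14  (1.2G disk on 90M cartridges)
print(media_needed(1200, 2000))  # 1   (1.2G disk on a 2G tape)
print(media_needed(4000, 2000))  # 2   (4G disk on a 2G tape)
print(media_needed(4000, 4000))  # 1   (4G disk on a 4G DAT)
```

Any result greater than 1 means someone has to be on hand to swap media during the backup, which is usually the point at which new hardware starts to look attractive.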

Summary: A Sample Disaster Scenario

The following scenario is based on a fictitious small company, but one that bears similarities to larger ones. First, this company has some disaster-planning mechanisms in place: a backup schedule and some redundant hardware. Other than that, however, they're not in a very good position to cope with problems. The situation and the problems facing the company are laid out in the following pages. Based on what you've read in this chapter, can you offer any suggestions on how Zippy's Gadget Training could survive a catastrophe with the business intact?

The owner of Zippy's, a well-known national marketer of training courses, realized one year that the company had no disaster plan prepared and that tragedy could strike at any time. Although staff members backed up their file server every day, the backups were stored on-site so that anything that damaged the office could damage the backups as well. Those backups comprised the entire disaster plan.

Zippy's is not a large company. Its staff consists of the owner, the business manager, a network administrator who also assists the business manager with purchasing, a coordinator who makes all the hotel arrangements for the classes, and a part-time receptionist. One day, the owner called a meeting of the business manager, network administrator, and coordinator. "Look," he said, "I know that everybody's busy, but we need to make sure that if I get sick, or the office gets hit by a bolt of lightning, or the server dies, the company doesn't go up in smoke. Everybody take a week and come back with the information I'm asking for."

The owner asked:

Everyone came back a week later with their information. The coordinator began, "I'm putting the bulk of the data on the server, as far as I can tell. All the hotel and travel information is there, as well as a complete list of our clients. I add to the hotel information just about every day, but the client list only changes every couple months when we pick up a new client at a show. Also, I asked the business manager, and she says that all the accounts are stored on her local machine, which she backs up every night and then takes the tapes home for protection."

The owner nodded and turned to the network administrator, who said, "We've got one 486 workstation with 16M of RAM for each of us, and that spare 386 that we don't use since we upgraded the receptionist's machine. The hard drives on the workstations are not at capacity; most people have about 100M left because we got those big drives when the price had just dropped again. All the machines are two years old, the same age as the company, except for the 386, which you owned before and donated to the cause.

"The network is a 10Base2 bus topology, and everyone's using WunderLAN network cards. The main laser printer is connected to the file server. So far, our client/server Lighting LAN network operating system is holding up well for our needs, but I'd like to get another backup program. The tapes are fine, but no one knows how to back up the server except me, and the interface is horrible, so it'd really be hard to teach anyone."

"Not a bad idea," responded the boss. "What does the business manager have to report?"

"We've got enough licenses for the word processing program, but we need to get another one for the call-management program that the coordinator's using because the receptionist helps her organize calls and sometimes uses the program at the same time. The property insurance covers all the machines and office equipment (and it's all still under warranty for another three years), but we have no liability insurance in case anyone gets hurt on the job, and there is no insurance protecting us from losses incurred from having to shut down the office.

"Opening the office at an alternative site looks kind of tricky, but there are a couple of ways we might manage it. The cost of renting and equipping another office to move into if this one becomes unusable (floods or whatever) is far more than we can afford. There's another option, though. A company housed in an office building not far from here isn't using all of its suite, and we could rent the couple of rooms that aren't being used as our alternative site. We'd have to bring working computers and telephones (they'll let us use their printer if we chip in for paper and toner), but at least we'd have a place to run the office from. I can get a good price on the rooms as long as we're using them only as a temporary site, for a week or so."

The owner leaned back in his chair. "Okay," he said, "it looks like our situation is this: we've got a lot of data being updated every day and backed up every day. Our office equipment is new and still under warranty; that's good. We have a spare workstation if one of the PCs dies, but no spare server. The network is a type that, although prone to failures, is relatively easy to re-cable if we're all in one room or close to it. As far as I can tell, our biggest problems are these:

"Based on this information, what should we do to keep our office running if something happens?"

The rest of the meeting doesn't need to be recorded because only the end result is important. To solve the identified problems, the following actions were taken:

Zippy's disaster plan followed the template in figure 21.1.

Fig. 21.1 This disaster plan template identifies some of the basic information that your plan should include. Based on your company's needs, you may change some entries or add others.