Chapter 21 Maintaining the Server and Documents

CONTENTS

Organizing Hierarchy
Access Control
Tool Features
Tools
SSI as a Maintenance Tool
Using CPP Instead of SSI

Setting up the initial Web server can be a challenge. It can also be a lot of fun. Once the server is up and running though the job is not over. Intranets are constantly changing to add new information and to keep existing documents in order.

Maintaining the site requires dedication from all the people working on the project. Links must be kept in order, site maps must be kept current, and new files must be added in an organized manner.

In this chapter, you will learn the following:

How to organize your site
How to use tools to verify links automatically
How to create a site map using tools
How to use Server Side Includes to make maintenance easier

Organizing Hierarchy

After the Web server is up and running, you must decide how to organize the documents on the Web server. It's a good idea to have a file system layout that somewhat mirrors the links in the files. This makes the server almost maintain itself.

Most Web sites have a central home page that splits up into different topics. By choosing the right topics, you can keep the document structure organized and still allow the server to grow and expand. The different topics may be departments, projects, or ideas. These topics may be controlled by one person, or by different people for each topic. In whatever way it is managed, it's easier to understand if the files are organized in a hierarchical order.

Each major topic should be a separate subdirectory of DocumentRoot. This allows the Web server to be split up at a later date if the load gets to be too much for one server. This also makes it easier to find HTML documents when you are looking for them in the file system. Figure 21.1 shows a layout for a Web server.

Figure 21.1 : Each major topic should be a located in a separate directory. This makes it easier to follow when you are editing the files.

For each main topic, you can use the DirectoryIndex, or equivalent directive, to define the home page or each topic. Scripts that tell where in the file system they are can then be written.

It may be desirable to have more than one layer of directories in which to organize files. This allows, for example, a directory tree to have a major topic for software engineering then a subtopic for each project the software group is working on.

NOTE

When deciding on a layout for the server it's often a good idea to map out on paper how the server will be organized. This map should be given to anyone adding documents to the server to help it stay organized.

Once the layout has been decided on, the next step is to decide who will be responsible for the different parts of the server.

Access Control

When a Web server is first created, there is almost always a nice concise plan of how the documents will be structured, how the site will look, and what can and can't be placed on the server. As time goes on, though, the plan is forgotten or ignored and the Web server tends to become a maze of broken links and loops.

Companies often have certain individuals who are responsible for the organization of the server. They make sure people don't add links that don't work and only add what is appropriate for the server. These Web developers often find it easier to define a policy about who can add what, and then use technical solutions to enforce the policy.

These policies are usually one of three types:

Open. This means anyone can do anything. There are no restrictions on who can edit, remove, or add links.
Distributed. This assigns a person to be responsible for a certain area of the Web server.
Central. One person is responsible for adding links, changing documents, and verifying the structure.

Each access policy has its advantages and disadvantages. Some make it easy to add documents and are not as strict; others are very limiting and can be too much of a burden. Commonly, more than one type of access policy is needed for an Intranet. It's important to try to choose the right policy for your company. By choosing an appropriate access policy, the Web server can be easy to update, while order and reason are maintained.

Open

The open model of access control is the easiest to setup. It allows anyone to add, remove, or change any document in the tree. This includes links, pages, and images.

The open model makes it very easy for developers and users to change documents. This ease of change often seems like a good idea at the start, but can quickly turn the site into a maze of disorganized links.

NOTE

The open model may be a good choice in a small company or group, since users can make their own changes without having to bother a developer. Larger organizations, however, might find the open model doesn't afford enough control so that the server quickly becomes disorganized.

To setup the open model, simply set the permissions on the files and directories in DocumentRoot. This enables anyone to make changes. In UNIX this is done by setting the modes to 777, "chmod 777 <name>". NT, Netware, and Notes all use GUI interfaces to set the permissions. In the open model the important thing is to enable anyone to read, write, delete, and create files or directories.

CAUTION

The open model makes it very easy to make changes to the server, but it can be dangerous if an intruder gets in since they can also change any file or directory.

Distributed

The distributed model enables certain people to change certain sections of the Web server. Unlike the open model, only selected people can change the documents or directories. This allows more control over where things are placed and how they look.

The distributed model can be setup to enable different developers to be responsible for separate areas or to jointly manage the entire server. Most companies choose to have each department manage their own document tree.

NOTE

The distributed model gives more control to individual groups. Then each group can add their own documents without having to worry about the entire site becoming disorganized.

When using the distributed model, permissions that allow certain group members to change the files must be set. Most systems make it easy to create a new group and add members. Simply create a group of users who have been approved to make changes. Once the group is set you need to make the files owned or managed by this group.

NOTE

It may be desirable to have more than one group manage a directory. In UNIX you can only have one group managing a directory or file. To get around this you can create a larger group that contains the groups you want to manage the documents.

Central

In the central model, a person or group manages the entire server. This group is required to make all changes to the document tree. Using this method creates centralized control over the server that can reduce the incidents of bad links or poorly placed documents.

The central model, however, makes it hard for users to get documents changed or added, since they must find a developer anytime there is a change. This can cause a bottleneck when many changes are required and take developers away from other job duties.

NOTE

Central access gives companies tight control over what goes on the Web server, but it can make it difficult to get documents changed.

To setup the central access model, simply setup the document tree so only the person responsible can make changes. This is easy to do in any operating system; check your documentation if you aren't sure how to do it.

NOTE

When setting permissions, make sure the server user can at least read all the files. If not, it will return an error when it tries to access them and they won't be displayed in the user's browsers.

Multiple Access Methods

Most organizations require more than one access method. It is common for the server's home page to be maintained using the central access method. This keeps anyone from adding new higher level topics.

In order for users to get changes done easily, the different groups are controlled using the distributed method. This allows groups to contact a local developer when making changes.

The open model may be used for groups as well, although this can lead to that section of the server becoming difficult to navigate.

Whichever model your company decides to use, make sure that the permissions are setup to properly enforce the policy. It is also necessary to make sure the server user can still read the documents.

Tool Features

As a Web server grows, it gets harder and harder to make sure all the documents have HTML tags and that all the links go somewhere. Small Web servers may have a person who checks the links on a daily or weekly basis, but with larger sites this may not be possible.

Fortunately there are some tools that can automatically check the site for you. These tools can be used to check the HTML syntax, or to check for broken links and images. Some tools also allow the site to be indexed and an HTML page to be generated showing the site map. Other tools contain simple ways to change and replace words, or check the spelling in the document.

HTML Validators

HTML validators are used to verify the syntax in an HTML document. This includes checking for misplaced tags, incorrectly nested tags, or tags that aren't closed properly.

Some browsers may handle problems in the HTML properly so checking the document using a browser is not the best way to verify the HTML syntax. If you are using a standard browser in the Intranet and never plan on allowing other browsers to be used, then you can just make sure it looks okay in the standard browser.

Most HTML editors can be used to validate HTML in a document, they can also be used to generate correct HTML. A few limitations are outlined in the following list:

Documents can't be automatically checked. If your site only has 20 or 30 documents, it might be okay to have to view each one in an HTML editor to verify it. If your site has 10,000 documents, this is out of the question.
HTML editors can't check the output of ServerSide Includes. If your site uses SSI, then you can't use an HTML editor to check the output of the program since the server creates it on the fly.

There are a few HTML validators that automatically check the documents that the server generates. They do this by performing a GET on the file, from the server. Not all validators will do this.

Link Checkers

All Web servers have many links. These links sometimes get changed or removed and not all the pointers get updated. Link checkers enable the Web administrator to verify that all the links on the Web server point to a valid document.

Link checkers normally work by getting the first document and then every included document. They do the same for each document. Any errors they receive are reported to the administator.

NOTE

It's possible to check the www log files for errors and fix the broken links that are reported. This is an example of reactive maintenance, fixing the problem when it is a problem. Running a link checker enables you to use proactive maintenance, or fix the problem before ay-one notices it. Proactive maintenance is better than reactive maintenance.

If your site has many links to servers not under your control, you should run a link checker frequently. If there are no external links, you only need to run a link checker when documents change.

Most link checkers can be setup with different configurations, for example, to look for local broken links or to check a certain number of links deep. This saves time because the link checker can be setup to verify only the parts of a site you specify.

Site Mappers

Some tools can be used to create a Web index of a site. This is handy for users to be able to get a view of the site. The index generated may not be quite what you are looking for and some changes might be required.

Tools

We have discussed some of the features ueful in a site maintenance package. This section will discuss some commercial and non-commerical tools available today.

NetCarta WebMapper

NetCarta WebMapper (http://www.netcarta.com/) is a commercial Web site tool that allows Web maps to be created and distributed. WebMapper can be used to allow users to graphically navigate a site by displaying a tree view of the site. When the user selects a page, WebMapper automatically starts a Web browser and displays the page.

WebMapper can also be used to help maintain a site. It allows checking of all links in a site. Using WebMapper can not only verify that a link points to another page, by viewing a graphical map of the site you can verify that the links point to the right page. This involves some human intervention but greatly simplifies checking the site.

WebMapper can run on a 486 with at least 8MB of RAM. Like most Windows programs, more memory means faster performance. It also requires 7MB of disk space. WebMapper runs on 32-bit windows systems such as Windows 95 and NT.

WebMapper also allows you to generate statistics on the number of links, images, audio files, video files, CGI scripts, applications such as java apps, Word documents or PDF files, and Internet services such as FTP or gopher servers. These statistics can help spot links that you were unaware of; for example, if someone added a link to an FTP site or added a CGI script.

WebMapper can also be used to generate an HTML map of the site. Simply map the site and export the map as HTML. This can be done by using the Export as HTML link under the File menu.

WebMapper has many nice features such as the graphical viewing and mapping of a site. It does not, however, check the HTML for correctness. Even without this ability though WebMapper is a very nice utility.

Incontext WebAnalyzer

WebAnalyzer from Incontext (http://www.incontext.com/) allows you to graphically view a site much like WebMapper. It runs on Windows 95 systems. WebAnalyzer also allows you to filter for certain characteristics such as link type or size. WebAnalyzer can also check that links are pointing to another document.

When you start WebAnalyzer you will see a blank screen with a spot to enter an URL. Enter the URL of the site you want to analyze. This will cause WebAnalyzer to start mapping and checking your site.

Once it is finished you should have a nice view of your site. You can see a page's properties by selecting it and choosing URL, Properties. This displayes the name, MIME type, last modified date, size, also the number of links to the page and from the page and the depth, or number of links it took to get there.

WebAnalyzer also allows you to filter for certain types of files. This features is very handy since it allows you to find any image bigger then 10KB. Then you can go to those images and try to reduce the size. You can also search for unresolved links, text-only links, or define your own search filter.

WebAnalyzer has a row of menus across the top. These include: File, for opening, closing, and saving maps; URL, for finding information about a page; Project, for starting or stopping a search, as well as making or viewing a report. Another menu is the tools menu. This allows you to customize how WebAnalyzer looks and runs. This is also where you define the Web browser and editor you are using. You can also define your filters under this menu.

WebAnalyzer is very nice for finding large images and for graphically viewing your site. Like WebMapper, WebAnalyzer doesn't check the HTML for correctness.

htmlcheck

htmlcheck can be used to check the syntax of HTML 2.0 or 3.0 documents. It can also be used to verify local links. htmlcheck can be run on any machine with either Perl or awk. This includes UNIX, DOS, Windows, and Macintosh.

htmlcheck can also handle Netscape extensions. htmlcheck may report errors or warnings on some HTML that is actually legal but doesn't work well on some browsers. This means it checks for viewability more than adherence to the specification.

You can get ftp htmlcheck from ftp://ftp.cs.buffalo.edu/pub/htmlcheck. There are three versions available:

htmlcheck.tar.Z. This is the compressed tar file containing the distribution. It requires uncompress and tar to unpack.
htmlcheck.tar.gx. This is the gzipped tar version. It requires gunzip and tar.
htmlcheck.zip. This requires Pkunzip or unzip to get at the files.

No matter which version you download and unpack, you should end up with a directory full of files. The HTML validator is made up of the following files:

htmlcheck.awk. This is used if you are using awk for the interpreter.
htmlcheck.pl. This is the Perl version of the validator.
htmlcheck.sh. This is a shell script that picks the best available interpreter.

To run the program you can use one of the following syntaxes:

awk-htmlcheck.awk [options] infile.html
perl htmlcheck.pl [options] infile.html
sh htmlcheck.sh [options] infile.html

The sh version requires a UNIX machine, but the other two versions can be run on any machine, as long as Perl or awk are available.

NOTE

Awk can be replaced with nawk or gawk; both are newer versions of awk and can be run on different platforms, such as NT or DOS.

The infile.html can be replaced with any HTML file or files. To test all HTML files in a directory, use *.html or *.htm. The output is sent to standard output and can be saved to a file by using normal redirection, such as appending > outputfile to the command.

The options that can be used are described in the manual page for htmlcheck. The most popular ones are:

netscape=1. If this option is given, Netscape tags do not generate an error.
html3=1. This option is used to check HTML3 files. It causes htmlcheck to not give an error when it sees an HTML3 tag.
cf=filename. This can be used to define a configuration file. This is useful when there are many options.

htmlcheck can also be setup to automatically generate a table of contents for an HTML document. This is done by using the makemenu program that is shipped with htmlcheck.

If the makemenu program is run with toc=1 as an option, it tries to generate a table of contents for the document. This is done by looking at the headers (H1-H6).

makemenu can be run one of three ways:

awk-f makemenu.awk [options] infiles
perl makemenu.pl [options] infiles
sh makemenu.sh [options] infiles

makemenu does a good job of indexing a single file, but can't handle an entire site yet.

MOMspider

MOMspider stands for MultiOwner Maintenance spider. It's designed to make it easier to maintain a Web space where multiple servers aren't all maintained by the same person. This describes a large Intranet quite well, since in a large Intranet each department might have its own server.

MOMspider is normally setup to run on a nightly basis from the UNIX cron program. This enables the Web administrator to keep abreast of changes. MOMspider is careful not to overload a Web server and only sends a few requests before pausing for a few seconds. This way the server can keep managing requests without any noticeable performance problems.

MOMspider requires Perl version 4.036 or higher to run, but should run on any UNIX machine that has PERL. In addition to the Perl interperter, MOMspider also requires the libwww.perl library version 0.30 or higher.

PERL can be downloaded from http://www.cis.ufl.edu/perl and libwww-perl can be downloaded from http://www.ics.uci.edu/libwww-perl. Once you have these two packages downloaded and installed, you are ready to get MOMspider.

MOMspider can be downloaded from:

Once it is downloaded and unpacked, you should have a number of files and directories:

MOMspider. This is the MOMspider main program.
*.pl. These are modules that help MOMspider run properly.
docs/*. This is the documentation for MOMspider. It also contains the postscript version of the WWW94 paper on MOMspider.
examples/*. These are example configuration files for MOMspider.

The first step is to edit MOMspider and make sure the correct pathname for perl is in the first line. The default is /usr/local/bin/perl.

You also need to define the MOMSPIDER_HOME environment variable. Csh users can use "setenv MOMSPIDER_HOME /path/to/momspider". Sh or Ksh users can use "export MOMSPIDER_HOME=/path/to/momspider".

Next you need to edit the configuration file. You have to change the LocalNetwork name so that MOMspider knows which links are local and which are remote. This should be all you need to change. If your server is slow, you might also want to check out the rest of this file. There are some options that set the max depth, max number of consecutive hits without a pause, and how long to pause.

Once the configuration file has been changed, you need to edit the instruction file. The instruction file is made up of several global directives and traversal tasks.

The global directives are:

SystemAvoid. The name of the systemwide avoid file.
SystemSites. The name of the systemwide sites file.
AvoidFile. The name of the users avoid file.
SitesFile. The name of the users sites file.
SitesCheck. The number of days between checking the /robots.txt file. Normally 15.
ReplyTo. Who should get e-mail about the spiders actions.
MaxDepth. The maximum number of links deep to traverse. Used to keep MOMspider from getting stuck in a loop.

There are three types of traversals. These are:

Site. This searches an entire site. It doesn't follow links off site, but checks to make sure they exist.
Tree. This checks a tree in a server. It doesn't follow links out of the directory tree but verifys that they exist.
Owner. This looks at the metainformation to check the links owners. It only checks links that are owned by the correct group. This is not normally used.

The other directives that we will look at include:

Name. The owner of the spider is used in owner traversals. This is a required field and must be a single word.
TopURL. This is where to start the traversal.
IndexURL. Where the index file will be for this traversal. Must be a full URL and must be included.
IndexFile. The actual filename for the IndexURL file.
IndexTitle. This is an optional field that is used for the title of the index file.
EmailAddress. This is who the report will be e-mailed to.
EmailBroken. If this is in the file, a mail message will be sent when a broken link is found.
EmailRedirected. If this directive is specified, then an e-mail message will be sent when a link is a redirected link.
Exclude. This is a URL prefix that should not be traversed. This might be something that is already checked or out of your control.

A properly organized Web site can probably use an instruction file, similar to the following:

AvoidFile   /path/to/momspider/momspider-avoid
SitesFile   /path/to/momspider/momspider-sites
<Site 
    Name          Company
    TopURL        http://www.company.com/
    IndexURL      http://www.company.com/siteindex.html
    IndexFile     /webserver/htdocs/siteindex.html
    IndexTitle    MOMspider Index for Company.com
    EmailAddress  admin@company.com
    EmailBroken
    EmailRedirected
>

This tells MOMspider to check all links in www.company.com and e-mail admin@company.com, if any broken or redirected links are found. It will also generate an index file and put it in /webserver/htdocs/siteindex.html.

The index MOMspider creates is more useful for reference than an actual drop in site map. It contains the following information:

The date the index was started.
Who ran the index.
A table of contents for the index. This is at the top of the page.
A section of links and documents in the index. These links contain the return code, the title of the file, and the last-modified date.
A summary table. This lets the administrator see statistics on how many links were checked, avoided, broken, redirected, and changed.
A list of broken links.
A list of redirected links.
A list of changed link destinations.
A list of expired documents.

SSI as a Maintenance Tool

Server Side Includes (SSI) can be used to make a site easier to maintain. Scripts can be written to automatically generate links to the last referred document, the document up a layer, a help screen, or any information that should be on each document.

SSI can be used to create a systemwide template. This can automatically add a header and footer to all the pages. It allows the look and feel of the document to be changed easily.

A sample header might add the <HTML>, <BODY>, and <TITLE> tags. A footer may have the company information, copyright information, and links to the index, help, and home pages.

Using SSI to add headers and footers to a page makes it easy to change how a site looks and to add new features as they are developed. Since every page uses these headers and footers you can easily change the entire document by editing one simple file.

NOTE

Using SSI adds some CPU overhead since the server must parse the file each time it is accessed. Some sites may find this to be too much CPU processing for the server to handle without slowing down.

Below is a sample header program that can be used to automatically add the title and opening tags to a document. It's written in a shell script, but can be written in any language.

echo '<html>'
echo '<head>'
echo '<title>'"$*"'</title>'
echo '</head>'
echo '<body>'

This script simply adds the header information and uses the arguments for the title. When called like this:

<!--#exec cmd="header Company Web Site" -->

It will prepend the following to the HTML document when it sends it:

<html>
<head>
<title> Company Web Site</title>
</head>
<body>

Here is a sample footer SSI, which can be used to append the generic information to the bottom of every page:

echo '<A HREF="/phone.number.html"><img src="/images/phone.gif"  alt="Phone"></a>'
echo '<A HREF="/readme.html"><img src="/images/question.gif"  alt="?"></a>'

echo '<HR>'
echo 'This page, and all contents, are Copyright  1995 by Company'
echo '</BODY>'
echo '</HTML>'

This would append the following to the end of every script that called it:

<A HREF="/phone.number.html"><img src="/images/phone.gif"  alt="Phone"></a>
<A HREF="/readme.html"><img src="/images/question.gif" alt="?"></a>
<HR>
This page, and all contents, are Copyright  1995 by Company
</BODY>
</HTML>

You can organize your document tree to contain navigational links, which make it easier for users.

Let's assume that we created a main directory with a separate directory for each main topic. In each directory the main page is called index.html. Any subdirectories are also set up this way. This gives our script the following information about navigation:

If the name of this document is index.html, up a level would be referenced by ../index.html.
If the name of this document isn't index.html, then up one would be index.html.
If the name of this document is /index.html, we are already at the home page.
Some browsers send a HTTP_REFERER header, which tells the server where they came from. If HTTP_REFERER is set, we can use that to generate a back page.

Our script can now generate links to up one page, Home (if we aren't already there), and back (if our browser supports it).

The finished footer script is follows:

#!/bin/sh
echo '<HR>'
# Are we already home?
#If so then we don't send an up or a home icon
if [ x"$DOCUMENT_URI" != x"/index.html" ]
then
        echo '<A HREF="/index.html"><img src="/images/logo.gif"  alt="Home"></A>'
        echo '<A HREF="../index.html"><img src="/images/up.gif"  alt="Up"></A>'
fi

# If this is not called index.html then up one would be index.html
if [ x"$DOCUMENT_NAME" != x"index.html" ]
then
        echo '<A HREF="index.html"><img src="/images/up.gif"  alt="Up"></A>'
fi

# This checks to see if we know where the user came from if so we
#   can generate a back icon.
if [ x"$HTTP_REFERER" != x"" ]
then
        echo '<A HREF="$HTTP_REFERER"><img src="/images/back.gif"  alt="Back"></A>'
fi

#everyone gets this part
echo '<A HREF="/phone.number.html"><img src="/images/phone.gif"  alt="Phone"></a>'
echo '<A HREF="/readme.html"><img src="/images/question.gif"  alt="?"></a>'
echo '<HR>'
echo 'This page, and all contents, are Copyright  1995 by Company'
echo '</BODY>'
echo '</HTML>'

Using CPP Instead of SSI

If your server doesn't support SSI, you can perform some of the same functions by using the CPP program. CPP is a program that the C compiler uses. It handles such things as #include filename.h. It can also be used as an HTML pre-processor to include static footers and headers. This requires two files: the Source file and the HTML file. In order to make changes you edit the Source file and run it through CPP to create the HTML file. The HTML file is the file that the server sends.

For example, instead of using an SSI to append our original static footer, we could create a file called footer.html. In it place the following:

<A HREF="/phone.number.html"><img src="/images/phone.gif"  alt="Phone"></a>
<A HREF="/readme.html"><img src="/images/question.gif" alt="?"></a>
<HR>
This page, and all contents, are Copyright  1995 by Company
</BODY>
</HTML>

Then in all the HTML source files you put:

#include "footer"

This gets replaced with the contents of the footer file when you ran it through CPP. CPP is not available on all systems and requires two files instead of one. It does not, however, have the CPU overhead that SSIs have and for large sites the trade-off might be worth it.