14
Archive and Document Management


The typical Webmaster is often challenged by tasks other than creating HTML or writing CGI programs. He or she also must be familiar with many other techniques and practices that are commonly used to build and maintain a networked archive and its components. In this chapter, we'll discuss a number of those tasks and provide you with some tools to help accomplish them.

General Archive Management Considerations

The art and philosophy of archive management on a network predates the Web by a long time. One of the primary intents of the Internet was, and still is, to allow the sharing of documents. Some of the early protocols and tools for sharing electronic resources are still in wide use today, including FTP, NFS, and even Gopher.

When making resources available via any type of server, you need to consider a number of tactics and practices. Some of these are related to security and are explored in Chapter 3, "Security on the Web." There are many others, and as far as I know, a document which covers them all does not exist. The collective experience of the many thousands of administrators who have contributed to and defined this body of knowledge would be difficult to summarize in a library, much less a single chapter in a book.

There are, however, a number of general issues that you become aware of as you develop an archive and explore the work that others have done. I hope to cover many of the important issues and their associated tasks in this chapter. Again, and as always, you can explore other resources, including Usenet, various Web sites, and possibly even individual administrators who seem to have done things in a way that might work for you. If you find such a site, you might try dropping a line to the administrator, asking him or her to share a few tips. Of course, you may be completely ignored, but you may also be rewarded with a buried bone or two, which might save you time and energy in the future.

You'll notice in this chapter that Perl isn't the primary topic on every page. As we've said, the intent of this book is to show and teach you how to use Perl in your Web programming duties and tasks. On the other hand, in other works we've studied, the coverage of the issues and topics in this chapter seems to be rather minimal. I'm covering some of the topics in this chapter primarily for the sake of completeness.

Planning, Design, and Layout

The structure and layout of your archive is one of the most important decisions you'll make if you're just starting out. There are a number of issues to consider, and decisions to make, when you first lay out your archive; after you've made these decisions, it won't be quite so easy to make changes to the structure or layout. You should plan carefully and try to consider all of the possibilities for what may happen to your archive in the future--before you ever create the first directory or file. Let's consider some of the most important issues now.

Document Naming

The names that you give to your documents and directories are important for several reasons. First, and possibly most useful to you as the archive maintainer, is having some notion of what's inside a document or directory based on its name. Another consideration is whether the files and directories you create must be usable on DOS or other architectures that don't support long filenames.



NOTE:

There are essentially two schools of thought on naming file system elements. The first stipulates that you should assign names that allow the contents of a file or directory to be determined from its name. The second, which follows the ISO9660 specification, stipulates that names should follow the 8.3 format and use only alphanumeric characters. Obviously, the restriction to only eight characters in the primary component of the name limits your ability to assign names based on contents. You should consider whether your archive will ever need to reside on an operating system that requires the 8.3 format (DOS), or whether you'll ever make it available via CD-ROM. In either case, you'll probably want to choose the ISO9660 naming conventions. Don't forget that in the future you may also wish to have your archive mirrored to a system that doesn't support long names. If you're already running under a file system that handles long names and need to migrate or mirror your archive to a system that only handles short names, you might have to make some major changes in order for everything to work. We'll discuss how to perform this transition later in this chapter, in the section entitled "Moving an Entire Archive," but it's definitely nontrivial.


In any case, you'll need to reserve the extension component of the filenames for MIME typing, which allows your server to tell the browser how to handle each document. See Chapter 5, "Putting It All Together," for more details. Be sure to check that your server's mime.types and srm.conf files follow the standard conventions for extensions, and add configuration entries to your server for any additional types that you define.

Archive Hierarchical Organization

The directory tree that makes up your archive is one that you'll be "climbing" up and down quite often. You should make its branches easy to remember and intuitive to understand. Each new resource in your archive will have to be stored somewhere in this tree. When you use an unambiguous, comprehensive structure for classifying resources according to their storage location, deciding where to place things (and where to find them later) will be a lot easier.

After you've decided on a naming convention, you'll want to spend some time planning the structure of the directories. Naturally, if you're using long names, you can be pretty creative with your layout; if not, then I recommend that you use some sort of simple mapping from an ordered list of eight-character names to corresponding groups or classifications.

You might point out, and you'd be correct, that the structure of the HTML document already gives the notion of hierarchy to the resources to which it refers. However, this applies only to the browser and gives no advantage to the maintainer of the documents. Creating structure, in the form of directories (or folders) in your archive, makes the HTML a bit more complicated to write but relieves the confusion and intimidation of having all the files reside in one location.

Configuration Management

"A set of procedures for tracking and managing software throughout its lifecycle" (Configuration Management for Software, Compton & Connor, 1994, ISBN
0-442-01746-4).

This notion of structure also arises from the science of configuration management in general. We'll be discussing another aspect of configuration management, revision control, later in this chapter.

Access and Security

Another advantage of creating a structured archive is the ability of most HTTP servers to restrict access on a per-directory basis. Configuring the server to do this has been covered elsewhere, and I won't go into how it's done here; I point it out only to highlight the added value of planning and creating a sound directory structure for your archive. The implication, of course, is that you've planned carefully and created the structure in such a way as to use this feature selectively, which is another consideration in the planning stage.

Top-Level Documentation

Every archive directory should have some sort of explanation of its purpose and, ideally, a description of its contents. Whether this description is intended only for the maintainer or also for public access is up to you. Ideally, this file would be located within the directory that it describes. It could be the index.html and thus serve the dual purpose of describing the contents to both the browser and the maintainer. This document (probably just a text file) will help anyone considering a change to the archive's contents decide whether a given location is appropriate for the change or addition.

Revision Control

The process (and rigor) of revision control is often overlooked or even ignored when an administrator manages an archive. However, there are some very good reasons you should use some sort of version control when creating and updating your resources.

A Policy for Change--Description

The process of making your documents available via the Web is really one of publishing. When you, as a representative of a company, make a document available, you're making a statement that represents your company. While some of the issues and legalities are still murky, you should consider the liability that you or your company assumes when making documents available. The information within the documents should be correct and, insofar as is possible, verifiable and free of misrepresentations.

Such considerations give rise to the need for a policy for the management of the archive. This policy should be both comprehensive and understandable by anyone who will participate in the creation of the archive or the modification of its contents.

A policy for change can be as formal as you like. In general, the following items should be considered when designing the policy:

The level of complexity in a policy for change that can arise out of just these four items might surprise you. For instance, when dealing with source code and documentation for a given software product, some organizations implement a multi-tiered structure of committees, forms, and checklists through which any change or addition must pass before being applied to the released document(s) or code. At some point, productivity may suffer if the process becomes too complex. The idea is to find a happy medium between no policy at all and one that bogs you down.

A Policy for Change--Motivation

Plenty of things can go wrong when you're populating an archive or updating its elements. In the best cases, the Webmaster or Web team is immediately made aware of the problem and is able to deal with it. On the other hand, some minor errors in documents or functionality may go unnoticed for a long time, potentially becoming a permanent problem due to the number of copies of the documents that were distributed with the error. With the number of indexers, auto-mirrors, archiving proxies, and other forms of duplication that exist today on the Web, the proliferation of errors can be almost immediate and quite difficult to overcome.

Obviously, the best policy is one in which no documents would be distributed with errors or misrepresentations. However, implementing such a policy is quite difficult, even if you already are using a sound policy for change. If you don't have a change policy, then the difficult becomes practically impossible. A number of situations can lead to errors; let's consider a few of them.

There are other potential problem scenarios I haven't mentioned, but these should give you the general idea. In order to properly maintain an archive, especially as a group or team representing a company, it's essential to use some form of revision control and to have a well-understood policy for change.

A Policy for Change--Solutions

A variety of tools and systems are available to implement revision control. Some of them are available for free, and others are commercial and well supported. An organization may also wish to implement a home-grown solution, perhaps using Perl and some other tool or tools. We're not going to attempt to implement such a tool in this book, and I'm not going to try to give a comprehensive overview here; the following descriptions of some of the most popular solutions should give you an idea of what's available.

RCS/CVS

This toolset is probably the most widely used revision control tool on UNIX operating systems. It's a GNU tool, originally created at Purdue University, and like other GNU software it has received contributions, bugfixes, and patches from caring individuals all across the Internet. RCS stands for Revision Control System. CVS is a front-end to RCS that extends its functionality by providing the ability to create a private copy of an entire suite of documents and then optionally lock, modify, and check in a given document. Each document's changes (deltas) are kept in a storage container corresponding to the name of the document. RCS/CVS operates primarily on text files. Ports are also available for Macintosh and DOS/Windows. It's freely available and well understood, and help is fairly easy to find via the documentation, Usenet, or mailing lists. RCS/CVS is available at most standard Usenet sources archive sites and always at Purdue: ftp://ftp.cs.purdue.edu/pub/RCS/.

ClearCase

This toolset, available from Pure-Atria, actually implements a complete file system and is possibly the most powerful, complex, and configurable configuration management tool available. It's primarily used for source code control and software project management but makes a very nice archive management tool as well. I keep the chapters and sample code that comprise this book under ClearCase. ClearCase lacks a Macintosh interface, but it can export its files via NFS. It operates on text files, binaries, images, and even directories, along with any other filetype you wish to configure. It is available through the Pure-Atria sales staff and through the Web site at http://www.pureatria.com.

SourceSafe

SourceSafe is another commercially available toolset. In terms of functionality, it looks and feels much like CVS, but it implements a database for its internal references to revisions and history and has additional features and user interfaces. SourceSafe operates on text, binaries, and images. I haven't used SourceSafe in the role of archive management, but it seems to have the necessary functionality, and Microsoft has been actively adding functionality and support since it acquired the product. Implementations of SourceSafe are available for UNIX, Macintosh, and Windows; see http://www.microsoft.com/SSAFE/Default.html for more information.

MKS Source Integrity

The MKS Source Integrity toolset is another revision control system. I haven't actually seen this implementation, but because it's from MKS, you can bet it has an implementation for Windows. Contact MKS through its Web page at http://www.mks.com.

Each of these tools has advantages and disadvantages, and there are certainly other tools available that I'm not aware of. Investigate as many systems as you can, then choose one and stick with it. The process of checking the archive's elements in and out each time you wish to update them might seem a bit rigorous at first, especially for those who've never used a revision control system, but in the long run, revision control always pays off, and you'll be glad you took the time to implement it.
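
For example, with RCS alone, the day-to-day cycle for a single HTML file might look something like the following commands (the filename and log message here are made up; see the ci, co, and rcsdiff manpages for the details):

ci -u index.html        # initial check-in; leaves a read-only working copy
co -l index.html        # lock the file and check it out for editing
ci -u -m"fix broken product link" index.html
rcsdiff -r1.1 index.html   # compare the working file against revision 1.1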

Summary of Archive Management Issues

The topics we've discussed so far form the basis of the important issues and considerations for managing an archive on the Net. As I mentioned earlier, the needs and requirements for a configuration management plan vary from site to site. Even the simplest plan should include instructions and policy for the following actions, derived from the policies and problems listed earlier in this chapter:

The rest of this chapter focuses on the specific tasks within these topics that you might face in the day-to-day management of the archive. The focus also now returns to how you can use Perl to implement these tasks.

Parsing, Converting, Editing, and Verifying HTML with Perl

One of the more important, but less well-documented duties of the Webmaster is updating and verifying the HTML in the archive's documents. Aside from the need for revision control, which we've already mentioned, how does one actually go about making changes, potentially en masse, to the archive's HTML documents? Once the changes have been performed, how does the Webmaster verify that they have not affected any other component of the archive? Fortunately, text management is one of the great strengths of Perl, and there are a number of modules and tools for accomplishing this task.

General Parsing Issues

The process of parsing an HTML document implies several algorithms. First, you must be able to recognize, and possibly take action on, each of the elements in the HTML specification as it appears in the input stream, on the fly. Usually, you'll wish to find the URLs or anchors in a document, but even this turns out to be non-trivial when you're attempting to match a URL with a single regular expression. Even the newer Perl5 regular expression extensions don't completely solve the problem, partly because determining validity depends on whether the URL is complete, partial, or relative. Fortunately, there is a Perl5 module devoted specifically to parsing HTML and URLs.

As it turns out, the best way to parse and determine the validity (but not necessarily the retrievability or existence) of a given URL is via a method chosen dynamically from a table or set based on the URL's protocol specifier. This sort of runtime decision-making is exactly how the URI::URL.pm module works, and using it saves you a lot of guesswork, testing, and/or debugging, and spares you from having to create potentially mind-boggling regular expressions to match the various types of URLs that exist.
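
As a tiny illustration of that dispatch (the URLs below are made up for the example), URI::URL can resolve a relative path against a base document and report the scheme of whatever it's handed:

use URI::URL;

# Resolve a relative href against the page it appeared on.
my $u = url('../images/logo.gif', 'http://www.example.com/docs/index.html');
print $u->abs, "\n";              # prints http://www.example.com/images/logo.gif

# The same constructor copes with entirely different schemes.
print url('mailto:www@example.com')->scheme, "\n";    # prints mailto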

When parsing HTML to find the embedded URLs, you'll also need to use the module called HTML::TreeBuilder. This module takes care of the gory details in parsing the other internal elements from an HTML document and builds an internal tree object to represent all the HTML elements within the file. These modules are part of the Web toolkit that you've been using throughout this book, called libwww. The complete suite of libwww modules includes the URI, HTML, HTTP, WWW, and Font classes. Libwww is written and maintained by Mr. Gisle Aas of Norway. The latest version is always available from his CPAN directory:

~authors/id/GAAS/

Listing 14.1 demonstrates how to use these modules to extract simple URLs from an HTML file.

Listing 14.1. simpleparse.

use URI::URL;
use HTML::TreeBuilder;

my($h,$link,$base,$url);

$base = "test.html";
$h = HTML::TreeBuilder->new;
$h->parse_file($base);

foreach $pair (@{$h->extract_links(qw<a img>)}) {
    my($link,$elem) = @$pair;
    $url = url($link,$base);
    print $url->abs,"\n";
}

This short script prints out all the links in the file test.html that appear in A or IMG tags. If you want to parse a file returned directly from the server, you would use the parse method instead of the parse_file method. You'll also need to add the capability to slurp an HTML file directly from the server with the declaration

use LWP::Simple qw(get);

Now the script looks like that in Listing 14.2.

Listing 14.2. simpleparse-net.

use URI::URL;
use HTML::TreeBuilder;
use LWP::Simple qw(get);

my($h,$link,$base,$url);

$base = "http://www.best.com/";
$h = HTML::TreeBuilder->new;
$h->parse(get($base));

foreach $pair (@{$h->extract_links(qw<a img>)}) {
    my($link,$elem) = @$pair;
    $url = url($link,$base);
    print $url->abs,"\n";
}

Running Listing 14.2 with the libwww modules properly installed creates the following output, listing all of the A and IMG links within the page at http://www.best.com/index.html, my current ISP:

http://webx.best.com/cgi-bin/imagemap/mainpl.map
http://www.best.com/images/mainpnl3.gif
http://www.best.com/about.html
http://www.best.com/images/persoff.gif
http://www.best.com/corp.html
http://www.best.com/images/corpserv.gif
http://www.best.com/policy.html
http://www.best.com/images/ourpol.gif
http://www.best.com/support.html
http://www.best.com/images/faq.gif
http://www.best.com/prices.html
http://www.best.com/images/pricepol.gif
http://www.best.com/pop.html
http://www.best.com/images/lan.gif
http://www.best.com/corpppp.html
http://www.best.com/images/webpd.gif
http://www.best.com/client.html
http://www.best.com/images/hosted.gif
http://www.best.com/images/announce.gif
http://www.onlive.com/
http://crystal.onlive.com/beta/index.htm
http://www.best.com/best_resort/entrance.sds
mailto:info@best.com
http://www.best.com/images/best4.gif
mailto:www@best.com

Note that this listing may vary, depending on your location and whether there have been changes to index.html since this chapter was written. Now that you've seen how to use the LWP modules to do some very simple parsing, let's take a look at how to use them for some useful tasks.

Editing and Verifying HTML

You can use Perl in a number of ways to make changes in, and perform verification and validation on, HTML. There are modules that handle the parsing and substitutions, as well as several complete tools to check the syntax of the HTML and the validity of the internal anchors to other locations and documents. The following examples demonstrate how to use these tools to perform tasks that may confront you as a Webmaster from time to time.

Converting from Absolute to Relative URLs

Suppose that at some point, while the Webmaster is coming up to speed on the HTML specifications, he or she creates a document that uses the complete form of the URL in all links, giving the scheme, host, and path. Later, as understanding grows, the Webmaster wishes to go back and change all the links that refer to local resources to the relative form. This way, if another site mirrors the archive, requests for local documents from the mirrored copy will be served by the mirror site instead of the master site.

In order to accomplish this task, you'll need to start with the script that parses URLs generally, shown in Listing 14.2. Then you'll add the capability (see Listing 14.3) to print out the new HTML file with the links changed to relative form when they refer to local resources.

Listing 14.3. relativize.

#!/usr/bin/perl

# relativize - parse html documents and change complete urls
# to relative or partial urls, for the sake of brevity, and to assure
# that connections to mirror sites grab the mirror's copy of the files
#
# Usage: relativize hostname file newfile basepath
# hostname is the local host
# file is the html file you wish to parse
# newfile is the new file to create from the original
# basepath is the full path to the file, from the http root
#
# Example:
# relativize www.metronet.com perl5.html newperl5.html /perlinfo/perl5
#
# Note: does not attempt to do parent-relative directory substitutions

use HTML::TreeBuilder;
use URI::URL;
require 5.002;
use strict;

my($h,$link,$base,$url);
my($usage,$localhost,$filename,$newfile,$base_path);

$usage = "usage: $0 hostname htmlfile newhtmlfile BasePath\n";
$localhost = shift;
$filename  = shift;
$newfile   = shift;
$base_path = shift;

die $usage unless defined($localhost) and defined($filename)
     and defined($base_path) and defined($newfile);

$h = HTML::TreeBuilder->new;
$h->parse_file($filename);

(open(NEW,">$newfile")) or die($usage);

$h->traverse(\&relativize_urls);
close(NEW);

sub relativize_urls {
    my($e,$start,$depth) = @_;

    # if we've got an element
    if(ref $e){
        my $tag = $e->tag;
        if($start){

            # if the tag is an "A" tag
            if($tag eq "a"){
                my $url = URI::URL->new( $e->{href} );

                # if the scheme of the url is http
                if($url->scheme eq "http"){

                    # if the host is the local host, modify the
                    # href attribute to have the relative url only.
                    if($url->host eq $localhost){

                        # if the path is relative to the base path
                        # of this file (specified on command line)
                        my $path = $url->path;
                        if($path =~ s/^$base_path\/?//){
                            # a filetest could be added here for assurance
                            $e->attr("href",$path);
                        }
                    }
                }
            }
            print NEW $e->starttag;
        }
        elsif((not ($HTML::Element::emptyElement{$tag} or
                $HTML::Element::optionalEndTag{$tag}))){
            print NEW $e->endtag,"\n";
        }

    # else text stuff, just print it out
    } else {
        print NEW $e;
    }
}

In the subroutine relativize_urls(), I've borrowed the algorithm from the HTML::Element module's as_HTML() method to print everything from within the HTML file by default. A reference to the relativize_urls() subroutine is passed into the traverse() method, which HTML::TreeBuilder inherits from the HTML::Element class. The desired changes to the complete URLs that refer to local files are made after verifying that the path component begins with the specified base path and that the host is the local host. The output goes to the new HTML file specified on the command line.

There are plenty of other uses for the HTML::TreeBuilder class and its parent classes, HTML::Element and HTML::Parser. See the POD documentation for the libwww modules for more details.

Moving an Entire Archive

Copying an external archive may give rise to the need to change the file or directory names associated with the external site, and then to correct the URLs in the HTML files. There may be several reasons for this: The copying site may wish to use a different layout for the archive; or, as mentioned previously, it may be using a DOS file system or following an ISO9660 naming policy, which requires a change of file or directory names if they're not ISO9660-compliant. Placing an archive's contents on a CD-ROM may also require renaming or reorganizing the original structure. Whatever the reason, this task can be quite intimidating to perform.

The algorithm itself implies six steps and three complete passes over the archive, using File::Find or something similar, in order to get everything right. Let's consider the case where you need to migrate an archive from a UNIX file system, which allows long filenames, to a DOS file system, which doesn't. I'm not providing a complete implementation here; I'm simply outlining the algorithm, based on a former consulting job where I performed this service for a client.

Step 1: List Directories
The first pass over the archive should create a listing file of all the directories, in full path form, within the archive. Each entry of the list should have three components: the original name; then, if the current directory's name is too long, the original name with any parent directories' new names; followed by the new name, which is shortened to eight alphanumeric characters, doesn't collide with any other names in the current directory, and is prepended with all of the parent directories' new names.

Step 2: Rename Directories
Directories should be renamed during this step, based on the list created during pass one. The list has to be sorted hierarchically--from the top level to the lowest level--for the renaming operations to work. The original name of the current directory, with its parent directories' new names as a full path, should be the first argument to rename(), followed by the new short name, with any new parents in the path. These should be the second and third elements of the list created during pass one.

Step 3: List Files
The third step makes another pass over the archive, creating another list. This list will have the current (possibly renamed) directory and original filename of each file, as a full path, followed by the current directory and the new filename. The new filename will be shortened to the 8.3 format, with verification, again, that there are no namespace collisions in the current directory.

Step 4: Rename Files
The fourth step should rename files, based on the list created in pass three.

Step 5: Create HTML Fixup List
The fifth step in the algorithm takes both lists created previously and creates one final list, with the original filename or directory name for each file or directory, followed by the current name. Again, both of these should be specified as a full path. This list will then be used to correct any anchors or links in your HTML files that have been affected by this massive change.

Step 6: Fix HTML Files
The final step in the algorithm reads in the list created in Step 5 and opens each HTML file to fix the internal links that still use the original names and paths of the files. It should refer to the list created in Step 5 to decide whether to change a given URL during the parsing process, and then overwrite the current HTML file. Line-termination characters should be changed to the appropriate ones for the new architecture at this time, too.
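
To make the first two steps concrete, here is a minimal sketch using File::Find. The shortening rule, the collision handling, and the treatment of the archive root are all simplifying assumptions for illustration; a real migration script would also write out the rename list for use in Steps 5 and 6.

#!/usr/bin/perl
# shortdirs - a rough sketch of Steps 1 and 2 only: list the directories,
# then rename them top-down to short names.  Not a complete migration tool.
use strict;
use File::Find;

my $root = shift or die "usage: $0 archive_root\n";
my (%newpath, %taken, @renames);
$newpath{$root} = $root;     # the root itself keeps its name

# Pass one: for each directory, record its old name under its parent's *new*
# path, plus a collision-free short name.  find() visits parents before children.
find(sub {
    return unless -d $_ && $File::Find::name ne $root;
    my $newparent = $newpath{$File::Find::dir};
    my $short     = shorten($_, $newparent);
    $newpath{$File::Find::name} = "$newparent/$short";
    push @renames, [ "$newparent/$_", "$newparent/$short" ];
}, $root);

# Pass two: rename top-down, so every parent already has its new name by the
# time its children are renamed.
for my $pair (@renames) {
    my ($old, $new) = @$pair;
    next if $old eq $new;
    rename($old, $new) or warn "rename $old -> $new failed: $!\n";
}

# Reduce a name to at most eight alphanumeric characters, unique within its parent.
sub shorten {
    my ($name, $parent) = @_;
    (my $short = lc $name) =~ s/[^a-z0-9]//g;
    $short = substr($short, 0, 8) || "dir";
    my $i = 0;
    $short = substr($short, 0, 6) . sprintf("%02d", $i++)
        while $taken{"$parent/$short"}++;
    return $short;
}
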
It's a rather complicated process, to be sure. Of course, if you design your archive from the original planning stages to account for the possibility of this sort of task (by using ISO9660 names), then you'll never have to suffer the pain and time consumption of this process.

Verification of HTML Elements

The process of verifying the links that point to local documents within your HTML should be performed on a regular basis. Occasionally, and especially if you're not using a form of revision control as discussed previously, you may make a change to the structure of your archive that will render a link useless until it is changed to reflect the new name or location of the resource to which it points.

Broken links are also a problem that you will confront when you're using links to external sites' HTML pages or to other remote resources. The external site may change its layout or structure or, more drastically, its hostname, due to a move or other issues. In these cases, you might be notified before the pending change or directly thereafter--if the remote site is "aware" that you're linking to its resources. (This is one reason to notify an external site when you create links to its resources.) Then, at the appropriate time, you'll be able to make the change to your local HTML files that include these links.
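
Before reaching for a packaged tool, you can cobble together a first-cut check of external links from the modules you've already seen. The following sketch is only an illustration under simplifying assumptions (it checks absolute HTTP links only, and trusts LWP::Simple's head() function to decide whether a target is retrievable); the tools described next are far more thorough:

#!/usr/bin/perl
# checklinks - report apparently broken HTTP links in local HTML files
use strict;
use HTML::TreeBuilder;
use URI::URL;
use LWP::Simple qw(head);

for my $file (@ARGV) {
    my $h = HTML::TreeBuilder->new;
    $h->parse_file($file);
    for my $pair (@{$h->extract_links(qw<a img>)}) {
        my ($link) = @$pair;
        my $url = url($link);
        next unless defined($url->scheme) && $url->scheme eq 'http';
        print head($link) ? "ok      $link\n" : "BROKEN  $link (in $file)\n";
    }
}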

Several scripts and tools are available that implement this task for you. Tom Christiansen has written a simple one called churl. This script does limited verification of URLs in an HTML file retrieved from a server; it verifies the existence and retrievability of HTTP, FTP, and file URLs. It could be modified to suit your needs and, optionally, to verify relative (local) or partial URLs. It's available at the CPAN in his authors directory:

~/authors/id/TOMC/scripts. 

He has also created a number of other useful scripts and tools for use in Web maintenance and security, which also can be retrieved from his directory at any CPAN site.

The other tool we'll mention here, called weblint, is written by Neil Bowers and is probably the most comprehensive package available for verification of HTML files. In addition to checking for the existence of local anchor targets, it also thoroughly checks the other elements in your HTML file.

The weblint tool is available at any CPAN archive, under Neil Bowers's authors directory:

~/authors/id/NEILB/weblint-*.tar.gz. 

It's widely used and highly recommended. Combining this tool with something such as Tom Christiansen's churl script will give you a complete verification package for your HTML files. See the README file with weblint for a complete description of all the features.

Parsing HTTP Logfiles

As Webmaster, you may be called upon, from time to time, to provide a report of the usage of your Web pages. There may be several reasons for this, not the least of which may be to justify your existence :-). More likely, though, the need will be to get a feel for the general usage model of your Web site or what types of errors are occurring.

Most of the available httpd servers provide you with an access log by default, along with some sort of error log. Each of these logs has a separate format for its records, but there are a number of common fields, which naturally lends itself to an object-oriented model for parsing them and producing reports.

We'll be looking at the Logfile module, written by Ulrich Pfeifer, in this section. It provides you with the ability to subclass the base record object and has subclass modules available for a number of servers' log files, including NCSA httpd, Apache httpd, CERN httpd, WUFTP, and others. If there isn't a subclass for your particular server, it's pretty easy to write one.

General Issues

An HTTP server implements its logging according to configuration settings, usually within the httpd.conf file. The data you have to analyze depends on which log files you enable in the configuration file or, in the case of the Apache server, at compile time. Several logs can be enabled in the configuration, including the access log, error log, referer log, and agent log. Each of these has information that you may need to summarize or analyze.
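
As an assumed example only, an NCSA-derived or Apache 1.x configuration might name its logs with directives along these lines; exactly which of these directives is available depends on the log modules compiled into your server:

ErrorLog    logs/error_log
TransferLog logs/access_log
RefererLog  logs/referer_log
AgentLog    logs/agent_log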

Logging Connections

There are some security and privacy issues related to logging too much information. Be sure to keep the appropriate permissions on your logfiles to prevent arbitrary snooping or parsing, and truncate them when you've completed the data gathering. See Chapter 3 for more details.

In general, the httpd log file is a text file whose records are lines terminated with the appropriate line terminator for the architecture under which the server is running. The individual records have fields containing dates, file paths, hostnames or IP numbers, and other items, usually separated by blank space. Ordinarily, there is one line or record per connection, but some types of transactions generate multiple lines in the log file(s). This should be considered when designing the algorithm and code that parses the log.
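
For instance, a single record in the common log format used by NCSA-derived servers can be pulled apart with one pattern. The sample line, field names, and pattern below are illustrative assumptions; the Logfile module described next does this bookkeeping for you:

# Split one access-log record into its fields (common log format assumed).
my $line = '192.0.2.10 - - [10/Oct/1996:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326';
if ($line =~ m{^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)}) {
    my ($host, $ident, $user, $date, $request, $status, $bytes) =
        ($1, $2, $3, $4, $5, $6, $7);
    print "$host requested \"$request\" -> status $status, $bytes bytes\n";
}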

The access log gives you general information regarding what site is connecting to your server and what files are being retrieved. The error log receives and records the output from the STDERR filehandle of all connections. Both of these, and especially the error log, may need to be parsed every now and then to see what's happening with your server's connections.

Parsing

Using the Logfile module, each discrete transaction record is parsed and abstracted into a Perl object. The instance variables created by the new() method depend on which type of log is being parsed and which fields (Hostname, Date, Path, and so on) from the log file you're interested in summarizing. When parsing is complete, the return value, a reference blessed into the Logfile class, holds a hash whose key/value pairs correspond to the parameters you want to gather statistics on and the number of times each one was counted. In the simplest case, you simply write these lines:

use Logfile::Apache;  # to parse the popular Apache server log

$l = new Logfile::Apache  File  => '/usr/local/etc/httpd/logs/access_log',
                            Group => [qw(Host Domain File)];

This parses your access log and returns the blessed reference.

Reporting and Summaries

After you've invoked the new() method for the Logfile class and passed in your log file to be parsed, you can invoke the report() method on the returned object.

$l->report(Group => File, Sort => Records, Top => 10);

The preceding line produces a report detailing the access counts of each of the top ten files retrieved from your archive and their percentages of the total number of retrievals. For the sample Apache access log file included with the Logfile distribution, the results from the report() method look like this:

File                          Records
=====================================
/mall/os                         5       35.71%
/mall/web                        3       21.43%
/~watkins                        3       21.43%
/cgi-bin/mall                    1        7.14%
/graphics/bos-area-map           1        7.14%
/~rsalz                          1        7.14%

You can generate many other reports with the Logfile module, including multiple-variable reports, to suit your needs and interests. See the Logfile documentation, embedded as POD in Logfile.pm, for additional information. You can get the Logfile module from the CPAN, from Ulrich Pfeifer's authors directory:

~/authors/id/ULPFR/

The latest release, as of the writing of this chapter, was 0.113. Have a look, and don't forget to give feedback to the author when you can.

Generating Graphical Data

After you've gotten your reports back from Logfile, you've pretty much exhausted the functionality of the module. In order to produce an image that illustrates the data, you'll need to resort to other means. Because the report gives essentially two-dimensional data, it'll be easy to produce a representative image using the GD module, which was previously introduced in Chapter 12, "Multimedia."

This example provides a module that uses the GD class and supplies one method, to which you pass a Logfile object along with some other parameters specifying which field from the log file you wish to graph, the resulting image size, and the font. This method actually would be better placed in the Logfile::Base class, because that's where each of the Logfile subclasses, including the one for Apache logfiles, derives its base methods. It will be submitted to the author of the Logfile module after some additional testing.

For now, just drop the GD_Logfile.pm file (from Listing 14.4) into the Logfile directory in your @INC. You'll also need to have the GD extension and the Logfile module installed, of course. The GD_Logfile module uses the GD package to produce a GIF image of the graph corresponding to data from the report() method from the Logfile class. The entire module, including the graph() subroutine, looks like Listing 14.4.

Listing 14.4. GD_Logfile.pm.

package Logfile::GD_Logfile;

=head1 NAME

GD_Logfile - add a graphing feature to the Logfile class, a single method to
allow bar graph generation of log data.  Uses the GD module to produce the
graph.

=head1 SYNOPSIS

    Logfile::GD_Logfile::graph($l,Group => File,
                           Sort => Records,
                           ImSize => [640,480],
                           Font => 'gdSmallFont'
                          );

Where $l is a Logfile object, ImSize is the output image size, and Font is
a font from the GD module. All other parameters to the Logfile::report()
method may be included, but only one List variable may be passed in.

=head1 AUTHOR

Bill Middleton - wjm@best.com

=cut

use GD;

sub graph{

my $self  = shift;
my %par = @_;
my $group = $self->group($par{Group});
my $sort  = $par{Sort} || $group;
my $font = $par{Font};
my $rever = (($sort =~ /Date|Hour/) xor $par{Reverse});
my $list  = $par{List};
my ($keys, $key, $val, %keys);
my $direction = ($rever)?'increasing':'decreasing';
my (@list, %absolute);
my (@sorted, $rec_total, $largest, $list_total);
my ($width, $ht, $color, $black, $white);
my ($im, $i, $inc);
my($top,$bottom,$left,$right);
my($color_inc,$title);
my($third,$fourth,$current);

# Instantiate a new GD image based on args or default

if(ref($par{ImSize})){
    $im = new GD::Image(@{$par{ImSize}});
    $right = $par{ImSize}->[0] - 30;
    $top = $par{ImSize}->[1] - 30 ;
    $left = $par{ImSize}->[0] / 2;
    $bottom = $par{ImSize}->[1] /10;
}
else{ # defaults to 640x480
    $im = new GD::Image(640,480); # default
    $right = 610;
    $top = 450;
    $left = 320;
    $bottom = 48;
}

# Set up a few basic colors and sizes

$width = $right - $left;
$ht = $top - $bottom ;
$white = $im->colorAllocate(255, 255, 255);
$black = $im->colorAllocate(0 , 0, 0);
$im->transparent($white);

# Graphs of this sort only make sense with single variable

if ($list) {
    if (ref($list)) {
        die "Sorry, graphs may have only one List variable\n"
    }
} else {
    $list = "Records";
}

# Sum things up

while (($key,$val) = each %{$self->{$group}}) {
    $keys{$key} = $val->{$sort};
    $rec_total+=1;
    $list_total += $val->{$list};
}
(defined $par{Top}) and $rec_total = $par{Top};

# Graph outline

$im->line($left,$top,$right,$top,$black);
$im->line($left,$top,$left,$bottom,$black);

# Graph Title

$title = "Percentages of $list by $group";
$im->string(gdLargeFont,$left,10,$title,$black);
$title = "Total $list = $list_total";
$im->string(gdMediumBoldFont,$left,
    ($top + $bottom/4),$title,$black);

# $i will be our color increment variable for grayscale

$i = 200;
$color_inc = 100 / $rec_total;
$top = $bottom + ($ht / $rec_total);

# A couple of text layout variables

$fourth= (($ht / $rec_total) / 4);
$third = (($ht / $rec_total) / 3);

# Main loop iterates over items, draws the text field and
# rectangle representing the percentage of total for each

for $key ( sort
    {&Logfile::Base::srt($rever, $keys{$a}, $keys{$b})}
        keys %keys){
    my $val = $self->{$group}->{$key};
    next unless defined($val);
    $color = $im->colorAllocate($i, $i, $i);
    if ($key =~ /$;/) {
        die "Sorry, graphs may have only one key\n";
    }
    $current = $top - $fourth * 3;
    $im->string(&{$font},10,$current,$key,$black);
    $title = sprintf("%s(%4.2f%%)",' ' x 5,
            ($val->{$list}/$list_total * 100));
    $current = $top - $third;
    $im->string(&{$font},10,$current,$title,$black);
    $right = $left + ($width * $val->{$list} / $list_total);
    $im->rectangle($left,$top,$right,$bottom,$black);
    $im->fill(($left+1),($bottom+1),$color);
    $bottom = $top;
    $top += $ht / $rec_total;
    last if defined $par{Top} && --$par{Top} <= 0;
    $i -= $color_inc;
}

# Dump the GIF to stdout

print $im->gif;

}
1;

Now you can produce a nice, two-dimensional graph of your log file data with Listing 14.5.

Listing 14.5. GD_Logfile.test.

#!/usr/bin/perl
require Logfile::Apache;
require Logfile::GD_Logfile;

$l = new Logfile::Apache File  => 'apache-ncsa.log',
                    Group => [qw(Domain File )];

Logfile::GD_Logfile::graph($l,Group => File,
                           Sort => Records,
                           ImSize => [640,480],
                           Font => 'gdSmallFont'
                          );

This is then invoked as

perl GD_Logfile.test > output.gif

Using the sample Apache access log file provided with the Logfile module, called apache-ncsa.log, the resulting graph is output to the output.gif file and looks like Figure 14.1.

That pretty much wraps up our discussion regarding HTTP log files. It's worth mentioning, however, that you also can use the statistics packages that are available, with some modifications, to summarize the data in a log file. Check your nearest CPAN archive and the Modules List for these statistics modules.

Figure 14.1.
Graphical representation of log file data.

Converting Existing Documentation to HTML

Another one of the many tasks that the Webmaster may be called upon to perform is to make existing documentation available via the Web. Of course, by configuring the server properly, you can specify a MIME type that invokes the appropriate application or plug-in on the client to view the document. Sometimes, though, it's preferable to actually convert the document into an HTML document and serve it up in the natural way. Using Perl, it's relatively easy to convert any given document to an HTML representation, as long as you know the general layout of the document and have a mapping from the document's internal layout to corresponding HTML elements.

General Issues

As we've stated, the general algorithm for converting a given document to HTML implies knowing something about the internals of the document. For ASCII text files, this is fairly easy to do. Binary formats require some additional investigation. Furthermore, some types of binary document formats are proprietary and may require that you obtain the structure from the owner or distributor in order to parse them properly. For the sake of brevity, we'll restrict our discussion to parsing and converting ordinary ASCII documents to HTML.

Mail Folders

The typical UNIX mail file is one example of a text file that has a known format and that's relatively easy to parse and convert to an HTML equivalent. The resulting HTML file may be quite useful as a Web document when it contains the messages sent to a public mailing list, for instance. There are numerous means available for making a mail file searchable, and some searching algorithms have already been discussed in this book; however, because we know the format of a mail file, we not only can search through it more easily, but we also can provide a hypertext view of the messages within. Converting a mail file to an HTML document provides the ability to browse all of the messages in the file in the order that they arrived, much like the standard MUA interface, or in a threaded order, according to subject.
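
To give you a taste of how approachable the format is before turning to the full-featured tools below, the following fragment walks a UNIX mbox file and emits a crude HTML index of its messages. The file name, the output markup, and the header handling are simplifying assumptions (it also trusts the usual ">From " quoting inside message bodies):

#!/usr/bin/perl
# mbox2index - print a very rough HTML list of the messages in one mbox file
use strict;
my $mbox = shift || 'list-archive.mbox';
open(MBOX, $mbox) or die "can't open $mbox: $!\n";
my ($n, $from, $subject, $in_header) = (0, '', '', 0);
print "<UL>\n";
while (<MBOX>) {
    if (/^From /) {                      # mbox message separator
        print "<LI>Message $n: $subject ($from)\n" if $n++;
        ($from, $subject, $in_header) = ('', '', 1);
    } elsif ($in_header) {
        $in_header = 0 if /^$/;          # a blank line ends the header
        $from    = $1 if /^From:\s*(.*)/;
        $subject = $1 if /^Subject:\s*(.*)/;
    }
}
print "<LI>Message $n: $subject ($from)\n" if $n;
print "</UL>\n";
close(MBOX);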

The MailArchive tool is available at any CPAN site. It provides a script to process mail folders and store them in a way that can then be accessed through an index.html, which it creates as it parses the mail file. It also provides a search library to allow keyword searches on the archive. The current version, as of this chapter, is 1.9, and its location in the CPAN is

~/authors/id/LFINI/MailArch-1.9.tar.gz.

Another Perl tool for producing HTML from mail files and folders is called MHonArc, written by Earl Hood. It provides a powerful set of features for indexing, searching, and marking up HTML produced from mail folders. Get all the information about obtaining and using MHonArc from http://www.oac.uci.edu/indiv/ehood/mhonarc.html.



NOTE:

The author of MHonArc, Earl Hood, has done extensive work on the topic of SGML DTD conversion to HTML, as well. While that topic is beyond the scope of this text (and this author :-), Mr. Hood's perlSGML package is also available through the CPAN and at the preceding site.


Simple ASCII Text Documents

Ordinary ASCII text using regular paragraphs is pretty simple to convert to an HTML representation, and for most text documents the conversion is usually preferable in appearance to the raw text in the client browser. While there doesn't seem to be any viable module for performing this task, there is a nice, well-maintained script called txt2html. It uses a simple mapping file, adapted from another Web tools package, to properly extract and format text-form URLs into standard HTML anchors. txt2html also allows you to configure certain parameters in order to control the general format of the output and the decision-making process regarding things that may have multiple meanings within the text document. Use your favorite search engine to find txt2html; it's currently archived at http://www.cs.wustl.edu/~seth/txt2html but may be on the CPAN at some point in the future.

Source Code and Manpages

There are a large number of converters for turning C/C++ source code and UNIX manpages into HTML. There are, in fact, so many that it may not be a good idea to try to provide an example here. At this time, there is not a Perl5 module designed to convert these various formats to HTML. Thus, until such time as one is formally developed, I'll defer to the definitive list of conversion tools available at http://www.w3.org/pub/WWW/Tools/.

Internationalized Content

There are a number of specific elements in the HTML spec that map to the international character set. The idea, then, is to parse a document that contains these characters and replace the instances of the high-bit characters with the corresponding HTML elements. Using the libwww modules, this is now quite simple; you've already seen an example of the general parse-and-rewrite approach in the earlier listing for the relativize script (see Listing 14.3).
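
As a bare-bones illustration of the substitution itself, the following filter maps a few Latin-1 characters to named entities and falls back to numeric character references for anything else above 0x7F. The character list is a small, assumed sample rather than a complete table:

#!/usr/bin/perl
# latin1-to-entities - replace high-bit Latin-1 characters with HTML entities
my %entity = (
    "\xE9" => '&eacute;',
    "\xE8" => '&egrave;',
    "\xF1" => '&ntilde;',
    "\xFC" => '&uuml;',
);
while (<>) {
    s/([\x80-\xFF])/$entity{$1} || sprintf("&#%d;", ord($1))/ge;
    print;
}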

Converting HTML to Other Formats

The libwww modules provide you with the capability to convert your HTML to straight text and PostScript. See the POD documentation for HTML::Formatter for more details.
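
For example, a minimal text conversion can be put together with HTML::TreeBuilder and HTML::FormatText. The file name and margins below are assumptions; you might wrap something like this in a loop to generate a plain-text index for each directory, as suggested later in this chapter:

use HTML::TreeBuilder;
use HTML::FormatText;

# Parse the HTML, then render it as plain text.
my $tree = HTML::TreeBuilder->new;
$tree->parse_file('index.html');
my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 72);
print $formatter->format($tree);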

There are a number of other tools available for converting to and from a large number of formats not discussed here; again, these tools are available at http://www.w3.org/pub/WWW/Tools/.

Making Existing Archives Available via HTTP

As I mentioned early in this chapter, the process of sharing documents has evolved along with the Internet. Earlier tools and protocols, especially FTP, are still in wide use, especially for larger archive file formats such as .tar.gz or .zip files.



CAUTION:

Serving the same archive from an HTTP server and FTP server simultaneously can give the unsavory individual an easy opening to break into your system. Read carefully the considerations and issues discussed in Chapter 3 before doing this. In particular, if you have an upload area for FTP clients, make absolutely sure that the HTTP server can't get to it, or at least can't read anything, and especially not execute anything as CGI or SSI in that directory.


After you've made things secure, probably the most important aspect to consider when serving your documents via a means other than HTTP is the naming convention you use for them. You need to keep the appropriate extension for MIME, of course, but FTP clients tend to rely more on filenames that are descriptive of content. Another nice thing to do is to provide a simple text representation of the index.html in each directory of your hierarchy so that FTP clients can retrieve a description that isn't marked up as HTML.

Many sites prefer not to make their documentation available via any means other than the WWW, and this is certainly okay, but providing the means to obtain your documents via other protocols could increase the rate of their distribution. If this is desirable, then when you set things up, you should consider the users of these other protocols and their limited capability to browse your archive.

Summary

We've covered a lot of ground in this chapter, and I hope I've given you, the Webmaster, a better feel for the many other duties you must perform and how to handle them using Perl. We've covered some of the most important issues arising out of configuration management. We've also covered some of the most common tasks and projects the Webmaster may have to perform.

Some important topics I've covered here have been planning and laying out an archive, revision control and a policy for change, parsing and verifying HTML with the libwww modules, analyzing HTTP log files with the Logfile module, converting existing documentation to HTML, and making your archive available via protocols other than HTTP.

Again, I stress that this chapter is not comprehensive regarding the additional duties that the Webmaster must perform or their solutions.