Chapter 11

Implementing Keyword Searches


Bigger isn't necessarily better. One of the latest crazes in the retail market is the "mega mall" or "super store": take a regular shopping mall (or normal retail store) and multiply its size several times over. Your purchase choices increase dramatically, assuming that you can find what you're looking for. This is why it's common to see a sizable crowd huddled around the information kiosks scattered throughout these massive establishments. If it weren't for a map, index, or guidebook of some sort, nobody could find anything, and most shoppers probably wouldn't come back.

Web sites aren't much different. As a site starts to grow, it becomes increasingly difficult to locate just what you surfed in for. While shopping malls have maps, Web sites have search engines: central query pages where you can ask for directions.

How Much Is Enough?

As a Web site grows in size and complexity, it reaches a point where a means of easily locating information within the myriad pages is not only nice, it's necessary. That holds whether the search is for your own benefit as the site's administrator (even the best administrators can get lost in their own documents) or for the benefit of the general Web community (if you're running an online store, for example). The first step in preparing your site for indexing is to determine which parts of your site you want to include in the search.

For example, indexing graphics may not be necessary for your site if the only purpose they serve is a pleasing visual presentation. On the other hand, if you are maintaining a large catalog of items, each with its own graphic, adding the images to the index may be beneficial. A general rule of thumb: let your content decide your index.

In order to deliver the results of a query as quickly as possible, indexing programs rely heavily on an index file that contains all the important associations between various keywords and the documents that make up your site. Because these index files are generated by programs that search the entire contents of a Web site, you must also decide how often you need to generate an updated index file. For those managing a handful of pages, updating the index each time you make changes is feasible.
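To make the idea of an index file concrete, here is a minimal Python sketch of the keyword-to-document associations such a file holds. It is an illustration only, not how freeWAIS or SWISH store their data; the function names and the sample pages are assumptions made up for the example.

```python
import re
from collections import defaultdict

def build_index(documents):
    """Map each lowercase keyword to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index

def search(index, query):
    """Return the documents that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Hypothetical site content, keyed by file name.
pages = {
    "index.html": "Welcome to the mall directory",
    "shoes.html": "Running shoes and hiking boots",
    "map.html": "Mall map and store directory",
}
index = build_index(pages)
print(sorted(search(index, "mall directory")))  # ['index.html', 'map.html']
```

Because the index is built once and only consulted at query time, a search never has to reread the pages themselves. This is also why the index must be regenerated whenever the pages change.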

However, if you manage a large site or a server that multi-hosts several different sites whose pages are in constant flux from various sources, you should consider automating your indexing process. If indexing runs regularly (say, once a week) at a time when you know site activity is minimal, the processing time used by the indexing software won't slow down the server's response for your users.
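An automated job need not rebuild the index blindly. As a rough sketch, the Python function below reports whether any file under a document tree has changed since the index file was last written, so a scheduled task could skip the rebuild when nothing is new. The function name and both of its arguments are assumptions for illustration; none of the indexing packages discussed here supplies it.

```python
import os

def needs_reindex(doc_root, index_file):
    """Return True if the index is missing or older than any file under doc_root."""
    if not os.path.exists(index_file):
        return True
    index_time = os.path.getmtime(index_file)
    for folder, _subdirs, files in os.walk(doc_root):
        for name in files:
            try:
                changed = os.path.getmtime(os.path.join(folder, name))
            except OSError:
                continue  # skip dangling links and files removed mid-scan
            if changed > index_time:
                return True
    return False
```

A weekly scheduled job could call needs_reindex() first and start the indexing program only when it returns True.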

Finally, you need to take a look at how much material you wish to include in the index, and especially at the size of the resulting index file. Depending on the indexing software and its settings, an index file can approach or even exceed the size of the data it indexes. If your site is rather large, it may be beneficial to break the index up into smaller indexes, with each covering a different directory tree, for instance. Even so, it's still necessary to build a master index or a collection of related indexes covering every possible term that your users could search for within your pages.

There are several different ways to decide what's worth indexing and what's worth ignoring.

The bottom line is that it's ultimately up to the site administrator (you) to decide what and what not to index. If all else fails, you can always fall back on the general rule: If you have the space and the processor time, index everything and index often.

Because indexing is such a common feature on the Web, there are already several freeware, shareware, or commercial programs out there that do the bulk of the work for you. Rather than diving deeply into the heuristics of building search indexes, the following pages take a look at three of the more popular index managers: freeWAIS, SWISH, and WWWWAIS.

freeWAIS

The WAIS (Wide Area Information Servers) project is an experiment in automating the search and retrieval of many types of electronic information over wide area networks, such as the Internet. freeWAIS, developed by the Clearinghouse for Networked Information Discovery and Retrieval (http://www.cnidr.org/), is a simple implementation of the WAIS database system.

Like most software for UNIX platforms, freeWAIS must be compiled before you can use it on your system (you'll find a copy of the distribution on the companion CD-ROM). Once installed, it's ready to go as soon as you've finished determining which directories you want to include in the index.

One simple method of setting up your directories for indexing is to create a directory containing symbolic links for all the directories you want to index. Once the associated links have been created, you can easily index all or part of the directory containing them, without having to reconfigure the Web site itself.
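The directory of symbolic links can be built by a short script rather than by hand. The following Python sketch shows one way to do it on a UNIX system; the function name build_pointer_dir and the paths in the usage comment are illustrative assumptions, not part of freeWAIS.

```python
import os

def build_pointer_dir(pointer_dir, targets):
    """Create pointer_dir and fill it with one symbolic link per target directory."""
    os.makedirs(pointer_dir, exist_ok=True)
    for target in targets:
        # Name each link after the last component of the directory it points to.
        link = os.path.join(pointer_dir, os.path.basename(os.path.normpath(target)))
        if os.path.islink(link):
            os.remove(link)  # replace a stale link left by an earlier run
        os.symlink(os.path.abspath(target), link)
    return sorted(os.listdir(pointer_dir))

# Hypothetical usage:
# build_pointer_dir("/wais/pointers", ["/www/docs", "/www/catalog"])
```

To drop a directory from the index later, you delete its link and re-run the indexer; the Web site's own layout stays untouched.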

The actual indexing process is handled by waisindex, which is located in the same directory as your other WAIS tools. To generate the index, you can use a command line similar to:

/wais/waisindex -l 1 -e /dev/null -d /wais/WEB/pointers -r -mem 6 /wais/pointers

which runs waisindex with the following parameters:

-l 1: Controls the verbosity of the created log (you may wish to experiment with this to get a better feel for how much information is generated).
-e /dev/null: Routes any error messages off to the trash.
-d /wais/WEB/pointers: Names the index (database) files that waisindex creates.
-r: Forces subdirectory recursion, allowing waisindex to search entire trees.
-mem 6: Limits the amount of memory (in megabytes) that waisindex uses during indexing.
/wais/pointers: Specifies the directory in which the indexing process starts.

After the index file is created, you can test the search capabilities using waisq. For example:

/wais/waisq -m 100 -c /wais/ -f - -S users.src -g Scott Walter

searches the index users.src for the home page of Scott Walter. The output of waisq reports how many matches were found and lists the matches themselves. This is the information that you'll need to evaluate when writing the CGI scripts that display the results of a user's search.

SWISH

Another commonly used indexing tool is SWISH, which you'll also find on the companion CD-ROM. Unlike WAIS systems, SWISH (Simple Web Indexing System for Humans) is, as the name implies, simple to set up and use. It's an ideal first choice for creating indexes of small sites, especially if all you want is a search engine for Web-related data; WAIS, by contrast, can index almost anything.

Because SWISH is designed as a simple implementation of indexing, it's not necessarily for everybody. If you're interested in implementing a more robust index interface, you'd probably be better off with WAIS. However, if you want "quick and dirty," then SWISH is an excellent choice.

Using CGIs to Format the Output

Once you create your indexes, process a query, and get back the search results, you need to turn the returned data into a form that the Web server can pass along to the user's browser. One of the best programs available for this is WWWWAIS, an elegant program written in C that serves as a "gatekeeper" between the Web and a WAIS or SWISH system.

To use WWWWAIS, you first configure the program to support the WAIS database you indexed (created) earlier. This usually involves changing one line in the source code to point to the location of an external configuration file, then compiling. Once compiled, any further configuration can be made by editing the configuration file.

That one line defines the location of the file that WWWWAIS uses for all of its on-the-fly configuration. Once the program is built, you should not need to edit the source code again; all changes to your WWW environment can be handled gracefully by editing the "wwwwais.conf" file that you named in the program.

Customizing the Search Results

Once a search is complete, the final task is to handle the resulting output. If you don't have the time or patience to write a Perl script to customize the output of a waisq or SWISH search, WWWWAIS offers a perfect solution: a built-in forms generator capable of recursively calling itself. Using its own output-formatting capabilities saves you the time necessary to write (and debug) your own.

The resulting output is somewhat generic, though. If you need a particular "look," perhaps because of corporate specifications or simply because you want your own style, you have little alternative but to roll your own script. Fortunately, if you look at the default data format presented by the search engine, you will probably discover that "tweaking" it to suit your needs is not very difficult.
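As a starting point for a script of your own, the Python sketch below turns a list of search hits into a simple HTML results page. It assumes the raw output of the search program has already been parsed into (title, URL, score) tuples; that tuple layout, the function name render_results, and the sample hit are all assumptions for illustration, not the format that waisq, SWISH, or WWWWAIS actually emits.

```python
from html import escape

def render_results(query, hits):
    """Build an HTML fragment listing (title, url, score) hits, best score first."""
    parts = ["<h2>Results for: {}</h2>".format(escape(query))]
    if not hits:
        parts.append("<p>No matching documents were found.</p>")
        return "\n".join(parts)
    parts.append("<ol>")
    for title, url, score in sorted(hits, key=lambda hit: hit[2], reverse=True):
        # Escape everything that came from the user or the index before
        # placing it in the page.
        parts.append('<li><a href="{}">{}</a> (score {})</li>'.format(
            escape(url, quote=True), escape(title), score))
    parts.append("</ol>")
    return "\n".join(parts)

# A CGI script sends a header, a blank line, and then the page body.
print("Content-Type: text/html\n")
print(render_results("Scott Walter", [("Example home page", "/users/example.html", 1000)]))
```

Keeping the formatting in one function means a change of "look" touches only the HTML strings, not the code that runs the search.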

Not Running UNIX?

While the programs described so far are UNIX-based, Webmasters running servers on other operating systems can still implement indexes within their sites.

Macintosh

For the Macintosh, Global HTTP Contents 1.0, written in AppleScript, generates an HTML-formatted document containing a hierarchical list of a folder's contents and all its subfolders. Here's where you can check it out:

http://arpp1.carleton.ca/machttp/util/global/

Windows

More and more servers are running under Windows, whether it's NT, 95, or even 3.1. For Web sites working within Microsoft's operating systems, one Web server in particular has indexing already built in: WebSite from O'Reilly and Associates (http://website.ora.com/). WebSite also provides an easy-to-use, graphical interface for maintaining the various facets of the site.

For those looking for an index engine to implement in an existing site, Excite has versions available for Windows, which you'll find on the companion CD-ROM. By design, Excite is easy to install, easy to maintain, and very fast when it comes to query processing.

Microsoft has also joined the index engine world with the release of its Index Server (code-named Tripoli) for Windows NT Server 4.0 (http://www.microsoft.com/ntserver/). Because of the tight coupling between the Index Server and the operating system itself, Tripoli is relatively simple to configure and quickly indexes the entire contents of your site. Visitors to your site can also take advantage of the built-in multilingual features and search your site in one of seven different languages.

From Here…

This chapter was a brief overview of the tools available to add indexing to your site. If you want to explore the process of indexing and its uses in greater detail, you may want to check out: