Chapter 15

Search Engines and Annotation Systems


The previous chapters in this book have taught us the virtue of using the World Wide Web paradigm for serving internal documents on the corporate network, and how to piece together the software and systems that form the pieces of the whole jigsaw puzzle. The chapters that follow show how to use the tools acquired in the first part of the book to build several useful applications that help the Web serve the Intranet better.

In this chapter, you learn:

Searching Simple Databases

In this section, we will build a sample search script that can search through a company's phone list and retrieve the requested information based on a query. Although this is considered a simple database application, it differs from what is normally thought of as a database because users can only view, not enter, information. Creating Web database applications that can modify, add to, and delete from databases is covered in a later chapter; this chapter is concerned with search-and-retrieve applications that serve as guide maps around the vast Web. These applications provide a simple form-query-result paradigm for navigating the Web.

Even though many types of data in an organization are maintained centrally, they often need to be made available to hundreds or even thousands of users, either internally or externally. Examples of this type of data include a company phone and address book, a product catalog that maps product numbers to titles, or a list of regional sales offices and contacts. All of these types of information can be stored in a relational database, but there's really no need for anything more than a simple text file. If the goal is to make information available quickly and easily, a simple Web search routine can achieve the desired result without all the maintenance headaches associated with a relational database.

NOTE
The examples in this chapter are written using Perl. Since Perl is an open scripting language, these examples are not limited to any particular Operating System or Web Server.

Example Search Scenario

In our example, we will consider a simple text file containing the names and phone numbers of employees in a fictitious company called ABC Inc. It is not uncommon for companies to store employee names and phone numbers electronically. In a typical scenario, the Human Resources department would print out the document and hand copies to all the employees. ABC Inc., however, is on the cutting edge of technology: it has deployed an Intranet and intends to post the document on the local Web for its employees' benefit. It is up to us to implement search functionality on the Intranet so that a user can look up a specific employee's phone number by name.

Grepping for data

At the heart of the search is the grep command, which simply looks for pattern matches in a file. One of the benefits of this approach is that the text file need not be in any certain format. Grep just reads each line of the file for a match; it doesn't care how many columns there are or what characters are used to separate fields. Consequently, the phone book script can be used to search any text file database.

TIP
Grep is a native UNIX command. The Windows NT version of grep (grep.exe) is included with the Windows NT Resource Kit.


Listing 15.1  A simple search program to sift through a phone list
# search.pl

# Define the location of the database
$DATABASE="\\web_root\\cgi-bin\\phone.txt";

# Define the path to cgiparse
$CGIPATH="\\web_root\\cgi-bin";
# Convert form data to variables
eval `$CGIPATH\\test\\cgiparse -form -prefix \$`;

# Determine the age of the database
$mod_date=int(-M $DATABASE);	

#Display the age of the database and generate the search form
print <<EOM;
Content-type: text/html

<TITLE>Database Search</TITLE>
<BODY>
<H1>Database Search</H1>
The database was updated $mod_date days ago.<p>	
<FORM ACTION="/cgi-bin/search.pl" METHOD="POST">
Search for: <INPUT TYPE="TEXT" NAME="QUERY"> 
<INPUT TYPE="SUBMIT" VALUE="SEARCH">
</FORM>
<p><hr><p>
EOM

# Do the search only if a query was entered
if (length($QUERY) > 0) {
  print <<EOM;
Search for <B>$QUERY</B> yields these entries:
<PRE>
EOM

# Run the search; inform the user if it is unsuccessful
$answer = `grep -i $QUERY $DATABASE`;
if (!$answer) { print "Search was unsuccessful\n"; }
else { print "$answer\n"; }

print <<EOM;
</PRE>
</BODY>
EOM
}

Figure 15.1 : This generalized database search form is used with the search script to search any text file database.

NOTE
Though the above script assumes a Windows-based Web server, it can be generalized to suit any operating system. When implementing it on a UNIX system, the data path has to be modified to replace "\\" with "/".

To use the script for data other than the phone book, simply change the name and location of the text file containing the desired information. Because the script uses the generic grep command, it can be used with almost any text file for any purpose. This script utilizes the cgiparse program to parse the data sent to it. This utility is freely available via anonymous ftp from ftp.ncsa.uiuc.edu.

TIP
You can make searches case-sensitive by removing the -i option from the grep command.

Generating Text Files from Databases

To take advantage of the simple search routine above, you must have some text file data to start with. If your data is currently in another format, such as a proprietary database, you must first convert it to an ASCII text file. You can easily create the necessary text file by exporting the data from the native format to ASCII text. Almost all databases include the capability to export to text files.

TIP
For easiest use of the search script, export data so that there is exactly one record per line. This produces the neatest output from the script.

After the text file has been created, you simply need to specify its path in the search script.
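For example, a hypothetical exported phone.txt with exactly one record per line might look like this (the names and numbers are invented purely for illustration):

Anderson, Mary     Marketing      515-555-0134
Baker, John        Engineering    515-555-0187
Carlson, Susan     Accounting     515-555-0152

A search for "Carlson" simply returns the matching line, regardless of how the columns happen to be laid out.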

Choosing Between Several Databases

With a few simple modifications, you can use the script generically to search any one of several databases that have different paths. This is done most efficiently in one of two ways. You can allow the database to be chosen by selecting one of several hyperlinks, in which case extra path information in the URL specifies the database. Alternatively, you can allow the user to choose which database to search in a fill-in form.

Choosing via Hyperlinks

Suppose you want users to be able to choose between several different divisional phone books. One way to do this is to include a pre-search page on which the user selects the database by clicking the appropriate hyperlink. Each link calls the same database search script, but each includes extra path information containing the path to the database. The following HTML demonstrates how the hyperlinks are constructed.

<H2>Company Phonebooks</H2>
<A HREF="/cgi-bin/search.pl/db/IAphone.txt">Iowa Locations</A>
<A HREF="/cgi-bin/search.pl/db/CAphone.txt">California Locations<A>
<A HREF="/cgi-bin/search.pl/db/KSphone.txt">Kansas Locations</A>

The name of the search script in this example is /cgi-bin/search.pl and the databases are named "/db/IAphone.txt," and so on. The search script itself needs to be modified to use the extra path information.

First, the name of the database to search is specified in the extra path information rather than hard-coded into the script. Therefore, the line at the top of the script, which specifies the path to the data, needs to read the extra path information. This is done by reading the PATH_INFO environment variable. In Perl, the syntax for this is:

$DATABASE=$ENV{"PATH_INFO"};

Second, the ACTION attribute of the form, which is generated inside the script, needs to specify the path to the database. This way, after the user performs the initial query, the correct database is still in use. This is accomplished by changing the <FORM ACTION...> line to:

<FORM ACTION="/cgi-bin/search.pl$DATABASE">
NOTE
No slash (/) is necessary to separate the script name (/cgi-bin/search) from the extra path information because $DATABASE already begins with a slash.

These are the two modifications necessary to implement choosing a database via hyperlinks. The hyperlinks to other databases are now included in the search form. The resulting form is shown in figure 15.2. The complete modified script code is included below. Only new or changed lines have been commented.

Figure 15.2 : This method uses hyperlinks to select a new search database.


Listing 15.2  Choosing databases using URLs
# search2.pl

# Get database name from extra path info.
$DATABASE=$ENV{"PATH_INFO"};	

$CGIPATH="\\web_root\\cgi-bin";
eval `$CGIPATH\\test\\cgiparse -form -prefix \$`;

$mod_date=int(-M $DATABASE);

# Show the current database and list other available databases.
# The <FORM ACTION ...> line now includes the database name as extra path info.
print <<EOM;
Content-type: text/html

<TITLE>Database Search</TITLE>
<BODY>
<H1>Database Search</H1>
Current database is $DATABASE.
It was updated $mod_date days ago.<P>
You can change to one of the following databases at any time:<P>
<A HREF="/cgi-bin/search2.pl/db/IAphone.txt">Iowa Locations</A><BR>
<A HREF="/cgi-bin/search2.pl/db/CAphone.txt">California Locations</A><BR>
<A HREF="/cgi-bin/search2.pl/db/KSphone.txt">Kansas Locations</A><P>
<FORM ACTION="/cgi-bin/search2.pl$DATABASE" METHOD="POST">
Search for: <INPUT TYPE="TEXT" NAME="QUERY">
<INPUT TYPE="SUBMIT" VALUE=" Search ">
</FORM>
<p><hr><p>
EOM

if (length($QUERY) > 0) {
  print <<EOM;
Search for <B>$QUERY</B> yields these entries:
<PRE>
EOM

$answer = `grep -i $QUERY $DATABASE`;
if (!$answer) { print "Search was unsuccessful\n"; }
else { print "$answer\n"; }

print <<EOM;
</PRE>
</BODY>
EOM
}

Choosing via a Form

Depending on the application, it may be more convenient for users to choose their database via a form rather than via hyperlinks. The initial form uses radio buttons to choose the desired database, and after that the chosen database is active for all searches. Figure 15.3 shows the initial form used to select the database. The form code is included below.

Figure 15.3 : In this form, you select the search database and then proceed to the search form.


Listing 15.3  Choosing search database via a form
<TITLE>Database Search</TITLE>
<BODY>
<H1>Database Search</H1>
Choose your database from the list below:<P>
<FORM ACTION="/cgi-bin/search3.pl" METHOD="POST">
<INPUT TYPE="RADIO" NAME="DATABASE" VALUE="/db/IAphone.txt" ÂCHECKED>Iowa Locations<BR>
<INPUT TYPE="RADIO" NAME="DATABASE" VALUE="/db/ÂCAphone.txt">California Locations<BR>
<INPUT TYPE="RADIO" NAME="DATABASE" VALUE="/db/KSphone.txt">Kansas ÂLocations<P>
<INPUT TYPE="SUBMIT" VALUE=" Submit ">
</FORM>
<p><hr><p>

The initial selection form passes the path of the chosen database in the input field named "DATABASE," so only two modifications are necessary to the original search script that receives this information. First, the path to the database is now read from the initial selection form, so a separate line defining $DATABASE is no longer necessary. Second, the search form must have a way to keep track of the current database. This is conveniently accomplished by including a hidden input field in the search form named "DATABASE." This way, whether the search form is called from itself or from the initial selection form, it always knows the path to the correct database. The code for the search script is included below. Only the new or changed lines are commented. The resulting search form appears in figure 15.4.

Figure 15.4 : Once the search database is selected in a separate form, this form is used to perform the search.


Listing 15.4  Passing database name via hidden form fields
# search3.pl

$CGIPATH="\\web_root\\cgi-bin";
eval `$CGIPATH\\test\\cgiparse -form -prefix \$`;
# $DATABASE is now defined as a form variable

$mod_date=int(-M $DATABASE);

# A hidden field <INPUT TYPE="HIDDEN" NAME="DATABASE" ...> stores the database path.
print <<EOM;
Content-type: text/html

<TITLE>Database Search</TITLE>
<BODY>
<H1>Database Search</H1>
The current database is $DATABASE.
The database was updated $mod_date days ago.<p>
<FORM ACTION="/cgi-bin/search3.pl" METHOD="POST">
<INPUT TYPE="HIDDEN" NAME="DATABASE" VALUE="$DATABASE">
Search for: <INPUT TYPE="TEXT" NAME="QUERY"> 
<INPUT TYPE="SUBMIT" VALUE=" Search ">
</FORM>
<p><hr><p>
EOM

if (length($QUERY) > 0) {
  print <<EOM;
Search for <B>$QUERY</B> yields these entries:
<PRE>
EOM

$answer = `grep -i $QUERY $DATABASE`;
if (!$answer) { print "Search was unsuccessful\n"; }
else { print "$answer\n"; }

print <<EOM;
</PRE>
</BODY>
EOM
}

Searching Multiple Files and Directories

The previous examples searched only one file at a time. However, grep is flexible enough to search multiple files and directories simultaneously.

Searching Multiple Files

In the previous example, the user was allowed to choose between several different phone directories. The script is easily modified to search multiple files simultaneously. Instead of specifying one file in the $DATABASE variable, specify a wildcard path covering the directory that contains the phone text files (\db). So, the line beginning $DATABASE= in the original script (search.pl) changes to:

$DATABASE="\\db\\*.txt";

The grep command now searches for the desired information in all files in the \db directory that correspond to the wildcard pattern specified.

Searching Multiple Directories

Taking it a step further, the grep command can also accept multiple files in different directories. For example, you can specify the following database files:

$DATABASE="\\db\\phone*.txt \\db2\\address*.txt" 

Now, the grep command searches all .TXT files in the \db directory beginning with phone and all .TXT files in the \db2 directory beginning with address.

Accommodating Form-less Browsers

Although most Web browsers today have forms capability, not all do. To allow these browsers to search for information, it's common to offer an alphabetical or numerical index of data as an alternative to entering a form-based query. Typically, you create a hyperlink for each letter of the alphabet and specify an URL for each hyperlink that performs the appropriate search. For example, in a phone book listing where last names are listed first, you could search for capital C's at the beginning of a line to get a listing of all last names beginning with C. To create a hypertext index that can submit this type of search automatically, write:


Listing 15.5  Breaking down databases alphabetically
<H1>Phone Book Index</H1>
Click on a letter to see last names beginning with that letter.<P>
<A HREF="/cgi-bin/search?A">%26A</A>
<A HREF="/cgi-bin/search?B">%2lb</C>
...
<A HREF="/cgi-bin/search?Z">%26Z</Z>

NOTE
The queries in this example begin with the caret (%5E = "^") to force grep to look for the specified character at the beginning of a line.

Searching an Entire Web Server

So far, we have only looked at searching collections of simple text files. This is fine, as long as users are expected to search through specific files only. A good implementation of a Web server, however, is one that includes the capability to search for words anywhere on the server, including plain text and HTML files. It's theoretically possible to simply grep all HTML and TXT files under the document root (and other aliased directories), but this can be very time-consuming if more than a handful of documents are present.

The solution to the problem of searching a large Web server is similar to that used by other types of databases. We maintain a compact index that summarizes the information present in the Web server's content area. As data is added to the database, we just keep updating the index file. The usual method of maintaining the integrity of the index file is to run a nightly (or more frequent) indexing program that generates a full-text index of the entire server in a more compact format than the data itself.

Indexing with ICE

A popular indexing and searching solution on the Web is ICE, written in Perl by Christian Neuss in Germany. It's freely available on the Internet from http://www.informatik.th-darmstadt.de/~neuss/ and is included on the Webmaster CD. In the discussion that follows, we cover ICE, how it works, and how it can be modified to include even more features. By default, ICE includes the following features:

ICE presents results in a convenient hypertext format. It displays results using both document titles (as specified by HTML <TITLE> tags) and physical file names. Search results are scored, or weighted, based on the number of occurrences of the search word, or words, inside documents.

NOTE
Since ICE is written completely in the Perl programming language, the software works under UNIX as well as under MacOS and Windows.

The ICE Index Builder

The heart of ICE is a Perl program that reads every file on the Web server and constructs a full-text index. The index builder, "ice-idx.pl" in the default distribution, has a simple method of operation. The server administrator specifies the locations and extensions (TXT, HTML, etc.) of files to be indexed. When we run ice-idx.pl, it reads every file in the specified directories and stores the index information in one large index file (by default, index.idx). The words in each file are alphabetized and counted for use in scoring the search results when a search is made. The format of the index file is simple:


Listing 15.6  Format of ICE Index file
@ffilename
@ttitle
word1 count1
word2 count2
word3 count3
...
@ffilename
@ttitle
word1 count1
...
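Conceptually, the word-counting loop at the heart of the index builder resembles the following sketch. This is illustrative Perl rather than the actual ice-idx.pl code; the DOC and IDX file handles and the $filename and $title variables are assumptions made for the example.

# For each file to be indexed: strip markup, count words, then write
# one @f/@t header followed by the alphabetized word counts.
%count = ();                              # word counts for this file
open(DOC, $filename) || die "Cannot open $filename: $!";
while (<DOC>) {
    s/<[^>]*>//g;                         # ignore HTML formatting tags
    foreach $word (split(/[^A-Za-z]+/)) {
        next if $word eq "";              # digits and punctuation act as separators
        $count{lc($word)}++;
    }
}
close(DOC);
print IDX "\@f$filename\n\@t$title\n";
foreach $word (sort keys %count) {
    print IDX "$word $count{$word}\n";
}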

Running the Index Builder

The index builder is run nightly, or at some other regular interval, so that search results are always based on updated information. Normally, ICE indexes the entire contents of directories specified by the administrator, but it can be modified to index only new or modified files, as determined by the last modification dates on files. This saves a little time, although ICE zips right along as it is. On a fast machine, ICE can index 2-5M of files in under 15 seconds, depending on the nature of the files. Assuming an average HTML file size of 10K, that's 200-500 separate documents.

Windows NT users can use the native at command to schedule the indexing utility. UNIX users can use cron for scheduling.
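As an illustration, the nightly run might be scheduled as shown below; the 2:17 a.m. time and the paths to Perl and to ice-idx.pl are assumptions to be adapted to your own installation.

Windows NT:
at 02:17 /every:M,T,W,Th,F,S,Su "perl c:\ice\ice-idx.pl"

UNIX crontab entry:
17 2 * * * /usr/bin/perl /usr/local/ice/ice-idx.pl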

NOTE
It's often a good idea to schedule at or cron jobs at odd times because many other jobs run on the hour by necessity or convention. Running jobs on the hour that don't have to be run this way increases the load on the machine unnecessarily.

TIP
The Windows NT scheduler service has to be running in order to schedule jobs using the at command.

Space Considerations

Searching an index file is much faster than using grep or a similar utility to search an entire Web server; however, there is a definite space/performance tradeoff. Because ICE stores the contents of every document in the index file, the index file could theoretically grow as large as the sum of all the files indexed. The actual "compression" ratio is closer to 2:1 for HTML because ICE ignores HTML formatting tags, numbers, and special characters. In addition, typical documents use many words multiple times, but ICE stores them only once, along with a word count.
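To put the ratio in perspective, under this rough 2:1 estimate a server holding 40M of HTML pages would need on the order of 20M of additional disk space for its index file.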

NOTE
When planning your Web server, be sure to include enough space for index files if you plan to offer full-featured searching.

The Search Engine

The HTML that produces the ICE search form is actually generated from within a script (ice-form.pl), which calls the main search engine (ice.pl) to do most of the work. The search engine simply reads the index file previously generated by the index builder. As it reads consecutively through the file, it outputs the names and titles of all documents containing the search word or words. Both the search form and the search engine can be modified to produce output in any format desired by editing the Perl code.
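The search pass itself can be pictured with a short sketch like the one below, which scans the index format shown in Listing 15.6. Again, this is illustrative Perl rather than the actual ice.pl code, and the $INDEXFILE and $QUERY variables are assumptions.

# Scan the index, remembering the current file and title, and report
# every document whose word list contains the query word.
open(IDX, $INDEXFILE) || die "Cannot open $INDEXFILE: $!";
while (<IDX>) {
    chop;
    if (/^\@f(.*)/) { $file  = $1; next; }
    if (/^\@t(.*)/) { $title = $1; next; }
    ($word, $count) = split(' ');
    if ($word eq lc($QUERY)) {
        print "<A HREF=\"$file\">$title</A> (score: $count)<BR>\n";
    }
}
close(IDX);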

Tips and Tricks

The ICE search engine is powerful and useful by itself. There's always room for improvement, however. This section discusses several modifications you can make to ICE to implement additional features.

Directory Context

A very useful feature of ICE is the ability to specify an optional directory context in the search form. This way, you can use the same ICE code to conduct both local and global searches. For example, suppose you're running an internal server that contains several policy manuals and you want each of them to be searchable individually, as well as together. You could simply require that users of the system enter the optional directory context themselves; however, a more convenient way is to replace the optional directory context box with radio buttons that can be used to select the desired manual.

A more programming-intensive method is to provide a link to the search page on the index page of each manual. The URL in the link can include the optional directory context so that users don't have to enter this themselves. This way, when a user clicks the link to the search page from within a given manual section, the search form automatically includes the correct directory context. For example, you can tell the ICE search to look only in the /benefits directory by including the following hyperlink on the Benefits page:

<A HREF="/cgi-bin/ice-form.pl?context=%2Fbenefits>Search this manual</A>

NOTE
The slash (/) in front of benefits must be encoded in its ASCII representation ("%2F") for the link to work properly.

In order for this to work, you'll need to make the following necessary modifications to ice-form.pl:

Speed Enhancements

If the size of your index file grows larger than two or three megabytes, searches take several seconds to complete, due to the time required to read through the entire index file during each search. A simple way to improve this situation is to build several smaller index files (say, one for each major directory on your server), rather than one large one. However, this means you can no longer conduct a single, global search of your server.

A more attractive way to break up the large index file is to split it up into several smaller ones, where each small index file still contains an index for every file searched, but only those words beginning with certain letters. For example, ice-a.idx contains all words beginning with "a," ice-b.idx contains all words beginning with "b," and so on. This way, when a query is entered, the search engine is able to narrow down the search immediately based on the first letter of the query.

NOTE
In the event that your server outgrows the first-letter indexing scheme, the same technique can be used to further break up files by using unique combinations of the first two letters of a query, and so on.

In order to break up the large index file alphabetically, you need to modify the ICE index builder (ice-idx.pl) to write to multiple index files while building the index. The search engine (ice.pl) also needs to be modified to select the appropriate index file based on the first letter of the query, as sketched below.
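A minimal sketch of the selection logic on the search side might look like this; the ice-a.idx naming scheme and the variable names are assumptions carried over from the discussion above.

# Pick the index file based on the first letter of the query.
($first) = (lc($QUERY) =~ /([a-z])/);
$INDEXFILE = $first ? "ice-$first.idx" : "ice-other.idx";
open(IDX, $INDEXFILE) || die "Cannot open $INDEXFILE: $!";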

Searching for Words Near Each Other

Although ICE allows the use of AND and OR operators to modify searches, it only looks for words meeting these requirements anywhere in the same document. It would be nice to be able to specify how close to each other the words must appear, as well. The difficulty with this kind of a search is that the ICE index doesn't specify how close to each other words are in a document. There are two ways to overcome this.

First, you can modify the index builder to store word position information, as well as word count. For example, if the words "bad" and "dog" each occur three times in a file, their index entries might look like this:

bad 3 26 42 66
dog 3 4 9 27

In this case, 3 is the number of occurrences, and the remaining numbers indicate that "dog" is the 4th, 9th, and 27th word in the file. When a search for "bad dog" is entered, the search engine first checks if both "bad" and "dog" are in any documents, and then whether any of the word positions for "bad" are exactly one less than any of those for "dog." In this case, that is true, as "bad" occurs in position 26 and "dog" occurs in position 27.
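A small sketch of that positional test, assuming the two position lists have already been read from the index, might look like this:

# Return true if any position of the first word immediately
# precedes a position of the second word.
sub adjacent {
    my ($first_pos, $second_pos) = @_;    # references to position lists
    my %follows = map { $_ => 1 } @$second_pos;
    foreach my $p (@$first_pos) {
        return 1 if $follows{$p + 1};
    }
    return 0;
}

# Using the entries above: "bad" at 26, 42, 66 and "dog" at 4, 9, 27
print "bad dog found\n" if &adjacent([26, 42, 66], [4, 9, 27]);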

There's another way to search for words near each other. After a search is entered and files containing both words are found, those files can simply be read by the search program word-by-word, looking for the target words near each other. Using this method, the index builder itself doesn't have to be modified. However, the first method usually results in faster searches because the extra work is done primarily by the index builder rather than by the search engine in real-time.

WAIS

Yet another popular freeware search utility for Web and Gopher servers running on Windows NT is the European Microsoft Windows NT Academic Centre's (EMWAC) Wide Area Information Server (WAIS). It's included on the Webmaster CD.

WAIS Architecture

WAIS comprises three basic components:

The WAIS search engine implements features such as Boolean (AND, OR, NOT) searches and synonym files.

WAIS Operation

Operation of WAIS is similar to that of ICE. It involves creation and periodic updating of the index files.

NOTE
WAIS configuration information is set up using the WAIS Control Panel applet.

Figure 15.5 : WAIS Server configuration applet in the control panel.

Once the configuration information is set up, the index can be created using the WAISINDX program. WAISINDX can be used to create indexes intended purely for internal use within the site, or it can be run with the -export option, which lets us register the index with the directory of servers (the "database of databases"), thus opening our database to public use. To register, send the index.src file to the following e-mail addresses:

wais-directory-of-servers@cnidr.org
wais-directory-of-servers@quake.think.com

NOTE
To export a WAIS database and register it with the WAIS Database of databases, check the information in index.src; make sure it contains an IP address and a DNS name, as well as the TCP/IP port under which the WAIS Server is running.
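As a rough sketch only, building and exporting an index from the command line might look like the line below; the database name and file list are invented for the example, and the exact switches can differ between versions of the WAIS Toolkit, so check its documentation.

waisindx -d index -export phone.txt policies.txt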

Other Web Search Solutions

The Net offers a seemingly endless repertoire of search solutions to choose from, which makes it all the more important to study the feature sets of the various search systems carefully. We must decide on the one that best suits our Web site with respect to operating system, Web server, volume and value of content, security, and so on. The following list should serve as a basic checklist of things to consider before deciding on any one solution:

The following table shows a list of available commercial, shareware, and freeware Search systems that can be used on a Web site. It is important to note that this list is, by no means, exhaustive.

Product                  Company                      Address
Excite                   Architext Software           www.excite.com
Livelink Search          OpenText Corp                www.opentext.com
Verity                   Verity Inc                   www.verity.com
CompasSearch             CompasWare Development Inc   www.compasware.com
NetAnswer                Dataware Technologies Inc    www.dataware.com
Fulcrum Search Server    Fulcrum Technologies Inc     www.fultech.com

Including Content

A very desirable enhancement to a search system is the inclusion of some sort of summary of each document presented in the search results. The Lycos Web searcher does exactly this by displaying the first couple of sentences of each document on its search results page. This enables users to find the documents most relevant to their topic of interest quickly.

TIP
The Lycos Web searcher is located at http://lycos.cs.cmu.edu/.

To include summary content, store the first 50-100 words of every document in the index file created by the index builder. Doing this, however, requires yet more storage space for the index file, and therefore may not be desirable.
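One way the index builder could capture such a summary is sketched below in illustrative Perl; the 75-word cutoff, the @s record type, and the variable names are arbitrary choices for the example, not part of ICE.

# Grab roughly the first 75 words of the document as a summary record.
open(DOC, $filename) || die "Cannot open $filename: $!";
$text = join(' ', <DOC>);
close(DOC);
$text =~ s/<[^>]*>//g;                  # strip HTML tags
@words = split(/\s+/, $text);
splice(@words, 75) if @words > 75;      # keep at most 75 words
$summary = join(' ', @words);
print IDX "\@s$summary\n";              # hypothetical summary record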

Web Conferencing: Discussion and Annotation Systems

The World Wide Web was originally developed as a medium for scientific and technical exchange. One of the important elements of that exchange is the sharing of ideas about other people's work. This has been common on UseNet news for many years now, but articles are limited largely to plain ASCII text. The Web, with its superior hypertext presentation, presents opportunities for richer exchange, but has developed as a remarkably one-sided communications medium thus far. This is unfortunate for those who would like to take advantage of the Web's superior document capabilities along with the flexibility and interactivity of UseNet.

Why is the Web one-way?

In spite of various techniques, such as CGI scripting, the World Wide Web is still primarily a one-way medium, with the client issuing requests and the server supplying requested documents. These limitations are not fundamental to either HTTP or HTML. The ingredients necessary for world-wide annotation of Web documents and for posting new documents to servers are already in place, but they have not yet been widely implemented. There are, however, a few exceptions; we discuss these in the following sections.

Group Annotations

The most notable exception is NCSA Mosaic, which supported a feature called group annotations in the first few versions. This feature enables users to post text-only annotations to documents by sending annotations to a group annotation server, which NCSA provided with earlier versions of their Web server. Group annotations, however, have been abandoned in later versions of Mosaic in favor of the HTTP 1.0 protocol, which supports group annotations differently.

CGI and HTTP POST

The second exception is CGI scripting, which enables the server to both send and receive data. The data is usually simple text, such as a query or form information, but it can also be an entire document, such as an HTML file, spreadsheet, or even an executable program. The ability to post documents to CGI scripts, however, is not particularly useful, as of yet, because Web clients don't support it. What would be useful is an introduction of a <FILE> element to forms, which, when selected, would ask the user to specify the name of a local file to be sent to the server when the form is submitted. This would be a convenient way to upload documents to a Web server, similar to the way that documents are uploaded to CompuServe or bulletin board systems.

Because HTTP and HTML already support most (if not all) of the ingredients necessary for a more interactive Web, it's probably only a matter of time before these will be incorporated into browsers and servers alike. In the meantime, however, prototypes of what the future holds have been constructed using news, e-mail, and CGI scripts.

News and the Web

UseNet news makes available today in plain ASCII text some of what the Web will do tomorrow in HTML. News can effectively be used as a private or public tool for information exchange. Public newsgroups are the most familiar; with world-wide distribution, they enable anyone to post articles. By running your own news server, you can also create entirely private newsgroups (as for an internal bulletin board system) or semi-private groups, which the public can read but not post to. The capability to control who can read news and who can post to a local server makes news a useful tool for workgroup discussion.

TIP
Many Web browsers can both read and post news. This simplifies the use of both news and hypertext in an organizational context by providing a common interface for viewing both kinds of documents.

While news is an excellent medium for conducting entirely private (inside a corporate network) or entirely public conversations (UseNet), it's not as well suited for allowing discussions between a select group of individuals located all over the world. It's possible to create a special news server for this purpose and use password security to ensure that only the intended group of people can read or post news to the server. However, users of the system would be inconvenienced because most news readers expect to connect to one news server only. If users were already connecting to another news server to receive public news, they would have to change the configuration information in their news reader in order to connect to the special server. Fortunately, there are other answers to this problem.

Hypermail

E-mail is a more flexible method of having semi-private discussions among people all around a large Intranet. Using a mailing list server (list server), it is possible to create a single e-mail address for a whole group of people. When an item is sent to the mailing list address, it's forwarded to all members of the list. This approach has several advantages over running a news server, in addition to the previously mentioned convenience issue.

TIP
Through various e-mail gateways, it's possible to do almost anything by e-mail that can be done on FTP, Gopher, news, or the Web, only slower.

A very nice complement to a mailing list is a mailing list archive, which stores past items on the mailing list. Public mailing list archives can be stored on the Web for the benefit of later reference. A really powerful tool called hypermail converts a mailing list archive into a hypertext list of messages, neatly organized to show message threads. Mail archives converted with hypermail can be sorted by author, subject, or date.

TIP
A commercial mail server for Windows NT, which integrates other features such as List Server, Hypermail, and so on, is NTMail. Information on NTMail is available at http://www.mortimer.com/ntmail/default.htm.

TIP
Hypermail for UNIX is available free of charge under a license agreement, at http://www.eit.com/software/hypermail/.

Annotation Systems

While e-mail and news are both valuable tools for workgroup discussion, they still lack an important feature: the ability to make comments on a document in the document itself. In the paper world, this is accomplished with the infamous red pen. However, the equivalent of the editor's pen in the world of hypertext markup is just beginning to manifest. The ultimate in annotation would be the ability to attach comments, or even files of any type, anywhere inside an HTML document. For now, however, it's at least possible to add comments to the end of an HTML page. Several people are working on annotation systems using existing Web technology. The following sections take a brief look at a few of them.

HyperNews

Not to be confused with hypermail, HyperNews does not actually use the UseNet news protocol, but it allows a similar discussion format and is patterned after UseNet. You can see examples of HyperNews and find out more about it at http://union.ncsa.uiuc.edu/HyperNews/get/hypernews.html. Figure 15.6 shows a sample screen of a browser access to a HyperNews server.

Figure 15.6 : A sample HyperNews session.

W3 Interactive Talk (WIT)

A similar system originating at CERN allows new "proposals," or comments, to be submitted in response to a given document. This is a practical way for a group of engineers, for example, to discuss a document. Some degree of security is possible by requiring users to have a valid username and password before they can post comments. This can be combined with user authorization procedures to control who can see documents, as well. More information on W3 Interactive Talk is available at http://www.w3.org/hypertext/WWW/WIT/User/Overview.html.

Web Conferencing Systems

The glaring deficiency of the Web, namely that it has so far been a one-way street, has not gone unnoticed. Quite a few systems are available that employ the traditional client/server architecture to implement Web conferencing.

One commercially available Web conferencing product for Windows NT is WebNotes, from OS TECHnologies Corporation. WebNotes is a client/server solution in which the "client" is any HTML-capable Web browser (Mosaic, Netscape, and so on). The WebNotes server software maintains threads for each discussion topic, remembers which messages each user has already seen, and enables users to post discussion material either as text or as HTML documents with inline graphics. It also employs a text search engine that makes it possible to retrieve discussions based on the result of a search query. Figures 15.7 and 15.8 show sample screens of discussion threads and the general navigation concepts.

Figure 15.7 : A sample WebNotes discussion thread.

Figure 15.8 : WebNotes discussion thread drill-down.

NOTE
More information and a live demonstration of WebNotes can be found on OS TECHnologies' home page at http://www.ostech.com.

Yet another powerful freeware Web conferencing system, for UNIX, is COW (Conferencing on the Web).

Other Web conferencing systems that can be found on the Net include, but are not limited to:

Some of these systems also enable users to upload files to the server, allowing them to include inline images alongside the text of their messages.

Academic Annotation Systems

Many of the annotation-like systems on the Web today are academic in nature. At Cornell, a test case involving a computer science class allows students to share thoughts and questions about the class via the Web. Documentation on the Cornell system is available from http://dri.cornell.edu/pub/davis/annotation.html. The Cornell site also has useful links to related work on the Web. Some of the related systems that have been developed use custom clients to talk to an annotation database separate from the Web server itself, much like the early versions of Mosaic. This architecture may well be the future of annotations and the Web.

On the lighter side, take a peek at MIT's Discuss->WWW Gateway to get a behind-the-scenes look into an American hall of higher education. For a particularly novel and entertaining use of the Web, visit the Professor's Quote Board at http://www.mit.edu:8008/bloom-picayune.mit.edu/pqb/.