9
Agents: Web Scanning, Mirroring, and Background Tasks

by Brian Deng



This chapter focuses on agents that use the Web's HTTP protocol to perform automated tasks. Many Webmaster responsibilities, such as figuring out when links are stale, generating usage reports, generating search indexes, and mirroring sites, are easily automated using Perl. In addition to these server-related background tasks, consider the usefulness of client-side automation, such as retrieving up-to-the-minute information like news headlines or stock quotes.

This chapter shows you how to leverage existing Perl modules to make these automated tasks even easier. These are just a few examples, but you can apply what you learn here toward some tasks specific to your needs.

Retrieving Specific Documents from the Web

Retrieving documents from the Web is what everyone does when they surf the Web. The Web browser provides a nice front-end navigation tool for this type of interactive retrieval. You can also retrieve documents in an automated way by using the HTTP protocol within a Perl script. The most common example of this is retrieving stock quotes. You can think of Web servers as the information providers and the user agents as the information retrievers. Suppose a Web server provides up-to-the-minute news, sports scores, stock quotes, and so on. You can write a fairly simple script in Perl to monitor these Web sites and provide you with that up-to-date information.

Stock Quotes on the Hour

Stock quotes are, of course, the most obvious application for retrieving information from the Web. Public Web pages are available from which you can get the latest stock prices at the click of a button. This example shows you how to write your own customized Perl script to tell you the current stock price every hour on the hour. You can simply feed it stock symbols, and it retrieves the information, parses it, and displays only what you are interested in.

One Web site that provides stock quotes is the Security APL Quote Server. The URL for obtaining quotes is http://qs.secapl.com/cgi-bin/qso. After you spend some time figuring out the format of the data coming back, it's easy to come up with regular expressions for extracting the price, the percentage change, and the date and time. To specify a list of quotes to retrieve in the URL, append the string "?tick=symbol1+symbol2". This string contains the parameter list that is passed to the quote-serving CGI script. This particular site allows you to specify up to five stock symbols at a time. The data coming back contains the stock quotes separated by a horizontal line tag, <HR>. Each quote begins and ends with the pre-formatted text tags, <PRE> and </PRE>. The HTML in Listing 9.1 is a sample returned by the quote server for two stock symbols. Figure 9.1 shows this page in Netscape.

Listing 9.1. HTML returned by the Security APL Quote Server.

<HTML><HEAD>

<TITLE>Security APL Quote Server</TITLE>

<LINK REV="made" HREF="dhp@secapl.com"></HEAD>

<CENTER>

<NOBR>

<A HREF=http://www.secapl.com/HOMELink>

<IMG ALIGN=CENTER BORDER=0 SRC=http://www.secapl.com/qsImages/apl.gif 

     ALT=" Security APL" HEIGHT=80 WIDTH=90></A>

<B>

<FONT SIZE=+3> S</FONT><FONT SIZE=+1>ecurity <FONT SIZE=+3>APL</FONT>

<FONT SIZE=+3>Q</FONT>uote<FONT SIZE=+3>S</FONT>erver</FONT>

</B>

<HR WIDTH=600 SIZE=3 NOSHADE>

<A HREF=http://www.secapl.com/ADLink>

<IMG SRC=http://www.secapl.com/qsImages/barron.gif WIDTH=600 

 HEIGHT=60 BORDER=0 ALT="Barrons" BORDER=0></A><BR>

<HR WIDTH=600 SIZE=3 NOSHADE>

<FONT SIZE=-1>

<I>If your browser does not support tables, see the 

<A HREF=http://www.secapl.com/cgi-bin/qso><B>alternate Quote Server</B>

</A> page.</I>

</FONT>

<P>

<FORM METHOD="POST" ACTION="http://qs.secapl.com/cgi-bin/qso">

<A HREF=http://www.secapl.com/secapl/quoteserver/ticks.html>

<B>Ticker Symbols</A> : </B>

<I>(Up to 5 tickers may be entered separated by spaces)</I><BR>

<DD><INPUT NAME="tick" SIZE=30 MAXLENGTH=50>

<B><FONT COLOR=0000FF><INPUT TYPE="submit" VALUE="  Get Quotes  ">

</FONT></B></FORM>

</CENTER><PRE>

Symbol        : ADBE         Exchange    : NASDAQ

Description   : ADOBE SYSTEMS INC                       

Last Traded at: 35.1250      Date/Time   : Jul 05  1:01:34

$ Change      : -0.1250      % Change    : -0.35   



Volume        : 330300       # of Trades : 251      

Bid           : 35.1250      Ask         : 35.2500  

Day Low       : 35.1250      Day High    : 35.7500  

52 Week Low   : 30.0000      52 Week High: 74.2500  

</PRE><CENTER>

<A HREF="http://www.secapl.com/cgi-bin/edgarlink?'ADBE'">WWW hyperlinks

</A> for the symbol ADBE are available

including those from the

<A HREF="http://town.hall.org/edgar/edgar.html">EDGAR Dissemination Project.</A>

<HR>

</CENTER><PRE>

Symbol        : NSCP         Exchange    : NASDAQ

Description   : NETSCAPE COMMUNICATIONS CP              

Last Traded at: 58.2500      Date/Time   : Jul 05  1:00:36

$ Change      : -2.7500      % Change    : -4.51   



Volume        : 1000900      # of Trades : 803      

Bid           : 58.2500      Ask         : 59.0000  

Day Low       : 57.5000      Day High    : 60.2500  

52 Week Low   : 22.8750      52 Week High: 87.0000  

</PRE><CENTER>

<A HREF="http://www.secapl.com/cgi-bin/edgarlink?'NSCP'">

WWW hyperlinks</A> for the symbol NSCP are available

including those from the

<A HREF="http://town.hall.org/edgar/edgar.html">EDGAR Dissemination Project.</A>

<HR>

<P>

<A HREF=http://www.secapl.com/secapl/quoteserver/mw.html>

<FONT SIZE=+1><B>Market Watch</B></FONT></A> A Detailed Look at Market Activity

<BR>

<HR WIDTH=600 SIZE=3 NOSHADE>

<A HREF=http://www.secapl.com/PAWLink>

<IMG SRC=http://www.secapl.com/qsImages/qs.gif HEIGHT=30 WIDTH=600 BORDER=0 

ALT=></A><BR>

<HR WIDTH=600 SIZE=3 NOSHADE>

<B>

<A HREF=http://www.secapl.com/PORTVUELink> PORTVUE</A> -

<A HREF=http://www.secapl.com/NEWLink.html> WhatsNew</A> -

<A HREF=http://www.secapl.com/PAWLink>PAWWS</A> -

<FONT COLOR=777777>QuoteServer</FONT> -

<A HREF=http://www.secapl.com/PODIUMLink>PODIUM</A> -

<A HREF=http://www.secapl.com/SPONSORLink>Sponsored Sites</A>

</B>

<HR WIDTH=600 SIZE=1 NOSHADE>

<FONT SIZE=+1>

<B>

<A HREF=http://www.secapl.com/secapl/quoteserver/search.html>Ticker Search</A> -

<A HREF=http://www.secapl.com/secapl/qsq1.html>Questionnaire</A> -

<A HREF=http://pawws.com/C_phtml/calculators.shtml>Financial Calculators</A>

<HR WIDTH=100 SIZE=1 NOSHADE>

<A HREF=http://www.secapl.com/NEWLink>What's New</A></B></FONT>

  -- Jun 4 1996: <A HREF=http://www.secapl.com/APLACCESSLink>Security APLACCESS

</A> - electronic statement delivery via the Internet

<BR>

<HR WIDTH=600 SIZE=1 NOSHADE>

<HR WIDTH=100 SIZE=1 NOSHADE>

<FONT SIZE=-1>

<A HREF=http://www.secapl.com/HOMELink>

<B>Security APL</B></A>

<I>and</I> <B><A HREF=http://www.secapl.com/NAQLink>

North American Quotations, Inc.</B></A>

<I>make no claims concerning the validity<BR>

of the information provided herein, and will not be held liable 

for any use thereof.</I>

</B>

</FONT>

</NOBR>

<HR WIDTH=600 SIZE=3 NOSHADE>

<I><A HREF=http://www.secapl.com/HOMELink>Security APL</A><BR>

<A HREF=mailto:g.www@secapl.com>g.www@secapl.com</A></I>

</CENTER></BODY></HTML>

To make our lives a lot easier, we won't attempt to submit the URL request by using raw socket calls--even though this can be done in Perl. Instead, let's use the WWW (libwww-perl) modules available on CPAN. Two important modules are HTTP::Request and HTTP::Response. Here, you'll use the HTTP::Request module to package up the URL request and the HTTP::Response module to handle the data coming back. The other module needed here is LWP::UserAgent. This powerful module acts as our communications vehicle. The code to retrieve data from the URL looks as simple as Listing 9.2.

Figure 9.1. The Security APL Quote Server, seen in Netscape.

Listing 9.2. Automatic stock quote retriever (getquote.pl).

#!/public/bin/perl5



require LWP::UserAgent;

require HTTP::Request;



$anHour=60*60;

$symbols=join('+', @ARGV);

$url="http://qs.secapl.com/cgi-bin/qso?tick=$symbols";



$ua = new LWP::UserAgent;

$request = new HTTP::Request 'GET', $url;

while (1) {

   $response = $ua->request($request);

   if ($response->is_success) {

      &handleResponse($response);

   } else {

      &handleError($response);

   }

   # We want to receive quotes every hour.

   sleep $anHour;

}

As you can see, the symbols are passed in as arguments to this Perl script. They are then joined into a single string with each symbol separated by a plus sign. This string is appended to the URL, creating the full URL request for the specified stock symbols. That leaves only the handleResponse() and handleError() subroutines to implement. handleError() is rather easy. HTTP::Response provides a method called error_as_HTML(), which returns a string containing the error nicely packaged as an HTML document. You can either print the HTML as it is or ignore the error and continue. In this example, we will just fail silently and continue.
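
handleError() is not shown in a listing of its own, so here is one minimal sketch of what it might look like; the commented-out line shows how error_as_HTML() could be used instead of failing silently.

sub handleError {
   my($response)=@_;
   # Fail silently and continue, as described above.  To see the error
   # packaged as an HTML document instead, uncomment the following line.
   # print STDERR $response->error_as_HTML();
}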

Handling the response should be fairly straightforward given the sample HTML you've already seen in Listing 9.1. You simply need to write a loop that looks for the quote indicator strings, </CENTER><PRE> and </PRE><CENTER> (matched case-insensitively). These strings indicate where you are in the document. You can then use regular expressions to parse out the symbol, the current trading price, and other important values. handleResponse() is implemented in Listing 9.3. The output appears in Figure 9.2.

Listing 9.3. Subroutine to extract the quote information.

sub handleResponse {

   my($response)=@_;

   my(@lines)=split(/\n/,$response->content());

   $insideQuote=0;

   foreach (@lines) {

      if ($insideQuote) {

         if (/<\/pre><center>/i) {

            print "$symbol on $exchange is trading at $value on $dateTime\n";

            $insideQuote=0;

         } elsif (/^Symbol\s*:\s*(\S*)\s*Exchange\s*:\s*(.*)\s*$/) {

            $symbol=$1;

            $exchange=$2;

         } elsif (/^Last Traded at\s*:\s*(\S*)\s*Date\/Time\s*:\s*(.*)$/) {

            $value=$1;

            $dateTime=$2;

         }

      }

      if (/<\/center><pre>/i) {

         $insideQuote=1;

      }

   }

} 

Of course, you can add more code to parse out other returned values, such as the 52-week low and high values. This would involve just adding another elsif block and a regular expression to match the particular pattern.
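
For example, an elsif block along these lines (a sketch written against the format shown in Listing 9.1; $yearLow and $yearHigh are hypothetical variable names) could be slotted into the chain in Listing 9.3 to capture the 52-week range:

         } elsif (/^52 Week Low\s*:\s*(\S*)\s*52 Week High\s*:\s*(\S*)/) {
            $yearLow=$1;
            $yearHigh=$2;
         }

The print statement in the </PRE><CENTER> branch could then report $yearLow and $yearHigh along with the current price.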

Adapting the Code for General Purpose Use

The UserAgent module can prove useful in other examples as well. The code in Listing 9.2 that retrieved the stock quote can be turned into a general-purpose URL retriever. The next example does just that, adding the ability to send the request through a firewall. The code in Listing 9.4 should look quite familiar.

Figure 9.2. Output from the getquote.pl program.

Listing 9.4. General purpose URL retriever going through a firewall.

#!/public/bin/perl5



require LWP::UserAgent;

require HTTP::Request;



$ua = new LWP::UserAgent;

$ua->proxy('http',$ENV{'HTTP_PROXY'});



foreach $url (@ARGV) {

   $request = new HTTP::Request 'GET', $url;

   $response = $ua->request($request);

   if ($response->is_success) {

      &handleResponse($response);

   } else {

      &handleError($response);

   }

}

Listing 9.4 simply replaces the infinite loop with a foreach loop that iterates over the list of URLs to retrieve. You also may have noticed the line

$ua->proxy('http',$ENV{'HTTP_PROXY'});

This is how you can send a request through a firewall or proxy server. The mechanism used here is to define an environment variable called HTTP_PROXY. You could use a different approach, however, such as a hard-coded constant, or the proxy server could be passed into the script as an argument. The handleResponse() and handleError() subroutines are left unimplemented. They are what would turn this general-purpose URL retriever into something more useful, such as our stock quote retriever or a Web spider, as you'll see next, and they can be made specific to whatever suits your requirements.
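
As a sketch of the command-line approach, a hypothetical -proxy switch (the script name and proxy URL in the comment are placeholders) could be handled near the top of Listing 9.4, before the foreach loop:

# Hypothetical usage:  geturl.pl -proxy http://proxy.mycompany.com:8080/ url ...
if (@ARGV && $ARGV[0] eq '-proxy') {
   shift(@ARGV);                       # discard the switch itself
   $ua->proxy('http', shift(@ARGV));   # the next argument is the proxy URL
}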

You'll see how this general-purpose URL retriever can be put to practical use in the following examples. We will also explore some of the other powerful features that the LWP::UserAgent module provides.

Generating Web Indexes

The ability to generate a thorough Web index is a very hot commodity these days. Companies are now building an entire business case around the ability to provide Web users with the best search engine. These search engines are made possible using programs such as the one you'll see in the following example. Crawling through the Web to find all of the useful (as well as useless) pages is only one aspect of a good search engine. You then need to be able to categorize and index all of what you find into an efficient searchable pile of data. Our example will simply focus on the former.

As you can imagine, this kind of program could go on forever, so consider limiting the search to some reasonable depth. It is also important to abide by the accepted robot rules: honor the robot exclusion protocol, identify yourself with the User-Agent field, and notify the sites that you plan to target. This will make your Web spider friendly to the rest of the Web community and will prevent you from being blacklisted. An automated robot can cause an extremely large number of hits on a given site, so please be sensitive to the sites you wish to index.
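
Identifying yourself is easy with LWP::UserAgent: the agent() method sets the User-Agent header and the from() method supplies a contact address in the From header. A minimal sketch (the robot name and address are placeholders):

require LWP::UserAgent;

$ua = new LWP::UserAgent;
$ua->agent('MyWebSpider/0.1');           # announce who the robot is
$ua->from('webmaster@mycompany.com');    # where to complain if it misbehaves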

Web Robots (Spiders)

A Web robot is a program that silently visits Web sites, explores the links in that site, writes the URLs of the linked sites to disk, and continues in a recursive fashion until enough sites have been visited. Using the general purpose URL retriever in Listing 9.4 and a few regular expressions, you can easily construct such a program.

Several classes available in the LWP modules provide an easy way to parse HTML files and obtain the elements of interest. The HTML::Parse module allows you to parse an entire HTML document into a tree of HTML::Element objects. Our Web robot can use these classes to easily obtain the title and all of the hyperlinks of an HTML document. You will first call the parse_html() function from HTML::Parse to obtain a syntax tree of HTML::Element nodes (its companion, parse_htmlfile(), does the same for a file on disk; because the robot works from the response content in memory, it uses parse_html()). You can then use the extract_links() method to enumerate all of the links, or the traverse() method to walk through all of the tags. This example uses the traverse() method so we can locate the <TITLE> tag if it exists. The only other tag we are interested in is the anchor element, the <A> tag.
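
To see how these pieces fit together before diving into the spider itself, here is a small standalone sketch (the markup is made up for the example) that parses an HTML string and prints each link found in an anchor tag:

use HTML::Parse;
use HTML::Element;

$doc = '<HTML><HEAD><TITLE>Demo</TITLE></HEAD>' .
       '<BODY><A HREF="a.html">A</A> <A HREF="b.html">B</A></BODY></HTML>';

$html = parse_html($doc);                   # returns the root HTML::Element
foreach (@{$html->extract_links(qw(a))}) {  # each entry is [link, element]
   ($link,$linkelement)=@$_;
   print "$link\n";
}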

You can also make use of the URI::URL module to determine what components of the URL are specified. This is useful for determining if the URL is relative or absolute.

Let's take a look at the crawlIt() function, which retrieves the URL, parses it, and traverses through the elements looking for links and the title. Listing 9.5 should look familiar--it's yet another way to reuse the code you've seen twice already.

Listing 9.5. The crawlIt() main function of the Web spider.

sub crawlIt {

   my($ua,$urlStr,$urlLog,$visitedAlready,$depth)=@_;

   if ($depth++>$MAX_DEPTH) {

      return;

   }

   $request = new HTTP::Request 'GET', $urlStr;

   $response = $ua->request($request);

   if ($response->is_success) {

      my($urlData)=$response->content();

      my($html) = parse_html($urlData);

      $title="";

      $html->traverse(\&searchForTitle,1);

      &writeToLog($urlLog,$urlStr,$title);

      foreach (@{$html->extract_links(qw(a))}) {

         ($link,$linkelement)=@$_;

         my($url)=&getAbsoluteURL($urlStr,$link);

         if ($url ne "") {

            $escapedURL=$url;

            $escapedURL=~s/\//\\\//g;

            $escapedURL=~s/\?/\\\?/g;

            $escapedURL=~s/\+/\\\+/g;

            if (eval "grep(/$escapedURL/,\@\$visitedAlready)" == 0) {

               push(@$visitedAlready,$url);

               &crawlIt($ua,$url,$urlLog,$visitedAlready,$depth);

            }

         }

      }

   }

}



sub searchForTitle {

   my($node,$startflag,$depth)=@_;

   $lwr_tag=$node->tag;

   $lwr_tag=~tr/A-Z/a-z/;

   if ($lwr_tag eq 'title') {

      foreach (@{$node->content()}) {

         $title .= $_;

      }

      return 0;

   }

   return 1;

}



NOTE:

In this function, all of the my() qualifiers are very meaningful. Because this function is called recursively, make sure that you don't accidentally reuse any variables that existed in the previous call to the function. Another thing to note about this function is that errors are silently ignored. You could easily add an error handler that notifies the user of any stale links found.
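
As the note suggests, reporting stale links takes only a few lines. Here is a sketch of an else branch that could be added to the is_success test in crawlIt():

   } else {
      # A failed GET here usually means the link that led us to $urlStr
      # has gone stale; report it rather than ignoring it.
      warn "Possible stale link: $urlStr (", $response->code(), " ",
           $response->message(), ")\n";
   }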


The other important function you need to write is getAbsoluteURL(). This function takes the parent URL string and the current URL string as arguments. It makes use of the URI::URL module to determine whether or not the current URL is already an absolute URL. If so, it returns the current URL as is; otherwise, it constructs a new URL based on the parent URL. You also need to check that the protocol of the URL is HTTP. Listing 9.6 shows how to convert a relative URL to an absolute URL.

Listing 9.6. Converting a relative URL to an absolute URL.

sub getAbsoluteURL {

   my($parent,$current)=@_;

   my($absURL)="";

   $pURL = new URI::URL $parent;

   $cURL = new URI::URL $current;

   if ($cURL->scheme() eq 'http') {

      if ($cURL->host() eq "") {

          $absURL=$cURL->abs($pURL);

      } else {

         $absURL=$current;

      }

   }

   return $absURL;

}

The only remaining function besides the main program is writeToLog(). This is a very straightforward function: open the log file and append the title and the URL. For simplicity, write each on its own line, thus avoiding having to parse anything during lookup. All titles will be on odd-numbered lines, and each URL will be on the even-numbered line immediately following its title. If a document has no title, a blank line will appear where the title would have been. Listing 9.7 shows the writeToLog() function; a small lookup sketch follows it.

Listing 9.7. Writing the title and URL to the log file.

sub writeToLog {

   my($logFile,$url,$title)=@_;

   if (open(OUT,">> $logFile")) {

      print OUT "$title\n";

      print OUT "$url\n";

      close(OUT);

   } else {

      warn("Could not open $logFile for append! $!\n");

   }

}
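
To show how little work a lookup needs with this layout, here is a hypothetical companion script (call it searchindex.pl) that reads the index back two lines at a time and prints the entries whose titles match a pattern given on the command line:

#!/public/bin/perl5

# searchindex.pl <pattern> -- hypothetical lookup over the crawl index

$URL_LOG="/usr/httpd/index/crawl.index";
$pattern=shift(@ARGV);

open(IN, $URL_LOG) || die "Could not open $URL_LOG: $!\n";
while (defined($title=<IN>)) {
   $url=<IN>;                  # the URL always follows its title
   chomp($title);
   chomp($url);
   print "$title\n   $url\n" if ($title =~ /$pattern/i);
}
close(IN);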

Now you can put this all together in the main program. The program will accept multiple URLs as starting points. You'll also specify a maximum depth of 20 recursive calls. Listing 9.8 shows the code for specifying these criteria.

Listing 9.8. Specifying the starting points and stopping points.

#!/public/bin/perl5



use URI::URL;

use LWP::UserAgent;

use HTTP::Request;

use HTML::Parse;

use HTML::Element;



my($ua) = new LWP::UserAgent;

if (defined($ENV{'HTTP_PROXY'})) {

   $ua->proxy('http',$ENV{'HTTP_PROXY'});

}

$MAX_DEPTH=20;

$CRLF="\n";

$URL_LOG="/usr/httpd/index/crawl.index";

my(@visitedAlready)=();



foreach $url (@ARGV) {

   &crawlIt($ua,$url,$URL_LOG,\@visitedAlready,0);

}



NOTE:

There is another module available, WWW::RobotRules, that makes it easier for you to abide by the Standard for Robot Exclusion. This module parses a site's robots.txt file to find out whether robots are allowed to visit the site. For more information on the Standard for Robot Exclusion, refer to

http://www.webcrawler.com/mak/projects/robots/norobots.html
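
Here is a sketch of how the rules might be consulted (assuming the WWW::RobotRules interface that ships with libwww-perl; the host and URLs are placeholders):

require LWP::UserAgent;
require HTTP::Request;
require WWW::RobotRules;

$ua = new LWP::UserAgent;
$rules = new WWW::RobotRules 'MyWebSpider/0.1';   # same name as in User-Agent

# Fetch and parse the site's robots.txt once per host.
$robotsURL = 'http://www.mycompany.com/robots.txt';
$response = $ua->request(new HTTP::Request 'GET', $robotsURL);
$rules->parse($robotsURL, $response->content()) if ($response->is_success);

# Consult the rules before handing a URL to the spider.
$url = 'http://www.mycompany.com/somepage.html';
if ($rules->allowed($url)) {
   print "$url may be crawled\n";
} else {
   print "$url is off limits to robots\n";
}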




Mirroring Remote Sites

One task a Webmaster might want to automate is the mirroring of a site across multiple servers. Mirroring is essentially copying all of the files associated with a Web site and making them available at another Web site. This is done to prevent any major downtime from happening due to a hardware or software failure with the primary server. This is also done to provide identical sites across different locations in the world, so that a person in Beijing doesn't need to access a physical machine in New York but rather can access a physical machine in Hong Kong, which happens to be a mirror of the New York site.

Mirroring can be accomplished by starting at the home page of a server and recursively traversing through all of its local links to determine the files that need to be copied. Using this approach and much of the code in the previous examples, you can fairly easily automate the process of mirroring a Web site.

We will make the assumption that any link reference that is a relative URL rather than an absolute one should be considered local and thus needs to be mirrored. All absolute URLs will be considered documents owned by other servers, which we can ignore. This means that the following types of links will be ignored:

<A HREF=http://www.netscape.com>



<A HREF=ftp://ftp.netscape.com/Software/ns201b2.exe>



<A HREF=http://www.apple.com/cgi-bin/doit.pl>

However, these links will be considered local and will be mirrored:

<A HREF=images/home.gif>



<A HREF=pdfs/layout.pdf>



<A HREF=information.html>



<IMG SRC=images/animage.gif>

The LWP::UserAgent module contains a method called mirror(), which gets and stores a Web document from a server using the modification date and content length to determine whether or not it needs mirroring.
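
Used on its own, mirror() takes a URL and a local filename and returns an HTTP::Response; a short sketch (the URL and path are placeholders):

require LWP::UserAgent;

$ua = new LWP::UserAgent;
$response = $ua->mirror('http://www.mycompany.com/index.html',
                        '/usr/httpd/mirror/index.html');
if ($response->code() == 304) {
   print "Local copy is already up to date.\n";      # 304 Not Modified
} elsif ($response->is_success) {
   print "Fetched a fresh copy.\n";
} else {
   print "Mirror failed: ", $response->code(), " ", $response->message(), "\n";
}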

The changes you would need to make to the sample above are fairly minimal. For example, getAbsoluteURL() would be changed to return an absolute URL only for URLs local to the server you are mirroring, as shown in Listing 9.9.

Listing 9.9. Modified function to convert relative URLs to absolute URLs.

sub getAbsoluteURL {

   my($parent,$current)=@_;

   my($absURL)="";

   $pURL = new URI::URL $parent;

   $cURL = new URI::URL $current;

   if ($cURL->scheme() eq 'http') {

      if ($cURL->host() eq "") {

         $absURL=$cURL->abs($pURL);

      }

   }

   return $absURL;

}

The other change would be in crawlIt(), shown earlier in Listing 9.5. Instead of writing the URL and title to the log, the version in Listing 9.10 calls a subroutine named mirrorFile(), which uses the LWP::UserAgent mirror() method. You should also search for other file references, such as the image element, the <IMG> tag.

Listing 9.10. Modified crawlIt() function for mirroring a site.

sub crawlIt {

   my($ua,$urlStr,$urlLog,$visitedAlready)=@_;

   $request = new HTTP::Request 'GET', $urlStr;

   $response = $ua->request($request);

   if ($response->is_success) {

      my($urlData)=$response->content();

      my($html) = parse_html($urlData);

      $title="";

      $html->traverse(\&searchForTitle,1);

      &mirrorFile($ua,$urlStr);

      foreach (@{$html->extract_links(qw(a img))}) {

         ($link,$linkelement)=@$_;

         if ($linkelement->tag() eq 'a') {

            my($url)=&getAbsoluteURL($urlStr,$link);

            if ($url ne "") {

               $escapedURL=$url;

               $escapedURL=~s/\//\\\//g;

               $escapedURL=~s/\?/\\\?/g;

               $escapedURL=~s/\+/\\\+/g;

               if (eval "grep(/$escapedURL/,\@\$visitedAlready)" == 0) {

                  push(@$visitedAlready,$url);

                  &crawlIt($ua,$url,$urlLog,$visitedAlready);

               }

            }

         } elsif ($linkelement->tag() eq 'img') {

            my($url)=&getAbsoluteURL($urlStr,$link);

            if ($url ne "") {

               &mirrorFile($ua,$url);

            }

         }

      }

   }

}



sub searchForTitle {

   my($node,$startflag,$depth)=@_;

   $lwr_tag=$node->tag;

   $lwr_tag=~tr/A-Z/a-z/;

   if ($lwr_tag eq 'title') {

      foreach (@{$node->content()}) {

         $title .= $_;

      }

      return 0;

   }

   return 1;

}



sub mirrorFile {

   my($ua,$urlStr)=@_;

   my($url)=new URI::URL $urlStr;

   my($localpath)=$MIRROR_ROOT;

   $localpath .= $url->path();

   $ua->mirror($urlStr,$localpath);

}

This example of mirroring remote sites might be useful for simple sites with only HTML files. If you have the need for a more sophisticated remote mirroring system, it would be best to use a UNIX-based replication tool like rdist for your site. If you are running a Windows NT server, there are replication tools available for these systems as well.

Summary

As you have seen in this chapter, writing user agents that automate operations against Web servers can be greatly simplified using the LWP::UserAgent module. It is important to note, however, that the examples you have seen here work only with HTML documents. As Web content grows richer to include non-text document formats (such as PDF), it will become more important to add more advanced indexing capabilities by leveraging work that has already been done in Perl 5.