Chapter 12

CGI and Perl


Up to this point, we have focused on those aspects of Intranet design that involve presenting data and documents to users. HTML forms, discussed in the last chapter, move us into the realm of interactive Web pages. In the next several chapters, we turn our attention to Intranet programming.

Programs can run in one of two places on an Intranet: on the server, or on the client. The trend is toward client-side execution, which gives better performance and allows local events, such as data entry or a mouse click, to directly influence execution. The chief problem with client-side techniques is lack of standards. The technology is being pioneered by individual vendors, each with a proprietary language of its own.

Examples include Netscape JavaScript and Microsoft VBScript, interpreted languages that run, respectively, within the Navigator client (2.0 or later) and Internet Explorer client (3.0 or later). Java applets are platform-independent programs that execute on a special type of client (called a virtual machine) within the Web browser. And from Microsoft we have ActiveX controls, software components similar to Java applets that can move between client and server to perform application functions.

In this chapter the emphasis is on server applications, particularly those written in Perl, a popular scripting language available on almost every platform. After reading this chapter, you'll know:

CGI Programming

Right away, our intention to run programs on the server runs into a snag. Web servers are tailored to provide HTTP services, not to perform complex processing of client data. To accomplish such tasks, the server must hand off the job to an external program called a gateway program. The standard protocol governing this exchange is called the Common Gateway Interface (CGI).

NOTE
The CGI standard is documented online at http://hoohoo.ncsa.uiuc.edu/cgi/intro.html.

Gateway programs are called from within a browser just as any other Web resource: by URL. For example:

http://www.innergy.com/myscript.pl

When the server receives the URL of a gateway program, such as myscript.pl above (the .pl extension typically indicates a Perl script), the server runs the program, using CGI to pass parameters forwarded by the client, if any. The program likewise uses CGI to return result data to the server, which forwards it, using the appropriate MIME content type, to the requesting client.

What kind of programs can the server run? Theoretically, any program that performs, reads, and writes in conformance with the CGI standard. This includes compiled executables, such as C programs or Windows DLLs, as well as interpreted scripts written in any of the Unix shells, in Perl, or, on the Macintosh, in AppleScript.

The focus here is on scripts, but the same principles govern the design and execution of binary programs. As long as the program is designed to accept input and return output in accordance with the gateway interface standard, it can cooperate with a Web server to get your work done.

Understanding the Common Gateway Interface

CGI is a standard specified and maintained by the National Center for Supercomputing Applications (NCSA). The current version, CGI/1.1, designates four distinct methods by which Web servers and gateway programs can communicate:

Method: Environment variables
Description: Set when the server executes a gateway program, these can be read by most scripts using native commands or operating system calls.
Method: Command line
Description: Search queries are the only type of request conveyed to a gateway program via the command line.
Method: Standard input (STDIN)
Description: Forms and other requests that pass data to a gateway program do so via standard input.
Method: Standard output (STDOUT)
Description: Gateway programs write results to standard output, which the server reads. This might be a document or image generated on-the-fly by the program, or instructions to the server to access another URL.

Three of these methods specify input; the last, output. Any program that runs on the Web server can serve as a gateway program, provided it conforms to the following output rule: its output must begin with one of three headers, followed by a blank line.

Content-type headers are used to return data to the browser. You'll see more of them in the section, "Generating HTML on-the-Fly."

Location headers direct the browser to another Web resource once the gateway routine exits. This might be a page of HTML, such as a thank-you-for-completing-our-form message.

Status headers, the rarest of the bunch, are used to return a particular HTTP status code to the calling browser. The following line is an example that tells the browser not to do anything, useful for unallocated regions in imagemaps, for instance:

Status: 204 No Response
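
Here is a minimal sketch of how a script might emit each of the three headers. The helper name cgi_header() and the thank-you URL are made up for illustration; they are not part of the CGI standard or of any library.

```perl
#!/usr/local/bin/perl
# Sketch: build one of the three CGI response headers.
# cgi_header() and the sample URL are hypothetical, for illustration only.

sub cgi_header {
    my ($kind, $value) = @_;
    # Every header must be followed by a blank line, hence "\n\n".
    return "Content-type: $value\n\n" if $kind eq 'content';
    return "Location: $value\n\n"     if $kind eq 'location';
    return "Status: $value\n\n"       if $kind eq 'status';
    die "unknown header kind: $kind\n";
}

# Redirect the browser to a (hypothetical) thank-you page.
print cgi_header('location', 'http://www.innergy.com/thanks.html');
```

A script sends only one such header block per request, so a real program would choose among the three cases rather than print them all.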

Input can be passed to a gateway program via the environment, the command line or <STDIN>. Command-line exchanges are used exclusively for non-form queries, not discussed in this book. (The method is less important now that forms-capable browsers are everywhere.)

Forms submitted with the HTTP POST method make their data available to scripts on standard input. This process is described in the section, "Form-Processing Scripts."

The set of standard CGI environment variables is listed for reference in the next section.

NOTE
Depending on whose Web server you use, you may have access to several non-standard CGI variables in addition to those shown here. Refer to your server documentation for details.

Environment Variables

Environment variables are a general feature of most operating systems, widely supported by programming languages. They therefore represent a flexible and broadly available means of data exchange. The CGI standard specifies a set of variables supported by all Web servers, in the following list:

Variable: SERVER_SOFTWARE
Description: Name and version of the server software making the CGI call.
Example: SERVER_SOFTWARE = NCSA/1.3
Variable: SERVER_NAME
Description: The server's hostname, DNS alias, or IP address.
Example: SERVER_NAME = que.mcp.com
Variable: GATEWAY_INTERFACE
Description: Version of the CGI specification.
Example: GATEWAY_INTERFACE = CGI/1.1
Variable: SERVER_PROTOCOL
Description: Name and revision of the protocol the CGI request was issued under. For Web servers, the protocol will be HTTP; but CGI works on other server types as well (such as Gopher).
Example: SERVER_PROTOCOL = HTTP/1.0
Variable: SERVER_PORT
Description: Port number to which CGI request was sent, often 80 for HTTP.
Example: SERVER_PORT = 80
Variable: REQUEST_METHOD
Description: HTTP method used to make the CGI request: GET, HEAD, or POST.
Example: REQUEST_METHOD = POST
Variable: PATH_INFO
Description: Extra path information appended to the calling URL by the client.
Example: A client calls the gateway program foo with the URL http://www.innergy.com/cgi-bin/foo/whitepaper/may15. The server treats everything after the program name as extra path info. Thus:
PATH_INFO = /whitepaper/may15
Variable: PATH_TRANSLATED
Description: Servers often use aliases or relative addressing to shorten pathnames. PATH_TRANSLATED gives the absolute location on the server file system of the path specified by PATH_INFO.
Example: If a gateway program is called with URL http://www.innergy.com/cgi-bin/foo/whitepaper/may15, and the server root has an absolute path /bin/httpd/docs, the translated path would be:
PATH_TRANSLATED = /bin/httpd/docs/whitepaper/may15
Variable: SCRIPT_NAME
Description: Path to the script being executed as it would be specified in an URL.
Example: /cgi-bin/foo
Variable: QUERY_STRING
Description: Information following a question mark ("?") in the calling URL. Empty if no such data is passed. Forms submitted with the POST method send their data on standard input instead.
Example: Say a gateway program is called with URL http://www.innergy.com/cgi-bin/foo?gordon+susan. Then:
QUERY_STRING = gordon+susan
Variable: REMOTE_HOST
Description: The hostname making the request. If the server does not have this information, it should set REMOTE_ADDR and leave this unset.
Example: REMOTE_HOST = slip-08.shore.net
Variable: REMOTE_ADDR
Description: IP address of the remote host making the request.
Example: 192.233.85.130
Variable: AUTH_TYPE
Description: The protocol-specific authentication method used to validate the user. Empty unless the server supports user authentication, and the called program requires it.
Example: AUTH_TYPE = Basic
Variable: REMOTE_USER
Description: The authenticated username. Empty unless the server supports user authentication, and the called program requires it.
Example: REMOTE_USER = aeinstein
Variable: REMOTE_IDENT
Description: Set to the remote user name retrieved. Empty unless the HTTP server supports RFC 1413 identification (identd, Unix only).
Example: REMOTE_IDENT = 6193, 23 : USERID : UNIX : stjohns
Variable: CONTENT_TYPE
Description: The MIME content type of data forwarded by the requesting client, if any. Blank unless HTTP methods POST or PUT are being used. (Actual data is available on standard input.)
Example: text/html
Variable: CONTENT_LENGTH
Description: The length of the data message forwarded by the requesting client, if any. Blank unless HTTP methods POST or PUT are being used.
Example: CONTENT_LENGTH = 23
NOTE
The foregoing environment variables are set by the server itself. In addition, all HTTP header information received from the client is placed into the environment, in a set of variables named with prefix HTTP_ followed by the header field name. Examples follow:


Variable: HTTP_ACCEPT
Description: Comma-separated list of MIME types acceptable to the client, as indicated by the client's Accept HTTP headers.
Format: type/subtype, type/subtype
Example: HTTP_ACCEPT = image/gif, image/x-xbitmap, image/jpeg
Variable: HTTP_REFERER
Description: URL of the document from which the request originated.
Example: HTTP_REFERER = http://que.mcp.com/newbooks.htm
Variable: HTTP_USER_AGENT
Description: Browser the client is using to send the request.
Example (for a client using Netscape Navigator 2.0 for Windows 3.1):
HTTP_USER_AGENT = Mozilla/2.0 (Win16)
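
Perl exposes the environment through the built-in hash %ENV, so a gateway script can read any of these variables by name. The following is a rough sketch of a script that reports a handful of them as plain text; env_report() is a hypothetical helper name, and the choice of variables is arbitrary.

```perl
#!/usr/local/bin/perl
# Sketch: report a few CGI environment variables as plain text.
# env_report() is a hypothetical helper name.

sub env_report {
    my @names = @_;
    my $out = '';
    foreach my $name (@names) {
        # Unset variables are shown as "(not set)" rather than left blank.
        my $value = defined $ENV{$name} ? $ENV{$name} : '(not set)';
        $out .= "$name = $value\n";
    }
    return $out;
}

print "Content-type: text/plain\n\n";
print env_report('SERVER_NAME', 'REQUEST_METHOD', 'QUERY_STRING', 'HTTP_USER_AGENT');
```

Run from the command line rather than through a server, most of these will show as not set, which is a quick way to confirm that the server is what supplies them.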

The best way to understand the way CGI works is to see how it's used in scripts. The most popular language for processing HTML forms is Perl. Let's see how it's done.

Introducing Perl

Great computer languages are rarely the work of committees. More often than not, they spring fully clad from the brow of a great thinker or two. Niklaus Wirth was solely responsible for Pascal, for instance. Dennis Ritchie of Bell Labs wrote C, and he and Ken Thompson wrote much of Unix in it. The power of Unix owes much to its rich, evolving toolset, contributed over two decades by software inventors like Stephen Bourne and Aho, Weinberger, and Kernighan.

Perl is a tool in this tradition. It was created by Larry Wall, a systems programmer at the Jet Propulsion Laboratory, as "a language for easily manipulating text, files, and processes" [1] (Camel). Legend has it that Perl stands for Practical Extraction and Report Language, though reputable sources refer to it as a Pathologically Eclectic Rubbish Lister (Camel & Llama). You can choose for yourself once you've experienced Perl firsthand.

[1] Quotes in this chapter are marked either "Camel" to denote Programming Perl, by Larry Wall and Randal L. Schwartz (O'Reilly & Associates, 1991), or "Llama," to denote Learning Perl, by Randal L. Schwartz (O'Reilly & Associates, 1993).

Many Unix systems already have Perl installed. To find out if the one you're working on does, enter:

which perl

If Perl is available, you'll get a response like /usr/local/bin/perl. If not, you'll need to download and install it. The sidebar "Getting Perl" covers the basics.

Getting Perl
The latest version of Perl (5.003 at this writing) can be downloaded from The Perl Language Web site, at http://www.perl.com/perl/info/software.html.
Perl is distributed as source code that compiles for virtually all flavors of UNIX (its native environment), VMS and OS/2. The site has links to a wide variety of what it calls "alien ports": from Atari, to MVS, to NetWare.
The definitive Perl for Windows NT is available from Hip Communications Inc.
Binaries for NT on Intel, Alpha, and PowerPC can be acquired at http://www.perl.hip.com/.
If you are installing Perl on a Windows platform, refer to the Perl for Win32 FAQ, at http://www.perl.hip.com/PerlFaq.htm.

One thing about which there can be no contention: Perl is powerful. As an example, consider the following Unix command line, which replaces all occurrences of the string que.com with que.mcp.com inside every HTML file in the current directory:

perl -pi.0 -e 's#que\.com#que\.mcp\.com#g' *.htm

The command also backs up each original file by appending the extension .0 to its name (index.htm becomes index.htm.0). If a single instruction can do this, you can imagine what longer Perl scripts can do.

The price for this economy of expression is complexity. It's not that the programming constructs in Perl are especially tricky. Like any procedural language, Perl has variables, arrays and operators, commands for manipulating files and controlling program flow, and many of the syntactic niceties of C (such as '++' for auto-incrementation). Perl adds to these primitives many of its own, notably associative arrays (also called "hashes"), a type of list structure optimized for table lookups. Like almost every Perl construct, these take some getting used to, but quickly become intuitive.

Newcomers to Unix will find the going tougher when it comes to Perl's extensive use of regular expressions. A regexp is a pattern to be matched against a string. Sophisticated pattern matching is a basic feature of Unix tools like grep. Perl uses this feature to enable sophisticated file editing on-the-fly. Pattern matching thus plays a big role in CGI routines and HTML form processing.

Unfortunately, learning regexp syntax is more an exercise in memorization than logic. The peculiar string of symbols in the previous example is a simple regexp. Here's another:

$var =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;

The string between the first pair of slash marks is a regular expression; the string that follows it is the replacement, which the e modifier tells Perl to evaluate as an expression. This line, which comes from a working CGI script, cleans up URL-encoded input by replacing characters of type "%xx" (where xx is a pair of hexadecimal digits) with an ASCII equivalent.
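
To see the substitution at work, here is a small sketch that wraps it in a subroutine. The name url_decode() is hypothetical, and the extra step that turns "+" into a space is an assumption about how the input was encoded (it is how browsers encode spaces in form data).

```perl
#!/usr/local/bin/perl
# Sketch: decode one URL-encoded string. url_decode() is a hypothetical name.

sub url_decode {
    my ($text) = @_;
    $text =~ tr/+/ /;    # "+" stands for a space; do this before decoding %2B
    $text =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;    # %xx -> character
    return $text;
}

print url_decode('Hello%2C+World%21'), "\n";    # prints "Hello, World!"
```

The plus signs are translated first so that a literal plus sign, which arrives encoded as %2B, survives the round trip.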

The good news is that this translation, like other routine CGI processing steps, has been enshrined in one or more library scripts, available free of charge on the Internet. This won't eliminate the need for you to learn regexps, but at least you'll have expert examples to study.

Writing CGI Scripts in Perl

Let's look at a very short Perl script:


Listing 12.1. The classic program 'Hello, World' in Perl
1.     #!/usr/local/bin/perl
2.     # My 1st Perl script
3.     
4.     print "Hello, World!\n";

  1. Any line that begins with the "#" sign is a comment, unless it's the very first line of a script. The first line starts with the characters '#!' followed by the absolute pathname of the perl interpreter on your server.
  2. A comment, ignored by Perl.
  3. A blank line, ignored by Perl.
  4. A command that writes the string "Hello, World!" followed by a newline character to STDOUT. Executable Perl statements must end with a semicolon.

The print statement is the most important output command from the standpoint of CGI scripting. It writes to STDOUT by default, but can be directed to any open filehandle as follows:

print FILEHANDLE "Your string here";

If you run Listing 12.1 off the CD-ROM, you should see the result "Hello, World!" on your screen.

NOTE
To run Perl scripts from the command line, enter:
perl script_name
To run the script directly by name instead (script_name rather than perl script_name), the script file must be executable. Under Unix, you can make a file executable by all users with the following command:
chmod 755 script_name
To run CGI programs (Perl or otherwise) from a web browser, enter the program's URL. For instance, to launch a script named my_1st_perl in directory http://xyz.com/cgi-bin, enter:
http://xyz.com/cgi-bin/my_1st_perl
The script's permissions must allow execution by all users.

But is this a CGI script? It can't be, since it doesn't preface printed output with HTTP header information. We can fix that in a jiffy, though:


Listing 12.2. 'Hello, World' as a CGI script.
#!/usr/local/bin/perl
# My 1st Perl script
print "Content-type: text/html\n\n";
print "Hello, World!\n";

Here the MIME content type text/html is sent to the browser before any displayable data. Note the double newline ('\n\n') terminating the header; this provides the blank line required by the CGI standard. The browser interprets any output received following a content-type header as content of that type (HTML in this case).

Here's another example:


Listing 12.3. On-the-fly Web page generation.
#!/usr/local/bin/perl
# My 1st Perl script
print "Content-type: text/html\n\n";
print "<html><head><title>My 1st Perl Script</title></head>\n";
print "<body><h1>The Message You've Been Waiting For:</h1>\n";
print "<h3>Hello, World!</h3>\n";
print "<pre>\n\n\n\n\n<center>\n";
print "-- This space intentionally left blank --\n";
print "</center>\n\n\n\n\n</pre>\n";
# The @ must be escaped, or Perl treats @innergy as an array to interpolate
print "<hr>Your webmaster, <address>tgroup\@innergy.com</address>\n";
print "</body></html>\n";

When called from a browser, this script returns a bona fide Web page, shown in figure 12.1.

That's really all there is to it. Of course, Perl can do a few tricks besides printing strings to standard output. Let's take a short survey of the language, then use some of its features to work more CGI magic.

A Perl Primer

What's the best way to get your arms around a new computer language? Here we follow a divide-and-conquer strategy, breaking Perl into its essential elements. These include:

NOTE
Looking for an online reference? You're in luck: Carnegie-Mellon University maintains a hypertext version of the Perl 4 manual at http://www-cgi.cs.cmu.edu/cgi-bin/perl-man.

Once you know how Perl handles these basic constructs, you'll be poised to start reading scripts you find on the Web and customizing them to do your bidding. Then, with a little practice (okay, a lot of practice), you'll be writing Perl scripts of your own.

Data and Operators in Perl

Perl distinguishes three types of data: scalars, arrays, and associative arrays. Scalar data, the simplest kind, includes numbers and strings. Perl scalars begin with a dollar sign:

$numeric_var = 4;
$big_number = 4000000;
$tiny_number = 3.14E-12;     # 0.00000000000314
$string = "Bartleby the Scrivener";
$char = 'c';

The difference between single- and double-quoted strings is important in Perl. Single quotes are stronger; almost any character appearing between them is to be taken literally. There are two exceptions. To indicate a single-quote mark in a string, precede it with a backslash: \'. To indicate a backslash, precede it with a backslash: \\. For example:

$single = 'this_so-called \'string\' contains_a_single_quote';
$backslash = 'this string contains a backslash: \\';

Double quotes can contain special control characters, similar to those in the C language:

Construct   Meaning
\n          newline
\r          carriage return
\t          tab
\b          backspace
\cZ         control character (here, ^Z)
\\          backslash
\"          double-quote mark
\l          lowercase next letter
\L          lowercase letters until \E
\u          uppercase next letter
\U          uppercase letters until \E
\E          end \L or \U

In addition, double-quoted strings are variable interpolated: variable names are replaced with their current values when the strings are used. For instance:

$x = 4;
$y = 'eva';
$z = "Love you $x$y";

assigns "Love you 4eva" to variable $z.
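
The contrast between the two quoting styles is easiest to see side by side. This sketch reuses the variables above; the third string is an added illustration of the braces Perl accepts when a variable name runs straight into other letters.

```perl
my $x = 4;
my $y = 'eva';

my $double = "Love you $x$y\n";    # interpolated; \n becomes a newline
my $single = 'Love you $x$y\n';    # taken literally, including the \n
my $braced = "${y}ngelist";        # braces mark where the variable name ends

print $double;
print $single, "\n";
print $braced, "\n";
```

Without the braces, Perl would look for a variable named $yngelist, which does not exist here.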

Scalar operators in Perl include numeric arithmetic (+, -, *, /), exponentiation (**) and a modulus operator (%). The obvious numeric comparisons are available (<, <=, ==, >=, >, !=). In addition, Perl has the friendly autoincrement (++) and autodecrement (--) operators from C. For example:

$i = 9; $j = 0; 
$i + $j;      # equals 9
$i ** 3;     # is nine cubed, or 729
$i + ++$j;     # equals 10, since $j is prefix autoincremented
$i %  5;     # equals 4, the remainder of 9/5

In the first line variable $i and $j are assigned values. Line two shows addition, line three exponentiation. In line four, variable $j is incremented before adding it to $i. Line five shows the modulus operator in action.

String operators include concatenation, signified by a dot:

"World" . "Wide" . "Web"     # equals "WorldWideWeb"

and repetition, signified by a lowercase 'x':

"Magic" x 4     # is "MagicMagicMagicMagic"

Arrays are ordered lists of scalars. Array literals are enclosed in parentheses, while array variables begin with an at-sign ('@'):

@mylist = (1, 2, 3);     # assigns three elements to variable @mylist
@yourlist = @mylist;     # copies three elements to @yourlist
$mylist[1];          # is scalar 2 (indexing starts at zero)

The first line assigns three scalar elements to the array @mylist. The second line copies these elements to another array variable, @yourlist. The third line shows how elements in an array can be referenced by index. The first element of a Perl array has index "0"; hence, the second element has index "1".

One of Perl's conveniences is that array variables can be used without declaration or initialization; they can shrink to zero or grow to the size of available memory as items are removed or added.

By convention, the index of the last element of array @list is given by the scalar $#list. Perl offers a number of useful array operators as well. To use an array like a stack, adding and deleting elements at its end, there are the LIFO operators push() and pop(). The operators unshift() and shift() do the same at the front of the array. To manipulate an array like a FIFO queue, add elements with push() and remove them with shift().
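
A short sketch of the four operators follows. The comments trace the contents of the array after each step; the variable names are arbitrary.

```perl
my @list = (1, 2, 3);

push(@list, 4);               # add to the end:        (1, 2, 3, 4)
my $last = pop(@list);        # remove from the end:   4, leaving (1, 2, 3)
unshift(@list, 0);            # add to the front:      (0, 1, 2, 3)
my $first = shift(@list);     # remove from the front: 0, leaving (1, 2, 3)

# A queue: add at the end with push, remove from the front with shift.
push(@list, 'new');
my $served = shift(@list);    # 1, the oldest element

print "last index is $#list\n";
```

Note that push() paired with pop() gives last-in, first-out behavior, while push() paired with shift() gives first-in, first-out.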

A third major data type is a special class of array called an associative array (or hash). What makes hashes special is their indexing scheme. Instead of a numeric value counting a certain number of elements into the array, associative indexes are strings (called keys) mapped one-to-one to the elements (called values). The effect is a data store optimized for lookups.

Perl provides two useful operators for associative arrays. The keys() operator returns a list of keys, while values() returns the elements.

NOTE
In Perl, the variables $var, @var and %var are not only of different types, but are completely unrelated. This expands the namespace at some risk of confusion, but the meaning is always clear from the notation.

Hash variables begin with a percent sign (%), and are indexed with curly braces rather than square brackets. Here are some examples:

%java = ("strong", "kona", "medium", "kenya", "light", "carib");
$java{ "medium" };     # equals "kenya"
$java{ "extra" } = "sumatra";     # creates new key "extra" with value "sumatra"
@strength = keys( %java );     # assigns the four keys to @strength, in no particular order
$strength = "extra"; print $java{ $strength };     # prints "sumatra"
print $java{ $strength[0] };     # prints the value stored under the first key in @strength

The first line defines a hash, %java, that associates a list of adjectives (strong, medium, light) with a list of coffee types (kona, kenya, carib). The second line shows how values are referenced; the keyword "medium" selects the value "kenya", for instance. In the third line, a new value, "sumatra", is added to the hash simply by associating it with the keyword "extra". Perl allocates the required memory on its own.

The fourth line of the example shows how the Perl function keys() operates on hash %java to create a regular array, @strength, containing the hash's keys. The fifth and sixth lines show two ways of supplying a key: through a scalar variable, and through an element of an array. Note that $strength[0] is the first element of the (regular) array @strength and is unrelated to the scalar $strength. Because keys() returns the keys in no guaranteed order, the sixth line prints whichever value belongs to the key that happens to come first.
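
A few other hash operations come up constantly in scripts. The sketch below is an addition to the chapter's example, using the standard Perl functions sort, exists, and delete on the same coffee data.

```perl
my %java = ("strong", "kona", "medium", "kenya", "light", "carib");
$java{"extra"} = "sumatra";

my @sorted = sort keys %java;       # sorting gives a predictable order
my $count  = scalar(keys %java);    # number of key/value pairs

print "No decaf here\n" unless exists $java{"decaf"};
delete $java{"light"};              # removes the key and its value
print join(", ", @sorted), "\n";
```

Sorting the keys is the usual remedy when output must appear in the same order on every run.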

Control Structures in Perl

In its control notation, Perl borrows even more heavily from C than elsewhere. Those familiar with C can, therefore, forge ahead with confidence (excepting the switch{} statement, which Perl lacks). Others familiar with structured programming, but not with C per se, should find the material intuitive.

Structured programming is enabled by blocks, conditionals, and loops. Let's look at how Perl implements each of these in turn.

Any group of Perl statements enclosed in curly braces becomes a statement block, which can play the role of a single statement. The standard conditional is the if-then-else construct. Conditions must be enclosed in parentheses, and blocks require curly braces. Another conditional is unless, which behaves inversely to if. The following example illustrates these concepts:


Listing 12.4 Using Perl conditionals.
if ($name eq "Microsoft") {
     print "Welcome to Big Green!\n";
     $ms++;
} elsif ($name eq "IBM") { # elsif is a Perlism for 'else if'
     print "Welcome to Big Blue!\n";
     $ib++;
} else {
     print "Welcome to Burger King!\n";
     $bk++;          # count visitors
}
print "Total visits:\n";
print "Microsoft\t$ms\n";
print "IBM\t\t$ib\n";
print "Burger King\t$bk\n";


Listing 12.5 The unless statement.
unless ($x < 0) {
     print "$x is non-negative\n";
     $x--;     # decrease by 1
}

Few programs execute linearly from start to finish without some sort of iteration, or looping. Perl has several looping constructs, summarized below.

The while statement repeats a block as long as a given condition remains true. The condition is tested by evaluating an expression once per loop. It's used as follows:

LABEL: while (EXPR) {
	# code to repeat goes here
}

The optional LABEL facilitates control of nested loops.
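
To see what a label buys you, consider this small sketch of two nested while loops. The commands next and last are standard Perl; the label name OUTER and the counting task are invented for illustration.

```perl
my $pairs = 0;
my $i = 0;
OUTER: while ($i < 5) {
    $i++;
    my $j = 0;
    while ($j < 5) {
        $j++;
        next OUTER if $j > $i;    # skip ahead to the next pass of the outer loop
        last OUTER if $i == 4;    # leave both loops at once
        $pairs++;                 # counts pairs with $j <= $i
    }
}
print "Counted $pairs pairs\n";
```

Without the label, next and last would affect only the innermost loop.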

A similar loop construct, until, repeats as long as a given expression is false.

LABEL: until (EXPR) {
	# code to repeat goes here
}

Note that for both while and until, the truth value of the expression EXPR is tested before the first iteration. To execute the loop once before testing, use the do command:

do {
	# code to repeat goes here
} until (EXPR);

The following examples show how these constructs are used.


Listing 12.6 The Camel Countdown using while.
#!/usr/local/bin/perl
$countdown = 10;
while ($countdown > 0) {
	print "T minus $countdown seconds and counting ...\n";
	sleep 1;	# Perl command that causes program to pause for 1 second
	--$countdown;	# decrement count by one
}
print "BLAST OFF!\n";


Listing 12.7 The Camel Countdown using until.
#!/usr/local/bin/perl
$countup = 0;
until ($countup == 10) {
	# In this example we use a variant of the print command, printf,
	# which formats its arguments according to a template string
	printf ("T minus %d seconds and counting ...\n", 10-$countup);
	sleep 1;	# Perl command that causes program to pause for 1 second
	++$countup;	# increment count by one
}
print "BLAST OFF!\n";


Listing 12.8 The Camel Countdown using do.
#!/usr/local/bin/perl
$countdown = 10;
do {
	print "T minus $countdown seconds and counting ...\n";
	sleep 1;	# Perl command that causes program to pause for 1 second
} until (--$countdown == 0);
print "BLAST OFF!\n";

Another looping construct, the for statement, has the form:

for (EXPR1; EXPR2; EXPR3) {
	# code to repeat goes here
}

Here, the loop is executed while EXPR2 is true. EXPR1 can be used to initialize a counter (e.g., $i = 0), while EXPR3 performs an operation (e.g., $i++). Together they look like this:


Listing 12.9 The Camel Countdown using for.
#!/usr/local/bin/perl
for ( $i=10; $i>0; $i-- ) {	# repeats ten times, with $i=10..1
	print "T minus $i seconds and counting ...\n";
	sleep 1;		# Perl command that causes program to pause for 1 second
}
print "BLAST OFF!\n";

A somewhat different tool for iterating code is foreach, which has the following syntax:

foreach VAR (ARRAY) {
	# code to repeat goes here
}

This construct iterates once for each element in ARRAY. VAR takes the value of the current element. No condition is tested.

foreach is useful for printing out associative arrays, as shown here:

print "If you like your java ...\n";
foreach $strength (keys %java) {
     print "$strength, try $java{ $strength }\n";
}

Recall that the keys of hash %java are adjectives like "medium", and that the values are types of coffee, like "kenya". This example uses the Perl function keys() to assign each key in turn to the scalar $strength, which is used iteratively to print output until the hash is run through.

Loops are often used in combination with input and output operations, discussed next.

Simple File Operations in Perl

How does a Perl script read from standard input? Very simply, it turns out. Every Perl process comes with pre-defined filehandles STDIN, STDOUT, and STDERR. To read from STDIN, use the following syntax:

$line = <STDIN>;      # reads one line of input

Alternatively, you can read a sequential set of lines by assigning <STDIN> to an array:

@lines = <STDIN>;      # reads all lines up to EOF (Ctrl-D on Unix, Ctrl-Z on DOS and Windows)

Perl is craftier than these simple assignments imply. Consider the following widely-used loop:

while (<STDIN>) {     # iterate until no more input
     chop;          	# chops terminating newline off input buffer
     # other commands go here
}

In this example, the while statement is used to repeat a sequence of commands as long as more input is available. The loop terminates when STDIN receives EOF.

Perl stores the lines it reads from input in a special, built-in variable, $_. By default, functions operate on $_ unless another variable is explicitly specified. For instance, the chop() function in the example removes the last character of the data passed to it; here, it is used to remove the trailing newline (\n) on the line of input to be processed. To chop the end off a different variable (say, $OneByteTooLong), you would write

chop( $OneByteTooLong );

Often, of course, you will need to work with files other than standard I/O. Use the Perl open command to open an arbitrary file. The syntax is:

open( FILEHANDLE,EXPR )

where EXPR specifies the filename, associated after opening with FILEHANDLE. EXPR can be a literal filename, such as /home/web/sparky/faq.html, or an expression that resolves to a filename. A leading > opens the file for writing, and >> opens it for appending:

open(HOSTS, ">hostfilename");     # opens file for writing
open(LOG, ">>logfile");     # opens file for appending

There is also a close command, but because Perl cleans up after itself, few programmers make use of it.
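
In practice it pays to check whether open succeeded, since a missing file or a permissions problem otherwise fails silently. Here is a sketch, assuming a hypothetical log file under /tmp; the special variable $! holds the operating system's error message.

```perl
# Sketch: append a line to a log file, then read the file back.
# The filename is hypothetical; any writable path will do.
my $logfile = "/tmp/chapter12_demo.log";

open(LOG, ">>$logfile") || die "Cannot append to $logfile: $!\n";
print LOG "script started\n";
close(LOG);                      # flushes buffered output to disk

open(LOG, $logfile) || die "Cannot read $logfile: $!\n";
my @lines = <LOG>;               # one array element per line
close(LOG);

print "Log has ", scalar(@lines), " line(s)\n";
```

The explicit close matters here: the second open reads the same file, and closing first ensures the appended line has actually been written.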

Perl Built-in Functions

Perl comes with a large set of functions that can be referenced without explicit operating system calls. Besides the savings in overhead this entails, Perl's built-in functions are highly optimized and sometimes outperform the equivalent system ones. (A good example is grep, the Unix pattern searching tool.)

By far the most important functions in Perl's arsenal are those related to pattern matching. Two, in particular, stand out: Match and Substitute.

The Match function, shown below, searches a string for the specified regular expression.

/REGEXP/;     # match function
m!REGEXP!;     # alternate form; any pair of delimiters permitted

What string is searched? By default, the string stored in Perl's magic placeholder, $_. The following example shows how to search a set of input lines for HTML comments, that is, text that begins with "<!--" and ends in "-->":

while (<STDIN>) {     # iterate until no more input
     chop;          	# chops terminating newline off input buffer
     if (/<!--\s*(.+)\s*-->/) {     # find HTML comments
          print "Comment: $1\n";
     }
}

The Substitute function, which extends Match to Match-and-replace, looks like this:

s/REGEXP/REPLACEMENT/;     # string is searched for REGEXP, which, if found, is replaced

Like Match, Substitute operates by default on the magic variable $_. As an example, say you wanted to precede all the HTML comments in a web page with a set of author's initials, gb. You could use the substitute function as follows:

while (<STDIN>) {     # iterate until no more input
     chop;          	# chops terminating newline off input buffer
     s/(<!--\s*)(.+)(\s*-->)/\1 gb: \2\3/;     # insert initials after the comment opener
     print "$_\n";     # write out the edited line
}

This example illustrates another powerful feature of Perl pattern matching: backreference. Enclosing parts of a pattern in parentheses causes Perl to memorize the text they match under the labels \1, \2, ... , up to \9. In the example above, "<!--" (with any whitespace that follows it) is found and memorized as \1; the comment text, represented by the regular expression ".+", is found and stored as \2; and the trailing "-->" is found and memorized as \3. These references are then used in the replacement expression to insert the initials gb. (In the replacement part, Perl also accepts $1, $2, and so on, which is the form Perl 5 prefers.)

What if you need to apply the match or substitute functions to a variable other than $_? You could set $_ equal to the target variable, but Perl has an easier way. The symbol "=~", called the pattern binding operator, applies a Perl pattern-matching function to any variable. Here's how it's used:

$url =~ s/(que)(\.com)/\1\.mcp\2/;     # uses regexp backreference

This example finds the first occurrence of the string "que.com" in the variable $url and replaces it with "que.mcp.com" using backreference.
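The binding operator works with the Match function as well as with Substitute. The sketch below applies both to a variable other than $_. The sample URL is a hypothetical value, and the substitution is the same edit as above, written with $1 and $2 in the replacement.

```perl
#!/usr/bin/perl
use strict;

# Hypothetical sample data for illustration.
my $url = "http://www.que.com/intranet/index.html";

# Match: capture the host name between "//" and the next "/".
my $host = "";
if ($url =~ m!^http://([^/]+)/!) {
    $host = $1;
}

# Substitute: copy $url first, then edit the copy so $url is left intact.
(my $new_url = $url) =~ s/(que)(\.com)/$1.mcp$2/;

print "host: $host\n";
print "new:  $new_url\n";
```

Copying the string before substituting is a common idiom when the original value is still needed later in the script.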

That's just scratching the surface of this rich language. Besides pattern-matching functions, Perl offers a respectable set of commands including mathematical, string manipulating, I/O, system interaction, and networking functions. Systems having DBM (a Unix database engine) can take advantage of Perl's data manipulation functions as well.

Perl CGI Scripts Revisited

At this point, you have the basics of Perl and CGI scripting under your belt. That means you're ready to appreciate the gory details of real-world gateway programs written in Perl.

In this section, you'll learn to process forms and generate Web pages on the fly using Perl.

Form-Processing Scripts

The aim of HTML form processing is to recover the name/value pairs entered at the browser and submitted to a script, and to process this data appropriately.

The first step toward recovering the name/value pairs is decoding the input stream. This is necessary because a form sends data not in plain text, but in a format called URL encoding, required to ensure all characters transfer properly over the network. (Without encoding, certain message characters might masquerade as network control characters, garbling the transfer.)

NOTE
You can see the URL-encoded output of a form by changing the form's ACTION attribute to a mailto URL that points to your e-mail address. On submitting the form, you'll be sent a mail message like this:

ThisForm=Survey&Name=Max+Headroom&Age=None

Here are the rules for URL-encoding a form's name/value pairs:

Decoding form input amounts to reversing these steps. Perl regular expressions make short work of such manipulations, if you remember how to use them. Fortunately, script libraries exist on the Web to solve this and other routine problems. One good library for HTML form processing is CGI-LIB.PL.
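To give a rough idea of what such a library does internally, here is a minimal decoding sketch. It assumes the standard URL-encoding conventions: pairs are joined with "&", each name is joined to its value with "=", spaces are sent as "+", and other special characters are sent as "%" followed by two hexadecimal digits. The subroutine name decode_form and the extra Note field are hypothetical; this is not the code used by any particular library.

```perl
#!/usr/bin/perl
use strict;

# Hypothetical helper: turn a URL-encoded string into a hash of name/value pairs.
sub decode_form {
    my ($query) = @_;
    my %fields;
    foreach my $pair (split(/&/, $query)) {
        my ($name, $value) = split(/=/, $pair, 2);
        $value = "" unless defined $value;
        foreach ($name, $value) {                       # $_ aliases each in turn
            tr/+/ /;                                    # "+" encodes a space
            s/%([0-9A-Fa-f]{2})/pack("C", hex($1))/ge;  # "%XX" encodes one byte
        }
        $fields{$name} = $value;
    }
    return %fields;
}

my %in = decode_form("ThisForm=Survey&Name=Max+Headroom&Age=None&Note=50%25+off");
foreach my $key (sort keys %in) {
    print "$key = $in{$key}\n";
}
```

The "+" translation has to happen before the "%XX" step; otherwise a literal plus sign sent as "%2B" would be turned into a space.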

NOTE
CGI-LIB.PL is Copyright 1994 by Steven E. Brenner. You'll find additional info at http://www.bio.cam.ac.uk/web/form.html or http://www.seas.upenn.edu/~mengwong/forms.

To include foreign code such as a subroutine library in your Perl script, use the require statement. Add a line at the top of your script (but below any leading comments) as follows:

require 'cgi-lib.pl';

The package contains several subroutines, the most important of which is the &ReadParse routine. According to the library code:

# ReadParse
# Reads in GET or POST data, converts it to unescaped text, and puts
# one key=value in each member of the list "@in"
# Also creates key/value pairs in %in, using '\0' to separate multiple
# selections

In other words, &ReadParse handles form data transmitted with either the GET or POST method, takes care of URL decoding, and parses the form's name/value pairs into a Perl associative array (%in). Once &ReadParse is called, therefore, field names from the form can be used as keys to look up the corresponding content.

Other CGI-LIB.PL routines determine whether GET or POST is being used, insert a TEXT/HTML content-type header, perform CGI error handling, and print out a listing of name/value pairs retrieved from the form.

There are some remarkably accomplished Perl scripts on the Web, most available cost-free. Look to the following URLs for valuable routines and a glimpse at the power of expert Perl.

URL Site                                             Scripts Offered
http://www.perl.com/perl/index.html                  The closest thing to an official Perl Web site. You'll find the Comprehensive Perl Archive Network (CPAN) here, plus FAQs, USENET links, and all known ports of Perl itself.
http://www-genome.wi.mit.edu/ftp/pub/software/WWW/   Lincoln Stein's superb collection of Perl 5 modules for CGI processing.
http://worldwidemart.com/scripts/                    Matt's Script Archive offers simple, configurable back-ends for mailing form results, keeping a GuestBook, and hosting on-line discussion forums.

Generating HTML on-the-Fly

In Listing 12.3, you saw a brief Web page generated by a CGI script. Creating more complex pages is really no different, but you need to be aware of a few Perlisms.

Suppose you want to embellish the page shown in Figure 12.1 with a graphic or two. The HTML to do this is straightforward (assuming you know the URLs of the desired elements). But wait. The tag for placing an in-line image contains double-quote marks, as follows:

<IMG SRC="image_file" ALT="image_description">

Figure 12.1 : HTML response generated by the Perl script of Example 3.

What do you think happens if we generate this line with the Perl code shown below?

print "<img src="pix/goldline.gif" alt="golden line">\n";

What happens is that Perl chokes on excess quote marks. The interpreter can't distinguish between those to be printed and those that specify the print list.

To include quote marks or other characters special to Perl in output generated on the fly, you must escape each offending character by preceding it with a backslash ("\"). Within a double-quoted string passed to print, the characters that require escaping are the double-quote mark, the dollar sign ("$"), the at-sign ("@"), and the backslash character itself. Single-quote marks and percent signs ("%") are safe inside double quotes, and square brackets ("[]") need escaping only when they directly follow a variable name, where Perl would otherwise read them as an array subscript. A CGI script that prints such characters might look like this:


Listing 12.10 CGI script that returns HTML containing characters special to Perl.
#!/usr/local/bin/perl
# Yet Another Perl script
require 'cgi-lib.pl';

print &PrintHeader;     # PrintHeader returns the content-type header; print sends it
print "<html><head><title>Perl Example 4: On-the-Fly Web Page</title></head>\n";
print "<body><h1>The Message You've Been Waiting For:</h1>\n";
# Escaped quote marks in next line:
print "<h3 align=\"center\">Hello, World!</h3>\n";
print "<pre>\n\n\n\n\n<center>\n";
# Escaped quote marks in next three lines:
print "<img src=\"pix/goldline.gif\" alt=\"gold line\">\n";
print "<img src=\"pix/westhemi.gif\" alt=\"World (western hemisphere)\">\n";
print "<img src=\"pix/goldline.gif\" alt=\"gold line\">\n";
print "\n\n\n\n\n</center></pre>\n";
# Escaped at-sign in next line:
print "<hr>Your webmaster,\n<address>gbenett\@innergy.com</address>\n";
print "</body></html>\n";

Figure 12.2 shows how Netscape Navigator displays the resulting page.

Figure 12.2 : HTML page generated on the fly by Perl Example 4.
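When a page contains many quote marks, the backslashes quickly become hard to read. One possible alternative, sketched below, is to use Perl 5's qq{} quoting operator or a here-document, both of which behave like double-quoted strings but do not end at an embedded double-quote mark. The variable names and the example.com address are hypothetical.

```perl
#!/usr/bin/perl
use strict;

# Hypothetical image path and alt text, matching the style of the listing above.
my $src = "pix/goldline.gif";
my $alt = "gold line";

# qq{...} interpolates variables like a double-quoted string,
# but embedded double quotes need no backslashes.
my $tag = qq{<img src="$src" alt="$alt">\n};

# A here-document does the same for multi-line output.
# The at-sign must still be escaped, because arrays interpolate here too.
my $footer = <<"END_HTML";
<hr>Your webmaster,
<address>webmaster\@example.com</address>
END_HTML

print $tag, $footer;
```

The dollar sign, at-sign, and backslash keep their special meaning in both forms, so they still need escaping when they are meant literally.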

Server-Side Includes

An include is a bit of code or text inserted in the body of a document as it is being processed. Programmers have been using includes for decades to modularize their code. Server-side includes (SSI) are a Web server feature that enables HTML authors to embed executable commands in their Web pages. SSI commands execute on the server.

NOTE
The essential reference for SSI is at NCSA's Web site http://hoohoo.ncsa.uiuc.edu/docs/tutorials/includes.html.

SSI is the simplest form of interaction with a Web server. It makes possible, with single lines of self-explanatory code, effects much trickier to achieve with CGI scripts. For instance, the following code snippet causes the current date/time to be displayed on a Web page:

<p>
It's <!--#echo var="DATE_LOCAL" --> in server land.
</p>

Those familiar with HTML will recognize what looks like a comment (anything set off with <!--comment here-->) in the middle of the text. That's all there is to SSI. Insert one of the six allowable commands from the SSI specification, preceded by '#', into an HTML comment field. Moreover, you can reference the full set of CGI environment variables, plus an extended SSI environment, which includes DATE_LOCAL and other convenient system data. The complete list of environment options, with descriptions, is posted on NCSA's Web site.

This seems like a pretty good way to identify a visitor's IP address, report the current time, or check for a certain browser version. And it is. But as with all neat tricks, there's a catch.

Two catches, actually. First, SSI places a heavy load on the Web server, which must scan each document for SSI code before sending it. Second, it's the least secure of all the interactive modes: SSI in effect enables users to run embedded code on the server without restriction.

To mitigate these risks, servers like NCSA HTTPd provide SSI-specific configuration options for administrators. One such option makes it possible to disable the most dangerous SSI command, "exec". (This is the one that lets users run loose.) Administrators can also enable SSI in some directories and not others, permitting only trusted users to create Web pages with executable code included. (SSI code in an unauthorized Web page is ignored.) Finally, to alleviate the parsing burden on the server, a configuration option exists to define which files should be scanned for SSI code. The following line, for example, tells the Web server to parse only files whose names end in .shtml:

AddType text/x-server-parsed-html .shtml

The NCSA documentation contains additional details.

Script Security

CGI scripting and the easy programming interface it provides are among the most attractive features of an intranet. Unfortunately, they're also the greatest contributors to security risk in a Web-based network.

The problem lies not with the Common Gateway Interface itself, but with the power it gives to CGI script authors and, potentially, to users. The burden of establishing secure CGI guidelines falls on the server administrator. This section tells you what to look out for.

TIP
Remove shells and interpreters from the server that you don't intend to use. For example, if you don't run Perl-based CGI scripts, remove the Perl interpreter.

There are two types of risks associated with scripts. One is the inadvertent disclosure of server information, such as password or registry files, that could be used to further subvert security measures. The other is the potential that users can spoof the script into doing something perverse, like executing system commands.

You can take several precautions to lower the risk of running CGI scripts on your Web server:

  1. Keep all CGI scripts in a single directory (e.g., /cgi-bin) that only the Web administrator can write to.
  2. If possible, use compiled executables rather than Perl scripts, and avoid shell scripts altogether for CGI processing. (This includes *.BAT programs on NT-based servers.)
  3. Never trust input data. Variables populated from a Web form can contain strings that break unwary scripts, causing them to execute unauthorized operations. In Perl, data of unknown pedigree is called "tainted." If the variable $SCARY contains tainted data, for instance, the following routine can be used to create an "untainted" copy in the variable $COOL:
$SCARY =~ /^([\w.]*)$/ || die "Unsafe input";	# accepts only letters, digits, underscores, and dots
$COOL = $1;			# use $COOL for remainder of program
  4. If you must allow non-alphanumeric characters, here's a filter to escape potential metacharacters:
s/([;<>\*\|`&\$!#\(\)\[\]\{\}:'"])/\\$1/g;
TIP
If you're using Perl 5, test your scripts using perl -T to invoke the "taint" checking option.

  5. You may wish to investigate safecgiperl, a modified interpreter being developed by Malcolm Beattie to enable secure Perl execution. Visit http://users.ox.ac.uk/~mbeattie/perl.html to learn more.
  6. Check out the following Web sites for additional details on safe scripting:

Document Title            URL
The WWW Security FAQ      http://www-genome.wi.mit.edu/WWW/faqs/www-security-faq.html
Safe CGI Programming      http://www.cerf.net/~paulp/cgi-security/safe-cgi.txt
CGI Security Tutorial     http://csclub.uwaterloo.ca/u/mlvanbie/cgisec/
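The untainting step from the list of precautions can be wrapped in a small subroutine that refuses anything it does not recognize. The following is a sketch under stated assumptions: the subroutine name untaint_name and the sample inputs are hypothetical, and the pattern is the same letters, digits, underscores, and dots filter shown earlier, except that it also rejects empty input.

```perl
#!/usr/bin/perl
use strict;

# Hypothetical helper: return a checked copy of $value, or undef if it
# contains anything other than word characters and dots.
sub untaint_name {
    my ($value) = @_;
    return undef unless defined $value;
    if ($value =~ /^([\w.]+)$/) {
        return $1;       # text captured by a pattern match is treated as untainted
    }
    return undef;
}

foreach my $input ("report.txt", "report.txt; rm -rf /") {
    my $clean = untaint_name($input);
    if (defined $clean) {
        print "accepted: $clean\n";
    } else {
        print "rejected: $input\n";
    }
}
```

A pattern this loose still accepts names such as "..", so a script that builds file paths from form input would want a stricter check than the one sketched here.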