Chapter 7

Strings and Patterns




This chapter covers some of the most important features of Perl: its string- and pattern-manipulation routines. Most of the Perl programming you do will involve strings in one form or another. It's very important to learn how to use the string search and replace operations efficiently in Perl. An inefficient search pattern can slow a script down to a crawl.

Basic String Operations

Let's start with the basic operations in Perl for working with strings. Some of this chapter revisits material from Chapters 2 through 5; with that background in place, it's time to cover the topic in detail. The sections that follow describe the string utility functions one at a time.

The chop() and length() Functions

To find the length of a string, call length($str), which returns the number of characters in the string. The chop function removes the last character of a string, whatever that character is. This is useful for removing the trailing newline from a user-entered string. For an example, see Listing 7.1.


Listing 7.1. Using length() and chop().
#!/usr/bin/perl

$input = <STDIN> ;

$len = length($input);
print "\nLength = $len of $input before the chopping \n";
chop($input);
$len = length($input);
print "\nLength = $len of $input after the chopping \n";

Here is a sample run of Listing 7.1:

$ 7_1.pl
Hello! I am a Test!

Length = 20 of Hello! I am a Test! before the chopping

Length = 19 of Hello! I am a Test! after the chopping

Handling the Case in Strings

Perl provides four functions to make your life easier when handling the case of characters in a string:

lc($string) Converts a string to lowercase
uc($string) Converts a string to uppercase
lcfirst($string) Converts the first character of a string to lowercase
ucfirst($string) Converts the first character of a string to uppercase

Listing 7.2 presents sample code that illustrates how these functions work.


Listing 7.2. Using the case functions.
#!/usr/bin/perl
$name = "tis a test OF THE sYSTEm" ;
$ucase = uc($name);
$lcase = lc($name);

print "$name  \n";
print "$ucase \n";
print "$lcase \n";

$nice = lcfirst($ucase);
print "lcfirst on $ucase = \n\t $nice \n";

$crooked = ucfirst($lcase);
print "ucfirst on $lcase = \n\t$crooked \n";

Here is the output from Listing 7.2.

tis a test OF THE sYSTEm
TIS A TEST OF THE SYSTEM
tis a test of the system
lcfirst on TIS A TEST OF THE SYSTEM =
     tIS A TEST OF THE SYSTEM
ucfirst on tis a test of the system =
    Tis a test of the system

Joining Strings Together

The dot operator is great for connecting strings. For example, the following statements will print John Hancock plus a new line:

$first="John";
$last ="Hancock";
print $first . " " . $last . "\n" ;

To print the elements of an array in a string, you can use the join function to create one long string. Here's the syntax for the join function:

join ($joinstr, @list);

The $joinstr variable is the string to use when connecting the elements of @list together. Refer to the following statements to see how to create a line for the /etc/passwd file:

@list = ("khusain","sdfsdew422dxd","501","100",
        "Kamran Husain","/home/khusain","/bin/bash");
$passwdEntry = join (":", @list);

Printing Formatted Numbers

Perl provides two functions, printf and sprintf, that behave like the printf family of functions in the C programming language. The printf function sends its output to the currently selected filehandle (STDOUT by default). The sprintf function takes a format string followed by any other arguments and, unlike its C counterpart, returns the formatted string instead of writing it into its first argument. For example, the string $answer contains the result of the sprintf statement:

$a= 10;
$b= 10;
$answer = sprintf("%d + %d is %d and in hex %x", $a,$b,$a+$b,$a+$b);
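
As a minimal sketch of how the two functions differ, the following lines assign the string returned by sprintf and then print a value directly with printf. The variable names $x and $y are illustrative only.

```perl
use strict;
use warnings;

my $x = 10;
my $y = 10;

# sprintf returns the formatted string; it prints nothing by itself.
my $answer = sprintf("%d + %d is %d and in hex %x", $x, $y, $x + $y, $x + $y);
print "$answer\n";             # 10 + 10 is 20 and in hex 14

# printf formats and prints in one step.
printf("%5.2f\n", 3.14159);    # prints " 3.14"
```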

Finding Substrings

A quick way to find the location of a substring in a string is to use the index function, which searches from left to right. To search from right to left, use the rindex function. Here's the syntax for these functions:

$position = index ($string, $substring, [$offset]);
$position = rindex ($string, $substring, [$offset]);

$string is the character string in which to search for $substring. Both functions return the zero-based position of the match, or -1 if $substring is not found. The $offset parameter is optional; when it is omitted, index starts at the beginning of the string and rindex starts at the end. Listing 7.3 is a program that looks for the position of a word, entered by the user, in each line of an input file (just like grep would, except that we print out the position of the word, too).


Listing 7.3. Using the index and rindex functions.
#!/usr/bin/perl

%finds = ();
$line  = 0;

print "\n Enter word to search for:";
$word = <STDIN>;
chop ($word);

print "\n Enter file to search in:";
$fname = <STDIN>;
chop($fname);
open (IFILE, $fname) || die "Cannot open $fname $!\n";

while (<IFILE>) {
    $line++;                        # count lines starting from 1
    $position = index($_,$word);    # -1 means the word is not on this line
    if ($position >= 0) {
        $finds{"$line"} = $position;
    }
}
close IFILE;
while(($key,$value) = each(%finds)) {
    print " Line $key : $value \n";
}

This program searches each line of the file specified by the user for the first occurrence of the word. If the word is found, the program records the line number and the column at which the word starts. The first while loop searches the file, and the second while loop lists all the items collected in the %finds associative array.

Listing 7.3 finds only the first occurrence of a pattern in a line. You can use the offset argument to search for a pattern other than from the start. The offset argument is specified from 0 and up. Listing 7.4 presents another search program that finds more than one occurrence on a line.


Listing 7.4. Searching more than once.
#!/usr/bin/perl

$fname = "news.txt";
$word = "the";
open (IFILE, $fname) || die "Cannot open $fname $!\n";

print "Search for :$word: \n";
print "\nLn : Column\n";
while (<IFILE>) {
    $thispos = 0;
    while (1) {
        $nextpos = index($_,$word,$thispos);
        last if ($nextpos == -1);
        # $. holds the number of the line just read from IFILE
        print " $. : $nextpos \n";
        # resume the search one character past this match
        $thispos = $nextpos + 1;
    }
}
close IFILE;

The output of Listing 7.4 on a sample file would be something like this:

Ln : Column
 1 : 31
 2 : 54
 3 : 38
 4 : 53
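
The offset technique can also be tried on a single string without an input file. The following is a small sketch, not part of the listings; the sample text and the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $text = "the cat and the other theory";   # illustrative sample text
my $word = "the";
my @columns;
my $from = 0;

# Keep calling index, each time starting one character past the last match.
while ((my $pos = index($text, $word, $from)) != -1) {
    push @columns, $pos;
    $from = $pos + 1;
}
print "Found '$word' at: @columns\n";        # Found 'the' at: 0 12 17 22

# rindex searches from the right, so it reports the last occurrence.
my $last = rindex($text, $word);
print "Last occurrence at: $last\n";         # Last occurrence at: 22
```

Note that index matches substrings, so the "the" inside "other" and "theory" is reported too.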

The substr Function

The substr function is used to extract parts of a string from other strings. Here's the syntax for this function:

substr ($master, $offset, $length);

$master is the string from which a substring is to be copied, starting at the index specified at $offset and up to $length characters. Listing 7.5 illustrates the use of this function.


Listing 7.5. Using the substr function.
#!/usr/bin/perl
#  Check out the substr function.
#
$quote = "No man but a blockhead ever wrote except for money";
#  quote by Samuel Johnson

$sub[0] = substr ($quote, 9, 6);

$name = "blockhead" ;
$pos = index($quote,$name);
$len = length($name);
$sub[1] = substr ($quote, $pos, $len);
$pos = index($quote,"wrote");
$sub[2] = substr ($quote, $pos, 6);

for ($i = 0; $i < 3; $i++) {
    print "\$sub[$i] is \"" .  $sub[$i] . "\" \n";
}

#
# To replace a string, let's try substr on the left-hand side.
#
# Replace the words 'a blockhead', with the words 'an altruist'.
# (Sorry Sam.)
$name = "a blockhead" ;
$pos = index($quote,$name);
$len = length($name);

substr ($quote, $pos, $len) = "an altruist";
print "After substr = $quote \n";

The output from the code in Listing 7.5 is as follows:

$sub[0] is "t a bl"
$sub[1] is "blockhead"
$sub[2] is "wrote "

After substr = No man but an altruist ever wrote except for money

You can see how the substr operator can be used to extract values from another string. Basically, you tell the substr function how many characters you need and from where, and the chopped off portion is returned from the function.

The substr function can also be used to make substitutions within a string. In this listing, the words "a blockhead" are replaced by "an altruist". The part of the string specified by substr is replaced by the value appearing to the right of the assignment operator. Here's the syntax for these calls to substr:

substr ($master, $offset, $length) = $newStr;

$master must be a string that can be written to (that is, not a tied variable; see Chapter 6, "Binding Variables to Objects," for information on using tie() on variables). $offset is where the substitution begins, and it replaces up to $length characters. The region being replaced must lie within the existing string. The $newStr variable can be the empty string if you want to remove the substring at the offset. To substitute the tail end of the string starting from the offset, do not specify the $length argument.

For example, this line:

substr ($quote, $pos) = "an altruist";

prints the following line in the previous example:

After substr = No man but an altruist

The offset can be a negative number to specify counting from the right side of the string. For example, the following line replaces three characters, starting at the fifth character from the right end of $quote, with the word "cash":

substr($quote, -5, 3) = "cash";

The substr function is great for cut-and-paste operations on strings whose layout you know in advance. For more general strings, you have to work with patterns that can be described using regular expressions. If you are familiar with the grep command in UNIX, you already know about regular expressions. Basically, a regular expression is a way of specifying strings like "all words beginning with the letter a" or "all strings with an xy in the middle somewhere." The next section illustrates how Perl can help make these types of search and replace operations easier.

String Searching with Patterns

Perl enables you to match patterns within strings with the =~ operator. To see whether a string has a certain pattern in it, you use the following syntax:

$result = $variable =~ /pattern/

The value $result is true if the pattern is found in $variable. To check whether a string does not have a pattern, you have to use the !~ operator, like this:

$result = $variable !~ /pattern/

Listing 7.6 shows how to match strings literally. It prints a message if the string Apple, apple, or Orange is found, or if the strings Grape and grape are not found.


Listing 7.6. Matching with patterns.
#!/usr/bin/perl

$input = <STDIN> ;
chop($input);
print "Orange found! \n" if ( $input =~ /Orange/ );
print "Apple found! \n" if (  $input =~ /[Aa]pple/ );
print "Grape not found! \n" if ( $input !~ /[Gg]rape/ );

So, how did you search for apple and Apple in one statement? This involves specifying a pattern to the search string. The syntax for the =~ operator is this:

[$variable =~] [m]/PATTERN/[i][o][g]

$variable is searched for the pattern in PATTERN; if you omit the $variable =~ part, Perl searches the default variable $_. The leading m is optional when the delimiters are slashes. The i specifies a case-insensitive search. The g is used as an iterator to search more than once on the same string. The o tells Perl to compile the pattern only once, even if it contains variables. I cover all these options shortly.

Let's look at how the patterns in PATTERN are defined. If you are already familiar with the grep utility in UNIX, you are familiar with patterns.

A character in PATTERN is matched verbatim. For example, /Orange/ matches only the string Orange. To match any single character other than a newline, use the dot (.) operator. For example, to match Hat or Cat, you would use the pattern:

/.at/

This also matches Bat, hat, Mat, and so on. If you just want to get Cat and Hat, you can use a character class using the square brackets ([]). For example, the pattern

/[CH]at/

will match Cat or Hat, but not cat, hat, bat, and so on. The characters in a class are case sensitive. So to allow the lowercase versions, you would use the pattern:

/[cChH]at/

It's cumbersome to list a lot of characters in the [] class, so the dash (-) operator can define a range of characters to use. These two statements look for a digit:

/[0-9]/
/[0123456789]/

The [] operator can be used with other items in the pattern. Consider these sample patterns; the first two do the same thing:

/a[0123456789]/ # matches a, followed by any digit
/a[0-9]/        # matches a, followed by any digit
/[a-zA-Z]/      # matches any letter of the alphabet

The range [a-z] matches any lowercase letter, and the range [A-Z] matches any uppercase letter. The following pattern matches aA, bX, and so on:

/[a-z][A-Z]/

To match a run of three letters, it would be very cumbersome to write something like this:

/[a-zA-Z][a-zA-Z][a-zA-Z]/

This is where the special characters in Perl pattern searching come into play.

Special Characters in Perl Pattern Searches

This section covers the special characters you can use in search patterns: the quantifiers +, *, and ?, the escape sequences for nonprinting characters, and the anchors that tie a match to a position in the string.

The plus (+) character specifies "one or more of the preceding character." Patterns containing + always try to match as many characters as they can. For example, the pattern /ka+/ matches any of these strings:

kamran        # returns "ka"
kaamran       # returns "kaa"
kaaaamran     # returns "kaaaa"

Another way to use the + operator is for matching more than one space. For example, Listing 7.7 takes an input line and splits the words into an array. Items in the array generated by this code will not include any items generated by matching more than one consecutive space. The match / +/ specifies "one or more space(s)."


Listing 7.7. Using the pattern matching + operator.
#!/usr/bin/perl
$input = <STDIN>;
chop ($input);
@words = split (/ +/, $input);
foreach $i (@words) {
    print $i . "\n";
}

If you do not use the + sign to signify more than one space in the pattern, you'll wind up with an array item for each white space that immediately follows a white space. The pattern / / specifies the start of a new word as soon as it sees a white space. If there are two spaces together, the next white space will trigger the start of a new word. By using the + sign, you are saying "one or more white space together" is the start of a new word.

Tip
If you are going to repeatedly search one scalar variable, call the study() function on the scalar. The syntax is study ($scalar);. Only one variable can be used with study() at one time.

The asterisk (*) special character matches zero or more occurrences of any preceding character. The asterisk can also be used with the [] classes:

/9*/    # matches an empty word, 9, 99, 999, ... and so on
/79*/   # matches 7, 79, 799, 7999, ... and so on
/ab*/   # matches a, ab, abb, abbb, ... and so on

Because the asterisk matches zero or more occurrences, the pattern

/[0-9]*/

will match a number or an empty line! So do not confuse the asterisk with the plus operator. Consider this statement:

@words = split (/[\t\n ]*/, $list);

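This matches zero or more occurrences of the space, newline, or tab character. Because zero occurrences is a legal match, the pattern also matches the empty string between any two adjacent characters, so split breaks the string at every character. You'll wind up with an array of strings, each of them one character long, containing all the non-white-space characters in the input line.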
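
To see the difference between the two quantifiers in split, here is a short sketch that applies both patterns to the same string. The sample string is made up for illustration.

```perl
use strict;
use warnings;

my $list = "a bc  d";    # illustrative input with one and two spaces

# With *, the separator may be empty, so every character is split off.
my @chars = split(/[\t\n ]*/, $list);

# With +, only runs of white space separate the fields.
my @words = split(/[\t\n ]+/, $list);

print scalar(@chars), " items: @chars\n";    # 4 items: a b c d
print scalar(@words), " items: @words\n";    # 3 items: a bc d
```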

The ? character matches zero or one occurrence of the preceding character. For example, the following pattern will match Apple or Aple, but not Appple:

/App?le/

Let's look at a sample program that uses patterns to count the scalars, hashes, arrays, and possible file handles in its input. The code in Listing 7.8 will be enhanced in the next two sections. For the moment, let's use the code in Listing 7.8 to see how the asterisk operator works in pattern matches.


Listing 7.8. Using the asterisk operator.
 1 #!/usr/bin/perl
 2 # We will finish this program in the next section.
 3 $scalars =  0;
 4 $hashes =  0;
 5 $arrays =  0;
 6 $handles =  0;
 7
 8 while (<STDIN>) {
 9     @words = split (/[\(\)\t ]+/);
10     foreach $token (@words) {
11     if ($token =~ /\$[_a-zA-Z][_0-9a-zA-Z]*/) {
12               # print ("$token is a legal scalar variable\n");
13         $scalars++;
14     } elsif ($token =~ /@[_a-zA-Z][_0-9a-zA-Z]*/) {
15               # print ("$token is a legal array variable\n");
16         $arrays++;
17     } elsif ($token =~ /%[_a-zA-Z][_0-9A-Z]*/) {
18               # print ("$token is a legal hash variable\n");
19         $hashes++;
20     } elsif ($token =~ /\<[A-Z][_0-9A-Z]*\>/) {
21               # print ("$token is probably a file handle\n");
22         $handles++;
23     }
24    }
25 }
26
27 print " This file used scalars $scalars times\n";
28 print " This file used arrays  $arrays  times\n";
29 print " This file used hashes $hashes times\n";
30 print " This file used handles $handles times\n";

Lines 9 and 10 split the incoming stream into words. Note how the pattern in line 9 splits words at spaces, tabs, and parentheses. At line 11, we are looking for a word that contains a $ followed by a letter or underscore and then zero or more alphanumeric characters or underscores.

At lines 14 and 17, the same pattern is applied, except that an at (@) sign and a percent (%) sign are looked for instead of a dollar ($) sign in order to search for arrays and hashes, respectively. At line 20, the file handle is assumed to be a word in all caps between angle brackets, not starting with an underscore or a digit.

Because these patterns are not anchored, they match a legal name anywhere in a word. However, we want the search to be limited to word boundaries. For example, right now the script cannot distinguish between the following three lines of input because they all match /\$[_a-zA-Z][_0-9a-zA-Z]*/ somewhere in them:

$catacomb
OBJ::$catacomb
#$catacomb#

A literal space in a pattern does not match tabs, newlines, and so on. Here are the escape sequences to use in pattern matching to signify these and other special characters:

\t Tab
\n Newline
\r Carriage return
\f Form feed
\\ Backslash (\)
\Q and \E Start and end of literally quoted text

In general, you can escape any special character in a pattern with the backslash (\). The backslash itself is escaped with another backslash. The \Q and \E characters are used in Perl to delimit the interpretation of any special characters. When the Perl interpreter sees \Q, every character following \Q is not interpreted and is used literally until the pattern terminates or Perl sees \E. Here are a few examples:

/\Q^Section$/ # match the string "^Section$" literally.
/^Section$/   # match a line with the solitary word Section in it.
/\Q^Section\E$/ # match a line which ends with ^Section
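
The effect of \Q and \E is easiest to see when the text to search for comes from a variable. The following sketch assumes a hypothetical search string, $needle, that happens to contain the ^ metacharacter; the sample lines are made up.

```perl
use strict;
use warnings;

# $needle is a hypothetical search string containing a metacharacter.
my $needle = '^Section';
my @lines  = ('Intro ^Section', 'Section', '^Section one');

# Without \Q, the ^ that comes from $needle acts as the start anchor.
my @anchored = grep { /$needle/ } @lines;

# With \Q...\E, the ^ is literal; the $ after \E is still the end anchor.
my @literal = grep { /\Q$needle\E$/ } @lines;

print "anchored: @anchored\n";    # anchored: Section
print "literal:  @literal\n";     # literal:  Intro ^Section
```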

To further clarify where a match begins and ends, you can use these anchors:

\A Match at beginning of string only
\Z Match at end of string only (or just before a trailing newline)
\b Match on word boundary
\B Match inside word

Here are some examples and how they are interpreted given a string with the word hello in it somewhere:

/\Ahel/     # match only if the first three characters are "hel"
/llo\Z/     # match only if the last three characters are "llo"
/llo$/      # matches only if the last three characters are "llo"
/\Ahello\Z/ # same as /^hello$/ unless doing multiple line matching
/\bhello/   # matches "hello" and "hello.", but not "Othello"
/\bhello/   # also matches "$hello" because $ is not part of a word
/hello\b/   # matches "hello", "Othello", and "hello.", but not "hellos"
/\bhello\b/ # matches "hello" and "hello.", but not "Othello" or "hellos"

A "word" for use with these anchors is assumed to contain letters, digits, and underscore characters. No other characters, such as the tilde (~), hash (#), or exclamation point (!) are part of the word. Therefore, the pattern /\bhello/ will match the string "$hello", because $ is not part of a word.

The \B pattern anchor takes the opposite action to that of \b. It matches only at a position that is not a word boundary, that is, between two word characters or between two non-word characters. For example, the pattern below:

/\Bhello/

matches "Othello" but not "hello", "$hello", or "hello." because in the last three the h starts a word. Whereas the pattern here:

/hello\B/

matches "hellos" but not "hello", "Othello", "$hello", or "hello." because in those four the final o ends a word. Finally, this pattern

/\Bhello\B/

matches "Othellos" but not "hello", "Othello", "$hello", or "hello.".
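
Word boundaries are easy to get wrong, so it is worth checking them against real strings. The following sketch runs five of the anchor patterns against a handful of sample strings; the sample list (including "hellos") and the variable names are chosen for illustration.

```perl
use strict;
use warnings;

my @samples = ('hello', 'Othello', '$hello', 'hello.', 'hellos');

# Each entry pairs a printable label with the compiled pattern.
my @tests = (
    [ '\bhello',   qr/\bhello/   ],
    [ 'hello\b',   qr/hello\b/   ],
    [ '\bhello\b', qr/\bhello\b/ ],
    [ '\Bhello',   qr/\Bhello/   ],
    [ 'hello\B',   qr/hello\B/   ],
);

my %result;
for my $t (@tests) {
    my ($label, $re) = @$t;
    # Collect every sample string that the pattern matches.
    my @hits = grep { $_ =~ $re } @samples;
    $result{$label} = "@hits";
    print "/$label/ matches: @hits\n";
}
```

Running the sketch shows, for instance, that /\Bhello/ matches only Othello and that /hello\B/ matches only hellos.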

Listing 7.9 contains the code from Listing 7.8 with the addition of the new word boundary anchors.


Listing 7.9. Using the boundary characters.
#!/usr/bin/perl

$scalars =  0;
$hashes =  0;
$arrays =  0;
$handles =  0;

while (<STDIN>) {
    @words = split (/[\t ]+/);
    foreach $token (@words) {
        if ($token =~ /\$\b[a-zA-Z][_0-9a-zA-Z]*\b/) {
            # print ("$token is a legal scalar variable\n");
            $scalars++;
        } elsif ($token =~ /@\b[a-zA-Z][_0-9a-zA-Z]*\b/) {
            # print ("$token is a legal array variable\n");
            $arrays++;
        } elsif ($token =~ /%\b[a-zA-Z][_0-9a-zA-Z]*\b/) {
            # lowercase letters are allowed after the first character, too
            # print ("$token is a legal hash variable\n");
            $hashes++;
        } elsif ($token =~ /\<[A-Z][_0-9A-Z]*\>/) {
            # print ("$token is probably a file handle\n");
            $handles++;
        }
    }
}

print " This file used scalars $scalars times\n";
print " This file used arrays  $arrays  times\n";
print " This file used hashes $hashes times\n";
print " This file used handles $handles times\n";

Here is sample input and output for this program that takes an existing script file in test.txt and uses it as the input to the test.pl program.

$ cat test.txt
#!/usr/bin/perl

$input = <STDIN>;
chop ($input);

@words = split (/ +/, $input);
foreach $i (@words) {
    print " [$i] \n";
    }

$ test.pl  < test.txt
 This file used scalars 5 times
 This file used arrays  2  times
 This file used hashes 0 times
 This file used handles 1 times

Patterns do not have to be typed literally to be used in the / / search functions. You can also specify them from within variables. Listing 7.10 is a modification of Listing 7.9, which uses three variables to hold the patterns instead of specifying them in the if statement.


Listing 7.10. Using pattern matches in variables.
#!/usr/bin/perl

$scalars =  0;
$hashes =  0;
$arrays =  0;
$handles =  0;

# In double quotes each backslash is doubled, and $ and @ are escaped.
$sType = "\\\$\\b[a-zA-Z][_0-9a-zA-Z]*\\b";
$aType = "\@\\b[a-zA-Z][_0-9a-zA-Z]*\\b";
$hType = "%\\b[a-zA-Z][_0-9a-zA-Z]*\\b";

while (<STDIN>) {
    @words = split (/[\t ]+/);
    foreach $token (@words) {
        if ($token =~ /$sType/ ) {
            # print ("$token is a legal scalar variable\n");
            $scalars++;
        } elsif ($token =~ /$aType/ ) {
            # print ("$token is a legal array variable\n");
            $arrays++;
        } elsif ($token =~ /$hType/ ) {
            # print ("$token is a legal hash variable\n");
            $hashes++;
        } elsif ($token =~ /\<[A-Z][_0-9A-Z]*\>/) {
            # print ("$token is probably a file handle\n");
            $handles++;
        }
    }
}

print " This file used scalars $scalars times\n";
print " This file used arrays  $arrays  times\n";
print " This file used hashes $hashes times\n";
print " This file used handles $handles times\n";

In this code, the variables $aType, $hType, and $sType can be reused verbatim elsewhere in the program. If you use double quotes, though, every backslash has to be escaped with a second backslash: the Perl parser consumes one of them when it reads the string, and the pattern searcher needs the one that remains. The dollar sign also needs its own backslash so that it is not interpolated. When using single quotes, you can use the following line:

$sType = '\$\b[a-zA-Z][_0-9a-zA-Z]*\b';

instead of this line:

$sType = "\\\$\\b[a-zA-Z][_0-9a-zA-Z]*\\b";

Make sure that you remember to include the enclosing / characters when using a $variable for a pattern. Forgetting to do this will give erroneous results. Also, be sure you see how each backslash is placed to escape characters correctly.

Shortcuts for Words in Perl

The [] classes for patterns simplify searches quite a bit. In Perl, there are several shortcut patterns that describe words or numbers. You have seen them already in the previous examples and chapters.

Here are the shortcuts:

Shortcut   Description                       Pattern String
\d         Any digit                         [0-9]
\D         Anything other than a digit       [^0-9]
\w         Any word character                [_0-9a-zA-Z]
\W         Anything not a word character     [^_0-9a-zA-Z]
\s         White space                       [ \r\t\n\f]
\S         Anything other than white space   [^ \r\t\n\f]

These escape sequences can be used anywhere ordinary characters are used. For example, the pattern /[\da-z]/ matches any digit or lowercase letter.

The word boundary used by the \b and \B special characters is defined in terms of \w and \W. The patterns /\w\W/ and /\W\w/ can be used to detect word boundaries. If the pattern /\w\W/ matches a pair of characters, the first character is part of a word and the second is not. A word boundary therefore exists between the two characters, and you are at the end of a word.

Conversely, if /\W\w/ matches a pair of characters, the first character is not part of a word and the second character is part of the word. This means that the second character is the beginning of a word. Again, a word boundary exists between the first and second characters matched by the pattern. Therefore, you are at the start of a word.

The quotemeta Function

The quotemeta function puts a backslash in front of any non-word character in a given string. Here's the syntax for quotemeta:

$newstring = quotemeta($oldstring);

The action of the quotemeta function can best be described using regular expressions as

$string =~ s/(\W)/\\$1/g;
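
The equivalence can be checked directly. This sketch compares quotemeta with the substitution above on a made-up filename and then uses the quoted string in a match; the variable names are illustrative.

```perl
use strict;
use warnings;

my $old = "ch1.txt (draft)";    # illustrative string with metacharacters

# quotemeta backslashes every non-word character, including the space.
my $new = quotemeta($old);

# The same result produced by hand with a substitution.
(my $by_hand = $old) =~ s/(\W)/\\$1/g;

print "$new\n";                 # ch1\.txt\ \(draft\)
print "same\n" if $new eq $by_hand;

# The quoted string is safe to use as a literal pattern.
my $found = ("see ch1.txt (draft) today" =~ /$new/) ? 1 : 0;
print "found: $found\n";        # found: 1
```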

Specifying the Number of Matches

Sometimes matching once, twice, or more than once is not sufficient for a particular search. What if you wanted to match from two to four times? In this case you can use the { } operators in the pattern. For example, the following pattern searches for all words that begin with ch followed by two or three digits followed by .txt (the dot is escaped so that it matches only a literal period):

/ch[0-9]{2,3}\.txt/

For exactly three digits after the ch text, you can use this:

/ch[0-9]{3}\.txt/

For three or more digits after the ch text, you can use this:

/ch[0-9]{3,}\.txt/

To match any three characters following the ch text, you can use this:

/ch.{3}\.txt/
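
The counts can be compared side by side by filtering a list of filenames. This is only a sketch; the filenames are made up, and the dot before txt is escaped so that it matches a literal period.

```perl
use strict;
use warnings;

my @files = ("ch1.txt", "ch12.txt", "ch123.txt", "ch1234.txt");

my @two_or_three  = grep { /ch[0-9]{2,3}\.txt/ } @files;
my @exactly_three = grep { /ch[0-9]{3}\.txt/ }   @files;
my @three_or_more = grep { /ch[0-9]{3,}\.txt/ }  @files;

print "{2,3}: @two_or_three\n";     # {2,3}: ch12.txt ch123.txt
print "{3}:   @exactly_three\n";    # {3}:   ch123.txt
print "{3,}:  @three_or_more\n";    # {3,}:  ch123.txt ch1234.txt
```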

Specifying More Than One Choice

Perl enables you to specify more than one choice when attempting to match a pattern. The pipe symbol (|) works like an OR operator, enabling you to specify two or more patterns to match. For example, the pattern

/houston|rockets/

matches the string houston or the string rockets, whichever comes first. You can use special characters with the patterns. For example, the pattern /[a-z]+|[0-9]+/ matches one or more lowercase letters or one or more digits. The match for a valid integer in Perl is defined as this:

/\b\d+\b|\b0[xX][\da-fA-F]+\b/

There are two alternatives to check for here. The first one is \b\d+\b (that is, check for one or more digits to cover both octal and decimal digits). The second, \b0[xX][\da-fA-F]+\b, looks for 0x or 0X followed by hex digits. Any other pattern is disregarded. The delimiting \b anchors limit the search to word boundaries.
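
As a quick check of the integer pattern, the following sketch filters a list of made-up tokens and keeps those that contain a decimal, octal, or hexadecimal integer.

```perl
use strict;
use warnings;

my @tokens = ("42", "0x1F", "017", "0xZZ", "12ab");    # illustrative tokens

# Keep tokens that match either alternative of the integer pattern.
my @ints = grep { /\b\d+\b|\b0[xX][\da-fA-F]+\b/ } @tokens;

print "integers: @ints\n";    # integers: 42 0x1F 017
```

The tokens 0xZZ and 12ab are rejected because in each case the digits run straight into a letter, so there is no word boundary where the pattern requires one.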

Searching a String for More Than One Pattern to Match

Sometimes it's necessary to find the same pattern at more than one location in a string. You saw earlier, in the example for the index function, how we kept the position around between successive searches on one string. Perl offers another alternative to this problem: the pos() function. The pos function returns the offset at which the last global (g) pattern match on a string ended, which is where the next such match will start. The syntax for the pos function is

$offset = pos($string);

where $string is the string whose pattern is being matched. The returned $offset is the number of characters already matched or skipped.

Listing 7.11 presents a simple script to search for the letter n in Bananarama.


Listing 7.11. Using the pos function.
#!/usr/bin/perl
$string = "Bananarama";
while ($string =~ /n/g) {
    $offset = pos($string);
    print("Found an n at $offset\n");
}

Here's the output for this program. Bananarama contains two n characters, at offsets 2 and 4, and pos() reports the offset just past each match:

Found an n at 3
Found an n at 5

The search does not have to start at offset 0. Like the substr() function, pos() can be used on the left side of the equal sign. To start a search at offset 5 (the sixth character), simply type this line before you process the string:

pos($string) = 5;

To restart searching from the beginning, reset the value of pos to 0.
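
Reading and setting pos() can be combined in one short sketch. The starting offset of 6 and the variable names below are chosen only for illustration.

```perl
use strict;
use warnings;

my $string = "Bananarama";
my @ends;

# Each successful /g match leaves pos() just past the matched text.
while ($string =~ /n/g) {
    push @ends, pos($string);
}
print "matches ended at: @ends\n";    # matches ended at: 3 5

# Assigning to pos() moves the starting point of the next /g match.
pos($string) = 6;
my $next;
if ($string =~ /a/g) {
    $next = pos($string);             # the a at offset 7 ends at offset 8
}
print "next match ended at: $next\n"; # next match ended at: 8
```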

Reusing Portions of Patterns

There will be times when you want to write patterns that address groups of numbers. For example, a section of comma-delimited data from the output of a spreadsheet is of this form:

digits,digits,digits,digits

A bit repetitive, isn't it? To extract this tidbit of information from the middle of a document, you could use something like this:

/[\d]+[,.][\d]+[,.][\d]+[,.][\d]+/

What if there were 10 columns? The pattern would be long, and you'd be prone to make mistakes.

Perl provides pattern memory to allow repetitions of a matched sequence. Every part of a pattern that is enclosed in parentheses has the text it matched stored in memory, in the order in which the parentheses appear. To retrieve a sequence from memory, use the special character \n, where n is an integer representing the nth sequence stored in memory. Note that \n matches the same text that the nth group matched, not just any text that fits the same pattern.

For example, you can write the previous pattern using these two parenthesized patterns:

([\d]+)
([,.])

The string that is used for matching the pattern would look like this:

/([\d]+)([,.])\1\2\1\2\1/

The text matched by [\d]+ is stored in memory. When the Perl interpreter sees the escape sequence \1, it matches the text of the first group again. When it sees \2, it matches the text of the second group. Because of this, the pattern above matches a row such as 12,12,12,12 in which every column holds the same digits; to accept different digits in each column, repeat the subpattern itself with a count, as in /\d+([,.]\d+){3}/. Pattern sequences are stored in memory from left to right. As another example, the following matches a phone number in the United States, which is of the form ###-###-####, where the # is a digit, and where \1 requires the second separator to be the same as the first:

/\d{3}(\-)\d{3}\1\d{4}/
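
Because a backreference matches the same text again rather than the same pattern, it helps to try one against data. The following is a small sketch; the sample rows, the phone numbers, and the ^ and $ anchors around the phone pattern are assumptions added for illustration.

```perl
use strict;
use warnings;

# \1 and \2 repeat the exact text captured by groups 1 and 2.
my $same_columns = qr/(\d+)([,.])\1\2\1\2\1/;

my $hit  = ("12,12,12,12" =~ $same_columns) ? 1 : 0;    # 1: columns identical
my $miss = ("12,34,56,78" =~ $same_columns) ? 1 : 0;    # 0: columns differ

# In the phone pattern, \1 forces the second separator to equal the first.
my $phone  = qr/^\d{3}(-)\d{3}\1\d{4}$/;
my $valid  = ("555-867-5309" =~ $phone) ? 1 : 0;        # 1
my $broken = ("555-867.5309" =~ $phone) ? 1 : 0;        # 0

print "hit=$hit miss=$miss valid=$valid broken=$broken\n";
```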

The pattern sequence memory is preserved only for the length of the pattern. You can access these variables for a short time, at least until another pattern match is hit, by examining the special variables of the form $n. The $n variables contain the value of patterns matched in parentheses right after a match. The special variable $& contains the entire matched pattern.

In the previous snippet of code, to get the data matched in the four columns into separate variables, you can use something like this excerpt in a program:

if (/(\d+)[,.](\d+)[,.](\d+)[,.](\d+)/) {
    $matchedPart = $&;
    $col_1 = $1;
    $col_2 = $2;
    $col_3 = $3;
    $col_4 = $4;
}
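
Here is a self-contained sketch of the same idea that can be run as is. It assumes a made-up input line with four comma-separated columns and wraps each column in its own pair of parentheses.

```perl
use strict;
use warnings;

my $line = "totals: 10,20,30,40 end";    # illustrative input line
my ($matched, $col_1, $col_2, $col_3, $col_4);

# One pair of parentheses per column, so $1 through $4 are all set.
if ($line =~ /(\d+)[,.](\d+)[,.](\d+)[,.](\d+)/) {
    $matched = $&;    # the entire matched text
    ($col_1, $col_2, $col_3, $col_4) = ($1, $2, $3, $4);
    print "matched '$matched': $col_1 $col_2 $col_3 $col_4\n";
}
```

Copy $1 and its siblings into your own variables right away, because the next successful match overwrites them.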

The order of precedence when using () is higher than that of other pattern-matching characters. Here is the order of precedence from high to low:

() Pattern memory
+ * ? {} Number of occurrences
^ $ \b \B Pattern anchors
| The OR operator

The pattern-memory special characters () serve as delimiters for the OR operator. The side effect of this delimiting is that the parenthesized part of the pattern is mapped into a $n register. For example, in the following line, the \1 refers to (b|d), not the (a|o) matching pattern:

/(b|d)(a|o)(rk).*\1\2\3/

Pattern-Matching Options

There are several pattern-matching options in Perl to control how strings are matched. You saw these options earlier when I introduced the syntax for pattern matching. Here are the options:

g   Match all possible patterns
i   Ignore case when matching strings
m   Treat string as multiple lines
o   Only evaluate once
s   Treat string as single line
x   Ignore white space in pattern

All these pattern options must be specified immediately after the closing delimiter of the pattern. For example, the following pattern uses the i option to ignore case:

/first.*name/i

More than one option can be specified at one time and can be specified in any order.

The g option tells the Perl interpreter to find every match of the pattern in a string, not just the first one. For example, if the string bananarama is searched using the following pattern:

/.a/g

it will match ba, na, na, ra, and ma. You can assign the return of all these matches to an array. Here's an example:

@words = "bananarama" =~ /.a/g;
for $i (@words) {
    print "$i \n";
}

You can use patterns with the g option in loops. In a scalar context, each evaluation of the match resumes where the previous one ended, and the match returns true until no further match is found. Inside the loop you can use the $& variable. For example, in the word Mississippi, you can loop around looking for doubled letters like this:

$string = "Mississippi";
while ($string =~ /([a-z])\1/g) {
          $found = $&;
          print ("$found\n");
}

Tip
Don't forget that you can use the pos() function in a while loop to see at what position the last match occurred.
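Here is a minimal sketch of pos() inside such a loop. The @found array and the message format are assumptions for illustration. pos() reports the offset just past the end of the most recent match.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $string = "Mississippi";
my @found;

while ($string =~ /([a-z])\1/g) {
    # pos() is where the next search of $string will start.
    push @found, "$& ends at " . pos($string);
}

print "$_\n" foreach @found;
```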

The i option enables you to perform a case-insensitive search. The match will be made regardless of whether the string is uppercase or lowercase or a mixture of cases.

The m option treats the string being searched as multiple lines. When the m option is specified, the ^ special character matches either at the start of the string or at the start of any line within it. Likewise, the $ character matches either just before any newline or at the end of the text.
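The difference is easy to see with a short sketch. The three-line string here is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "alpha\nbeta\ngamma\n";    # hypothetical three-line string

# Without m, ^ matches only at the very start of the string.
my @without_m = $text =~ /^(\w+)/g;

# With m, ^ also matches right after each embedded newline.
my @with_m = $text =~ /^(\w+)/mg;

print "without m: @without_m\n";
print "with m:    @with_m\n";
```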

The o option tells Perl to interpolate any variables in the pattern, and compile the result, only once. The pattern then stays fixed even if those variables change later, so use this option only when you are sure they will not change. It is rarely needed in practice.
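The following sketch shows the effect of the o option with an interpolated variable. The test string and the array names are assumptions for illustration. With o, the pattern keeps the value that $word had the first time through the loop.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "cat only";    # hypothetical text: it contains no "dog"
my (@with_o, @without_o);

foreach my $word ("cat", "dog") {
    # With o the pattern is built once, from "cat", and never rebuilt.
    push @with_o, $word if $text =~ /$word/o;

    # Without o the pattern is rebuilt whenever $word changes.
    push @without_o, $word if $text =~ /$word/;
}

print "with o:    @with_o\n";
print "without o: @without_o\n";
```

The second pass reports a hit for dog with the o option only because Perl is still matching against cat, which is exactly the kind of surprise that makes this option risky.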

Normally the dot (.) character does not match the new line. When you specify the s option, you allow the pattern to be matched across multiple lines because this allows the dot character to be matched with a new line.

The x option tells Perl to ignore any white space in the pattern unless the white space is preceded by a backslash. The real benefit of using the x option is improved readability, because pattern specifications do not have to be crunched together anymore. For example, these two patterns match the same string:

/([\d]+)([,.])\1\2\1\2\1\2/
/([\d]+) ([,.]) \1\2 \1\2 \1\2/x
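With the x option a pattern can also be spread over several lines, and anything from a # to the end of the line is treated as a comment. The sketch below is one way to lay out the pattern above. The sample strings are made up, and the comments deliberately avoid the $, @, and / characters, which would still be special inside the pattern.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $compact = qr/([\d]+)([,.])\1\2\1\2\1\2/;

my $spaced = qr/
    ([\d]+)    # a run of digits, remembered as group 1
    ([,.])     # a separator, remembered as group 2
    \1 \2      # the same digits and separator again
    \1 \2      # and a third time
    \1 \2      # and a fourth time
/x;

# Both forms should agree on every sample.
my @agree;
foreach my $sample ("7.7.7.7.", "7.7.7.8.") {
    my $c = ($sample =~ $compact) ? 1 : 0;
    my $s = ($sample =~ $spaced)  ? 1 : 0;
    push @agree, "$sample:$c$s";
}
print "@agree\n";
```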

Substituting Text Through Pattern Matching

You have already seen how to substitute text through the use of the substr function. The pattern-matching function can be extended to do string substitution with the use of the s operator. Here's the syntax:

s/pattern/replacement/[options]

The replacement is not a pattern: pattern special characters have no special meaning in it. It is treated like a double-quoted string, however, so variables such as $1 are interpolated. The Perl interpreter searches for the pattern specified by the placeholder pattern. If it finds the pattern, it replaces the matched text with the string represented by the placeholder replacement. Here's an example:

$string = "cabi.net";
$string =~ s/cabi/sig/;

The contents of $string will be sig.net instead of cabi.net.

The good news is that all the pattern matching stuff up to this point in the chapter applies here! So, you can use any of the pattern special characters in the substitution operator. For example, the following replaces the first run of one or more digits with the letter X:

s/[\d]+/X/

Specify an empty string for the replacement if you just want to delete the matched text. For example, the following line deletes the first run of one or more digits:

s/[\d]+//

The pattern match memory sequence applies here. For example, to swap the two columns of data, you can use this line:

s/(\d+)\s(\d+)/$2 $1/

The substitution pattern matches a sequence of one or more digits, followed by a space, followed by another set of digits. The output is the values of the $1 and $2 registers swapped in sequence.
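As a sketch of the column swap, assuming a made-up row of two numbers separated by a single space:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $row = "1995 42";    # hypothetical two-column row

# Copy the row first, then swap the columns in the copy.
(my $swapped = $row) =~ s/(\d+)\s(\d+)/$2 $1/;

print "$row -> $swapped\n";
```

Both columns need their own parentheses so that each one lands in its own $n register.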

The substitution operator supports several options just like the match operator:

g Change all occurrences of the pattern
i Ignore case in pattern
e Evaluate replacement string as expression
m Treat string to be matched as multiple lines
o Only evaluate once
s Treat string to be matched as single line
x Ignore white space in pattern

As with pattern matching, options are appended to the end of the operator. Most of these options work the same way as they did for matching patterns during a search.

The g option changes all occurrences of a pattern in a particular string. For instance, the following substitution puts parentheses around all the numbers in a line:

s/(\d+)/($1)/g

The i option ignores case when substituting. For example, the substitution

s/\bweb\b/WEB/gi

replaces all occurrences of the words web, WeB, wEB, and so on with the word WEB.
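Here is a small sketch of the g and i options together. The sentence is invented for illustration. The substitution operator also returns the number of replacements it made, which the sketch keeps in $count.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "Web pages, the wEB, and a cobweb";    # hypothetical sentence

# \b keeps "cobweb" from matching; g and i catch every spelling.
my $count = ($text =~ s/\bweb\b/WEB/gi);

print "$count changes: $text\n";
```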

Although you cannot put patterns in the replacement string, you can have Perl evaluate it as code. The e option treats the replacement string as a Perl expression, which it evaluates for each match. The result of the evaluation is used as the replacement text. Suppose that you wanted to double a number on a line. A common use is to redefine the values in a header file to twice what they are. For example, given a line such as

#define ABX 123

the substitution matches the numeric part of the line and replaces it with twice its value. Listing 7.12 presents a simple script to do this with a C header file.


Listing 7.12. Using the pattern replace to do simple operations.
#!/usr/bin/perl

# Double the first number on every line that contains "define".
open (FILE, "tt.h") || die "Cannot open tt.h: $!";
while (<FILE>) {
        $string = $_;
        if (/define/) {
            # The e option evaluates "$1 * 2" as Perl code.
            $string =~ s/(\d+)/$1 * 2/e;
        }
        print $string;    # the line already ends with a newline
}

close FILE;
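The same idea can be tried without a header file. This sketch works on a made-up list of lines. The data, the variable names, and the stricter pattern (which only touches a number at the end of a #define line) are assumptions for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical header lines standing in for the contents of a file.
my @header = ("#define ABX 123", "#define MAX_LEN 40", "int x;");
my @doubled;

foreach my $line (@header) {
    my $copy = $line;
    # e evaluates the replacement as an expression for each match.
    $copy =~ s/(\d+)$/$1 * 2/e if $copy =~ /^#define/;
    push @doubled, $copy;
}

print "$_\n" foreach @doubled;
```

Anchoring the digits with $ is a design choice for the sketch: it keeps a digit inside a macro name from being doubled by accident.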

The o option tells the Perl interpreter to interpolate any scalar variables in the pattern only once, the first time the substitution is evaluated. Later changes to those variables do not affect the pattern.

The s option ensures that the newline character \n is matched by the . special character.
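A minimal sketch of the s option in a substitution follows, assuming a made-up string in which an HTML-style comment spans two lines:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "keep <!-- a\ncomment --> this";    # hypothetical input

# Without s, .* stops at the newline, so nothing is replaced.
(my $plain = $text) =~ s/<!--.*-->//;

# With s, .* runs across the newline and the comment is removed.
(my $single = $text) =~ s/<!--.*-->//s;

print "without s: $plain\n";
print "with s:    $single\n";
```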

With the m option, the ^ and $ characters match the start and end of any line as they do in pattern matches.

The \A and \Z escape sequences always match only at the beginning and end of the string (\Z also matches just before a newline that ends the string). These escape sequences are not affected by the s or m options.

The x option causes the interpreter to ignore all white spaces unless they are escaped by a backslash. The only benefit gained from this operation is to make patterns easier to read. See the example for using the x option shown in the pattern-matching options section earlier in this chapter.

The forward slash (/) delimiter can be substituted with another character for showing where to delimit text. For example, you can use <>, # (hash), or () (parentheses) characters as delimiters, as illustrated in Listing 7.13.


Listing 7.13. Using a different delimiter for the forward slash.
#!/usr/bin/perl

$name = "/usr/local/lib";

$s1 = $name ;
$s1 =~ s#/usr/local/#/local/#;
print $s1 . "\n";

$s2 = $name ;
$s2 =~ s</usr/local/></local/>;
print $s2 . "\n";

$s3 = $name ;
$s3 =~ s(/usr/local/)(/local/);
print $s3 . "\n";

The Translation Operator

The UNIX tr command is also available in Perl as the tr function. The tr function lets you substitute one group of characters with another. Here's the syntax:

tr/string1/string2/

where string1 contains a list of characters to be replaced, and string2 contains the characters that replace them. Each character in string1 is replaced with the character in the same position in string2.

If string1 is longer than string2, the last character of string2 is repeated to pad string2 to the length of string1. If the same character appears more than once in string1, the first replacement found will be used.

$string = "12345678901234567890";
$string =~ tr/2345/ABC/;

Here, all characters 2, 3, 4, and 5 in the string are replaced with A, B, C, and C, respectively. Perl repeats the C to make the length of the replacement list equal to that of the list being replaced. So, the effective replacement list is "ABCC" for matching with "2345".
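Running the example confirms the padding. In this sketch the $changed variable is an addition for illustration; it holds the value returned by tr, which is the number of characters translated.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $string = "12345678901234567890";

# string2 is one character short, so its final C also stands in for 5.
my $changed = ($string =~ tr/2345/ABC/);

print "$changed characters changed: $string\n";
```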

The most common use of the translation operator is to convert a string from uppercase to lowercase, or vice versa.

while ($line = <STDIN>) {
         $line =~ tr/A-Z/a-z/;
         print ($line);
}

To convert all characters in a string to uppercase, here's a similar loop:

while ($line = <STDIN>) {
         $line =~ tr/a-z/A-Z/;
         print ($line);
}

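There are a few things about the tr operator that you should remember:

If no variable is bound to tr with =~, it works on the $_ variable.
tr returns the number of characters it translated, so it can be used for counting.
tr does not support the pattern special characters; it works with literal characters and ranges such as a-z.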

The program in Listing 7.14 tallies the number of times vowels are used in a text file.


Listing 7.14. Tallying vowels in a text file.
#!/usr/bin/perl

$count = 0;
$total = 0;

while ($input = <STDIN>) {
        chop ($input);                  # remove the trailing newline
        $total += length($input);       # running total of characters read
        $_ = $input;                    # tr works on $_ by default
        $count += tr/aeiou/aeiou/;      # tr returns the number of matches
}

print ("In this file, there are: $count vowels \n");
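The counting trick also works on a single string. In this sketch the text is made up, uppercase vowels are included as well, and string2 is left empty, in which case tr counts the characters without changing them.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "Hello World";    # hypothetical input

# An empty string2 (and no d option) leaves the string as it was.
my $vowels = ($text =~ tr/aeiouAEIOU//);

print "$vowels vowels in '$text'\n";
```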

The translation operator supports three options. These options are specified after the patterns using this syntax:

tr/string1/string2/[cds]

Here are the options for tr:

c Translate all characters not specified
d Delete all specified characters
s Replace multiple identical output characters with a single character

The c option stands for complement. That is, tr translates every character that is not listed in string1. Because tr does not understand pattern escapes such as \w, the characters must be given as ranges. For example, the following line replaces all characters that are not in [a-zA-Z0-9] with a space:

$onlyDigits =~ tr/a-zA-Z0-9/ /c;

The d option deletes every specified character:

$noDigits =~ tr/0-9//d;

This deletes all the digits from $noDigits.
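Here is a short sketch of the c and d options side by side. The sample string and the variable names are assumptions for illustration, and the character lists are written as ranges.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $raw = "a1-b2!";    # hypothetical input

# c: everything that is NOT a letter or digit becomes a space.
(my $alnum_only = $raw) =~ tr/a-zA-Z0-9/ /c;

# d: every listed character with no replacement is deleted.
(my $no_digits = $raw) =~ tr/0-9//d;

print "c option: '$alnum_only'\n";
print "d option: '$no_digits'\n";
```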

The s option stands for squeeze. With the s option, tr outputs only one character if two or more consecutive characters translate to the same output character. For example, using the c and s options together with a single space as string2 replaces every run of characters that are not digits with just one space.
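A minimal sketch of squeezing, assuming a made-up string with letters and punctuation between groups of digits:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $string = "12abc34,,56";    # hypothetical input

# c picks every non-digit; s collapses each run of spaces into one.
(my $squeezed = $string) =~ tr/0-9/ /cs;

print "'$string' -> '$squeezed'\n";
```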

Extended Pattern Matching

Extended pattern-matching capabilities are available through this construct:

(?ccpattern)

cc is a short code representing the extended pattern-matching capability being used for the pattern. The codes are shown here with the leading question mark of the construct:

?: Do not store the pattern in parentheses in memory.
?o Where o can be an option to apply to the pattern and can be i for case insensitive, m for multiple lines, s for single line, or x for ignore white space.
?= Look ahead in buffer.
?! Look ahead in buffer and require that the pattern is not there.
?# Add comments.

You have seen how () stores a pattern match in memory. By using ?: you can force the pattern not to be stored in memory. In the following two statements, \1 points to \d+ in the first and [a-z]+ in the second:

/(\d+)([a-z]+)/
/(?:\d+)([a-z]+)/
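The effect on the $n registers can be checked with a short sketch. The input string and the variable names are assumptions for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "42abc";    # hypothetical input

# Both groups are remembered, so the first one holds the digits.
my ($first_when_stored) = $text =~ /(\d+)([a-z]+)/;

# (?:...) groups without remembering, so the letters come first.
my ($first_when_skipped) = $text =~ /(?:\d+)([a-z]+)/;

print "with ():   $first_when_stored\n";
print "with (?:): $first_when_skipped\n";
```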

The string, ?o, specifies a pattern-matching option within the pattern itself. The o could be i for ignore case. For example, the following patterns are the same:

/[a-z]+/i
/(?i)[a-z]+/

You can specify different cases for different parts of the same search pattern. Here's an example:

$pattern1 = "[A-Z]+";
$pattern2 = "(?i)[a-z0-9_]+";
if ($string =~ /$pattern1|$pattern2/) {
        ...
}

This pattern matches either any collection of uppercase letters or any collection of letters (in either case), digits, and underscores.

You can use the ?= feature to look ahead for a pattern. For example, the pattern

/123(?=XYZ)/

only matches 123 if it is immediately followed by XYZ. The matched string in $& will be 123, not 123XYZ.

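The ?! operator is the negative form of look-ahead. For example,

/123(?!XYZ)/

matches 123 only if it is not immediately followed by XYZ. The matched pattern in $& will still be 123. To look at the text just before the match instead, newer versions of Perl provide the (?<=pattern) and (?<!pattern) forms.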
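The two look-ahead forms, (?=pattern) and (?!pattern), can be compared with a small sketch. The two test strings and the %result hash are assumptions for illustration. In neither case does the text that is looked at become part of $&.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %result;
foreach my $text ("123XYZ", "123ABC") {
    # (?=XYZ) requires XYZ to follow, but does not consume it.
    $result{$text}{followed}     = ($text =~ /123(?=XYZ)/) ? $& : "no match";
    # (?!XYZ) requires that XYZ does NOT follow.
    $result{$text}{not_followed} = ($text =~ /123(?!XYZ)/) ? $& : "no match";
}

foreach my $text (sort keys %result) {
    print "$text: (?=XYZ) gives $result{$text}{followed}, ",
          "(?!XYZ) gives $result{$text}{not_followed}\n";
}
```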

Reading complicated patterns is not easy, even if you are the author, and adding comments makes them easier to follow. You can add comments about a pattern with the ?# operator. Here's an example:

/(?i)[a-z][\d]{2,3}(?

The above example will match a letter, in either case because of (?i), followed by two or three digits.
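As a runnable sketch of an embedded comment, the pattern below carries a (?# ) note inside it. The wording of the comment and the sample codes are my own assumptions for illustration. Everything between (?# and the closing parenthesis is ignored by the match.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The comment text below is illustrative only.
my $re = qr/(?i)[a-z]\d{2,3}(?# a letter, then two or three digits)/;

my @codes = ("A42", "b123", "C7");    # hypothetical inputs
my @ok    = grep { $_ =~ $re } @codes;

print "matched: @ok\n";
```

C7 is rejected because only one digit follows the letter.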

Summary

With the function substr you can extract a substring from a string or replace a portion of a string or append to the front or back end of another string. The lc and uc functions convert strings to lowercase and uppercase. The first letter of a string can be converted to lowercase or uppercase using either lcfirst or ucfirst. The quotemeta function places a backslash in front of every nonword character in a string. New character strings can be created using join, which creates a string from the members of a list, and sprintf, which works like printf except that the output goes to a string. Functions that search character strings include index, which searches for a substring starting from the left of a string, and rindex, which searches for a substring starting from the right of a string. You can retrieve the length of a character string using length. The pos function enables you to determine or set the current pattern-matching location in a string. The tr function replaces one set of characters with another.