
Tuesday, 11 August 2015

Regular Expressions

You can use a regular expression to find patterns in strings: for example, to look for a specific name in a phone list or all of the names that start with the letter a. Pattern matching is one of Perl's most powerful and probably least understood features. But after you read this chapter, you'll be able to handle regular expressions almost as well as a Perl guru. With a little practice, you'll be able to do some incredibly handy things.
There are three main uses for regular expressions in Perl: matching, substitution, and translation. The matching operation uses the m// operator, which evaluates to a true or false value. The substitution operation substitutes one expression for another; it uses the s/// operator. The translation operation translates one set of characters to another and uses the tr/// operator. These operators are summarized in Table 10.1.
Table 10.1 - Perl's Regular Expression Operators
Operator Description
m/PATTERN/ This operator returns true if PATTERN is found in $_.
s/PATTERN/REPLACEMENT/ This operator replaces the substring matched by PATTERN with REPLACEMENT.
tr/CHARACTERS/REPLACEMENTS/ This operator replaces characters specified by CHARACTERS with the characters in REPLACEMENTS.
All three regular expression operators work with $_ as the string to search. You can use the binding operators (see the section "The Binding Operators (=~ and !~)" later in this chapter) to search a variable other than $_.
Both the matching (m//) and the substitution (s///) operators perform variable interpolation on the PATTERN and REPLACEMENT strings. This comes in handy if you need to read the pattern from the keyboard or a file.
If the match pattern evaluates to the empty string, the last successfully matched pattern is used instead. So, if you see a statement like print if //; in a Perl program, look for the previous regular expression operator to see what the pattern really is. The substitution operator also uses this interpretation of the empty pattern.
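The empty-pattern rule is easy to trip over, so here is a minimal sketch of it. The variable names $first and $second are invented for this illustration; the strings are modeled on the examples used later in the chapter.

```perl
use strict;
use warnings;

# This match succeeds, so /bbb/ becomes the last successfully matched pattern.
$_ = "AAA bbb AAA";
my $first = m/bbb/ ? 1 : 0;

# The empty pattern now stands for /bbb/, which does not occur in this string.
$_ = "AAA ccc AAA";
my $second = m// ? 1 : 0;

print "first = $first, second = $second\n";
```

If the empty pattern simply matched anything, $second would be 1. Because it reuses the earlier pattern, the second match fails.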
In this chapter, you learn about pattern delimiters and then about each type of regular expression operator. After that, you learn how to create patterns in the section "How to Create Patterns". Then, the "Pattern Examples" section shows you some situations and how regular expressions can be used to resolve them.

Pattern Delimiters

Every regular expression operator allows the use of alternative pattern delimiters. A delimiter marks the beginning and end of a given pattern. In the following statement,
m//;
you see two of the standard delimiters - the slashes (//). However, you can use almost any non-alphanumeric, non-whitespace character as the delimiter. This feature is useful if you want to use the slash character inside your pattern. For instance, to match a file path you would normally use:
m/\/root\/home\/random.dat/
This match statement is hard to read because all of the slashes seem to run together (some programmers say they look like teepees). If you use an alternate delimiter, it might look like this:
m!/root/home/random.dat!
or
m{/root/home/random.dat}
You can see that these examples are a little clearer. The last example also shows that if a left bracket is used as the starting delimiter, then the ending delimiter must be the right bracket.
Errata Note
The printed version of this book shows the above examples as m!\/root\/home\/random.dat! and as m{\/root\/home\/random.dat}. While I was writing the book it did not occur to me that the / character was not a metacharacter and only needed to be escaped because of the delimiters. Obviously, if the / character is the delimiter, it needs to be escaped in order to use it inside the pattern. However, if an alternative delimiter is used, it no longer needs to be escaped. This fact was pointed out to me by Garen Deve.
Both the match and substitution operators let you use variable interpolation. You can take advantage of this to use a single-quoted string that does not require the slash to be escaped. For instance:
$file = '/root/home/random.dat';
m/$file/; 
You might find that this technique yields clearer code than simply changing the delimiters. If you choose the single quote as your delimiter character, then no variable interpolation is performed on the pattern. However, you still need to use the backslash character to escape any of the meta-characters discussed in the "How to Create Patterns" section later in this chapter.
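The delimiter choices described in this section can be compared side by side. The following is a small sketch rather than part of the original text: the result variable names are invented, the period is escaped so that it matches literally, and the \Q...\E pair around $file is an added assumption for the same reason.

```perl
use strict;
use warnings;

$_ = "/root/home/random.dat";

# The same pattern written with three different delimiters.
my $with_slashes = m/\/root\/home\/random\.dat/ ? 1 : 0;
my $with_bangs   = m!/root/home/random\.dat!    ? 1 : 0;
my $with_braces  = m{/root/home/random\.dat}    ? 1 : 0;

# Interpolating a single-quoted string. \Q...\E is an addition here so that
# the period in the file name is not treated as a meta-character.
my $file = '/root/home/random.dat';
my $with_variable = m/\Q$file\E/ ? 1 : 0;

print "$with_slashes $with_bangs $with_braces $with_variable\n";
```

All four forms match the same string; the difference is only in how much escaping the reader has to wade through.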

Tip
I tend to avoid delimiters that might be confused with characters in the pattern. For example, using the plus sign as a delimiter (m+abc+) does not help program readability. A casual reader might think that you intend to add two expressions instead of matching them.
Caution
The ? has a special meaning when used as a match pattern delimiter. It works like the / delimiter except that it matches only once between calls to the reset() function. This feature may be removed in future versions of Perl, so avoid using it.
The next few sections look at the matching, substitution, and translation operators in more detail.

The Matching Operator (m//)

The matching operator (m//) is used to find patterns in strings. One of its more common uses is to look for a specific string inside a data file. For instance, you might look for all customers whose last name is "Johnson" or you might need a list of all names starting with the letter s. By default, the matching operator searches the $_ variable. This makes the match statement shorter because you don't need to specify where to search. Here is a quick example:
$_ = "AAA bbb AAA";
print "Found bbb\n" if  m/bbb/;
The print statement is executed only if the bbb character sequence is found in the $_ variable. In this particular case, bbb will be found, so the program will display the following:
Found bbb
The matching operator allows you to use variable interpolation in order to create the pattern. For example:
$needToFind = "bbb";
$_ = "AAA bbb AAA";
print "Found bbb\n" if  m/$needToFind/;
Using the matching operator is so commonplace that Perl allows you to leave off the m from the matching operator as long as slashes are used as delimiters:
$_ = "AAA bbb AAA";
print "Found bbb\n" if  /bbb/;
Using the matching operator to find a string inside a file is very easy because the defaults are designed to facilitate this activity. For example:
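$target = "M";

# Stop with a message if the data file cannot be opened.
open(INPUT, "<findstr.dat") or die("Can't open findstr.dat: $!");

# Each line read from INPUT is placed in $_, which is what /$target/ searches.
while (<INPUT>) {
     if (/$target/) {
         print "Found $target on line $.\n";
     }
}
close(INPUT);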
Note
The $. special variable keeps track of the record number. Every time the diamond operators read a line, this variable is incremented.
This example reads every line in an input file, searching for the letter M. When an M is found, the print statement is executed. The print statement prints the letter that is found and the number of the line it was found on.

The Matching Options

The matching operator has several options that enhance its utility. The most useful options are probably the capability to ignore case and the capability to create an array of all matches in a string. Table 10.2 shows the options you can use with the matching operator.
Table 10.2 - Options for the Matching Operator
Option Description
g This option finds all occurrences of the pattern in the string. A list of matches is returned or you can iterate over the matches using a loop statement.
i This option ignores the case of characters in the string.
m This option treats the string as multiple lines. With it, the ^ and $ anchors match at the beginning and end of each line inside the string instead of only at the beginning and end of the whole string.
o This option compiles the pattern only once. You can achieve some small performance gains with this option. It should be used with variable interpolation only when the value of the variable will not change during the lifetime of the program.
s This option treats the string as a single line, so the period meta-character also matches the newline character.
x This option lets you use extended regular expressions. Basically, this means that Perl will ignore whitespace that's not escaped with a backslash or within a character class. I highly recommend this option so you can use spaces to make your regular expressions more readable. See the section "Example: Extension Syntax" later in this chapter for more information.
All options are specified after the last pattern delimiter. For instance, if you want the match to ignore the case of the characters in the string, you can do this:
$_ = "AAA BBB AAA";
print "Found bbb\n" if  m/bbb/i;
This program finds a match even though the pattern uses lowercase and the string uses uppercase because the /i option was used, telling Perl to ignore the case. The result from a global pattern match can be assigned to an array variable or used inside a loop. This feature comes in handy after you learn about meta-characters in the section called "How to Create Patterns" later in this chapter.
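Here is a short sketch of the two ways a global match can be used, as a list and as a loop condition. The sample string and the names @all and $count are made up for this illustration.

```perl
use strict;
use warnings;

$_ = "AAA bbb AAA Bbb";

# In list context, /g returns every match; /i makes the match ignore case.
my @all = m/bbb/gi;

# In scalar context, /g finds one match per evaluation, so it can drive a loop.
my $count = 0;
while (m/aaa/gi) {
    $count++;
}

print "matches: @all\n";
print "count: $count\n";
```

The list form is convenient when you want the matched text itself; the loop form is handy when you need to do some work at each match.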

The Substitution Operator (s///)

The substitution operator (s///) is used to change strings. It requires two operands, like this:
s/a/z/;
This statement changes the first a in $_ into a z. Not too complicated, huh? Things won't get complicated until we start talking about regular expressions in earnest in the section "How to Create Patterns" later in the chapter. You can use variable interpolation with the substitution operator just as you can with the matching operator. For instance:
$needToReplace   = "bbb";
$replacementText = "1234567890";
$_ = "AAA bbb AAA";
$result = s/$needToReplace/$replacementText/;
Note
You can use variable interpolation in the replacement pattern as shown here, but none of the meta-characters described later in the chapter can be used in the replacement pattern.
This program changes the $_ variable to hold "AAA 1234567890 AAA" instead of its original value, and the $result variable will be equal to 1 - the number of substitutions made.
Frequently, the substitution operator is used to remove substrings. For instance, if you want to remove the "bbb" sequence of characters from the $_ variable, you could do this:
s/bbb//;
By replacing the matched string with nothing, you have effectively deleted it. If brackets of any type are used as delimiters for the search pattern, you need to use a second set of brackets to enclose the replacement pattern. For instance:
$_ = "AAA bbb AAA";
$result = s{bbb}{1234567890};

The Substitution Options

Like the matching operator, the substitution operator has several options. One interesting option is the capability to evaluate the replacement pattern as an expression instead of a string. You could use this capability to find all numbers in a file and multiply them by a given percentage, for instance. Or you could repeat matched strings by using the string repetition operator. Table 10.3 shows all of the options you can use with the substitution operator.
Table 10.3 - Options for the Substitution Operator
Option Description
e This option forces Perl to evaluate the replacement pattern as an expression.
g This option replaces all occurrences of the pattern in the string.
i This option ignores the case of characters in the string.
m This option treats the string as multiple lines. With it, the ^ and $ anchors match at the beginning and end of each line inside the string instead of only at the beginning and end of the whole string.
o This option compiles the pattern only once. You can achieve some small performance gains with this option. It should be used with variable interpolation only when the value of the variable will not change during the lifetime of the program.
s This option treats the string as a single line, so the period meta-character also matches the newline character.
x This option lets you use extended regular expressions. Basically, this means that Perl ignores whitespace that is not escaped with a backslash or within a character class. I highly recommend this option so you can use spaces to make your regular expressions more readable. See the section "Example: Extension Syntax" later in this chapter for more information.
The /e option changes the interpretation of the pattern delimiters. If used, variable interpolation is active even if single quotes are used. Note that Perl 5 treats back quotes as ordinary delimiters: the replacement pattern is not executed as a DOS or UNIX command.
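The two uses of the /e option mentioned at the start of this section can be sketched as follows. The sample strings and the 50 percent increase are arbitrary choices for the illustration, and the $1 variable (pattern memory) is explained later in the chapter.

```perl
use strict;
use warnings;

# Multiply every number by a percentage. /e evaluates the replacement as an
# expression and /g repeats the substitution for every match.
$_ = "apples 10 pears 20";
my $changed = s/(\d+)/$1 * 1.5/ge;
my $scaled  = $_;

# Repeat each matched word character with the string repetition operator.
$_ = "ab-";
s/(\w)/$1 x 2/ge;
my $doubled = $_;

print "$scaled ($changed substitutions)\n";
print "$doubled\n";
```

Without /e, the replacement text would be inserted as a string, so the first substitution would produce text like "10 * 1.5" rather than the computed value.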

The Translation Operator (tr///)

The translation operator (tr///) is used to change individual characters in the $_ variable. It requires two operands, like this:
tr/a/z/;
This statement translates all occurrences of a into z. If you specify more than one character in the match character list, you can translate multiple characters at a time. For instance:
tr/ab/z/;
translates all a and all b characters into the z character. If the replacement list of characters is shorter than the target list of characters, the last character in the replacement list is repeated as often as needed. However, if more than one replacement character is given for a matched character, only the first is used. For instance:
tr/WWW/ABC/;
results in all W characters being converted to an A character. The rest of the replacement list is ignored. Unlike the matching and substitution operators, the translation operator doesn't perform variable interpolation.
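The rules about list lengths are easier to see with concrete strings. This is a minimal sketch; the sample strings and variable names are invented for the illustration.

```perl
use strict;
use warnings;

# The replacement list is shorter than the match list, so its last
# character (z) is reused: both a and b become z.
$_ = "a bad cab";
my $count   = tr/ab/z/;
my $shorter = $_;

# W appears three times in the match list; only the first pairing (W to A)
# is used.
$_ = "WOW";
tr/WWW/ABC/;
my $repeated = $_;

print "$shorter ($count characters translated)\n";
print "$repeated\n";
```

The value returned by tr is the number of characters that matched, which is why $count can be printed alongside the translated string.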
Note
The tr operator gets its name from the UNIX tr utility. If you are familiar with the tr utility, then you already know how to use the tr operator. The UNIX sed utility uses a y to indicate translations. To make learning Perl easier for sed users, y is supported as a synonym for tr.

The Translation Options

The translation operator has options different from the matching and substitution operators. You can delete matched characters, replace repeated characters with a single character, and translate only characters that don't match the character list. Table 10.4 shows the translation options.
Table 10.4 - Options for the Translation Operator
Option Description
c This option complements the match character list. In other words, the translation is done for every character that does not match the character list.
d This option deletes any character in the match list that does not have a corresponding character in the replacement list.
s This option reduces repeated instances of matched characters to a single instance of that character.
Normally, if the match list is longer than the replacement list, the last character in the replacement list is used as the replacement for the extra characters. However, when the d option is used, the matched characters are simply deleted.
If the replacement list is empty, then no translation is done. The operator will still return the number of characters that matched, though. This is useful when you need to know how often a given letter appears in a string. This feature also can compress repeated characters using the s option.
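As a sketch of these options in use, the following counts a letter, squeezes repeated spaces, and then deletes everything that is not a lowercase letter. The sample string and variable names are assumptions made for the illustration.

```perl
use strict;
use warnings;

$_ = "bookkeeper's   ledger";

# Empty replacement list: nothing is changed, but matches are still counted.
my $e_count   = tr/e//;
my $unchanged = $_;

# The s option reduces each run of spaces to a single space.
tr/ //s;
my $squeezed = $_;

# The c option complements the list (everything except a-z) and the
# d option deletes those characters.
tr/a-z//cd;
my $letters_only = $_;

print "$e_count\n$unchanged\n$squeezed\n$letters_only\n";
```

Each step works on the result of the previous one, since tr modifies $_ in place.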

Tip
UNIX programmers may be familiar with using the tr utility to convert lowercase characters to uppercase characters, or vice versa. Perl now has the lc() and uc() functions that can do this much more quickly.
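For comparison, here is a small sketch of both approaches to case conversion. The sample string and variable names are made up for the illustration.

```perl
use strict;
use warnings;

my $name = "Perl guru";

# The built-in functions return a converted copy of the string.
my $upper = uc($name);
my $lower = lc($name);

# The tr equivalent changes $_ in place, using character ranges.
$_ = $name;
tr/a-z/A-Z/;
my $via_tr = $_;

print "$upper\n$lower\n$via_tr\n";
```

Note that uc() and lc() leave the original variable alone, while tr changes the string it operates on.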

The Binding Operators (=~ and !~)

The search, modify, and translation operations work on the $_ variable by default. What if the string to be searched is in some other variable? That's where the binding operators come into play. They let you bind the regular expression operators to a variable other than $_. There are two forms of the binding operator: the regular =~ and its complement !~. The following small program shows the syntax of the =~ operator:
$scalar       = "The root has many leaves";
$match        = $scalar =~ m/root/;
$substitution = $scalar =~ s/root/tree/;
$translate    = $scalar =~ tr/h/H/;

print("\$match        = $match\n");
print("\$substitution = $substitution\n");
print("\$translate    = $translate\n");
print("\$scalar       = $scalar\n");
This program displays the following:
$match        = 1
$substitution = 1
$translate    = 2
$scalar       = THe tree Has many leaves
This example uses all three of the regular expression operators with the regular binding operator. Each of the regular expression operators was bound to the $scalar variable instead of $_. This example also shows the return values of the regular expression operators. If you don't need the return values, you could do this:
$scalar = "The root has many leaves";
print("String has root.\n") if $scalar =~ m/root/;
$scalar =~ s/root/tree/;
$scalar =~ tr/h/H/;
print("\$scalar = $scalar\n");
This program displays the following:
String has root.
$scalar = THe tree Has many leaves
The left operand of the binding operator is the string to be searched, modified, or transformed; the right operand is the regular expression operator to be evaluated. The complementary binding operator is valid only when used with the matching regular expression operator. If you use it with the substitution or translation operator, you get the following message if you're using the -w command-line option to run Perl:
Useless use of not in void context at test.pl line 4.
You can see that the !~ is the opposite of =~ by replacing the =~ in the previous example:
$scalar = "The root has many leaves";
print("String has root.\n") if $scalar !~ m/root/;
$scalar =~ s/root/tree/;
$scalar =~ tr/h/H/;
print("\$scalar = $scalar\n");
This program displays the following:
$scalar = THe tree Has many leaves
The first print line does not get executed because the complementary binding operator returns false.

How to Create Patterns

So far in this chapter, you've read about the different operators used with regular expressions, and you've seen how to match simple sequences of characters. Now we'll look at the wide array of meta-characters that are used to harness the full power of regular expressions. Meta-characters are characters that have an additional meaning above and beyond their literal meaning. For example, the period character can have two meanings in a pattern. First, it can be used to match a period character in the searched string - this is its literal meaning. And second, it can be used to match any character in the searched string except for the newline character - this is its meta-meaning. When creating patterns, the meta-meaning will always be the default. If you really intend to match the literal character, you need to prefix the meta-character with a backslash. You might recall that the backslash is used to create an escape sequence.
Patterns can have many different components. These components all combine to provide you with the power to match any type of string. The following list of components will give you a good idea of the variety of ways that patterns can be created. The section "Pattern Examples" later in this chapter shows many examples of these rules in action.
  • Variable Interpolation: Any variable is interpolated, and the essentially new pattern is then evaluated as a regular expression. Remember that only one level of interpolation is done. This means that if the value of the variable includes, for example, $scalar as a string value, then $scalar will not be interpolated. In addition, back-quotes do not interpolate within double-quotes, and single-quotes do not stop interpolation of variables when used within double-quotes.
  • Self-Matching Characters: Any character will match itself unless it is a meta-character or one of $, @, %, &. The meta-characters are listed in Table 10.5, and the other characters are used to begin variable names and function calls. You can use the backslash character to force Perl to match the literal meaning of any character. For example, m/a/; will return true if the letter a is in the $_ variable. And m/\$/; will return true if the character $ is in the $_ variable.
    Table 10.5 - Regular Expression Meta-Characters, Meta-Brackets, and Meta-Sequences
    Meta-Character Description
    ^ This meta-character - the caret - will match the beginning of a string or if the /m option is used, matches the beginning of a line. It is one of two pattern anchors - the other anchor is the $.
    . This meta-character will match any character except for the newline unless the /s option is specified. If the /s option is specified, then the newline will also be matched.
    $ This meta-character will match the end of a string or if the /m option is used, matches the end of a line. It is one of two pattern anchors - the other anchor is the ^.
    | This meta-character - called alternation - lets you specify two values that can cause the match to succeed. For instance, m/a|b/ means that the $_ variable must contain the "a" or "b" character for the match to succeed.
    * This meta-character indicates that the "thing" immediately to the left should be matched 0 or more times in order to be evaluated as true.
    + This meta-character indicates that the "thing" immediately to the left should be matched 1 or more times in order to be evaluated as true.
    ? This meta-character indicates that the "thing" immediately to the left should be matched 0 or 1 times in order to be evaluated as true. When used in conjunction with the *, +, ?, or {n, m} meta-characters and brackets, it means that the regular expression should be non-greedy and match the smallest possible string.

    Meta-Brackets Description
    () The parentheses let you affect the order of pattern evaluation and act as a form of pattern memory. See the section "Pattern Memory" later in this chapter for more information.
    (?...) If a question mark immediately follows the left parentheses, it indicates that an extended mode component is being specified. See the section "Example: Extension Syntax" later in this chapter for more information.
    {n, m} The curly braces let you specify how many times the "thing" immediately to the left should be matched. {n} means that it should be matched exactly n times. {n,} means it must be matched at least n times. {n, m} means that it must be matched at least n times and not more than m times.
    [] The square brackets let you create a character class. For instance, m/[abc]/ will evaluate to true if any of "a", "b", or "c" is contained in $_. The square brackets are a more readable alternative to the alternation meta-character.

    Meta-Sequences Description
    \ This meta-character "escapes" the following character. This means that any special meaning normally attached to that character is ignored. For instance, if you need to include a dollar sign in a pattern, you must use \$ to avoid Perl's variable interpolation. Use \\ to specify the backslash character in your pattern.
    \nnn Any Octal byte. Use zero padding for values from \000 to \077 inclusively. For larger values simply use the three-digit number (like \100 or \323).
    \a Alarm.
    \A This meta-sequence represents the beginning of the string. Its meaning is not affected by the /m option.
    \b This meta-sequence represents the backspace character inside a character class; otherwise, it represents a word boundary. A word boundary is the spot between word (\w) and non-word (\W) characters. Perl thinks that the \W meta-sequence matches the imaginary characters off the ends of the string.
    \B Match a non-word boundary.
    \cn Any control character.
    \d Match a single digit character.
    \D Match a single non-digit character.
    \e Escape.
    \E Terminate the \L or \U sequence.
    \f Form Feed.
    \G Match only where the previous m//g left off.
    \l Change the next character to lowercase.
    \L Change the following characters to lowercase until a \E sequence is encountered.
    \n Newline.
    \Q Quote Regular Expression meta-characters literally until the \E sequence is encountered.
    \r Carriage Return.
    \s Match a single whitespace character.
    \S Match a single non-whitespace character.
    \t Tab.
    \u Change the next character to uppercase.
    \U Change the following characters to uppercase until a \E sequence is encountered.
    \v Vertical Tab.
    \w Match a single word character. Word characters are the alphanumeric and underscore characters.
    \W Match a single non-word character.
    \xnn Any Hexadecimal byte.
    \Z This meta-sequence represents the end of the string. Its meaning is not affected by the /m option.
    \$ Dollar Sign.
    \@ At Sign.
    \% Percent Sign.
    Errata Note
    The \% is not a valid escape sequence for Perl. It was included erroneously. Please ignore this entry. If you need to use the % character in double-quoted strings, go ahead and use it. Later in the book, you'll read about the printf() function. If you want to actually use the % character in the printf() format string, use the %% sequence. - Randal Schwartz was kind enough to identify this error.
  • Character Sequences: A sequence of characters will match the identical sequence in the searched string. The characters need to be in the same order in both the pattern and the searched string for the match to be true. For example, m/abc/; will match "abc" but not "cab" or "bca". If any character in the sequence is a meta-character, you need to use the backslash to match its literal value.
  • Alternation: The alternation meta-character (|) will let you match more than one possible string. For example, m/a|b/; will match if either the "a" character or the "b" character is in the searched string. You can use sequences of more than one character with alternation. For example, m/dog|cat/; will match if either of the strings "dog" or "cat" is in the searched string.
    Tip
    Some programmers like to enclose the alternation sequence inside parentheses to help indicate where the sequence begins and ends.
    m/(dog|cat)/;
    However, this will affect something called pattern memory, which you'll be learning about in the section "Example: Pattern Memory" later in the chapter.
  • Character Classes: The square brackets are used to create character classes. A character class is used to match a specific type of character. For example, you can match any decimal digit using m/[0123456789]/. This will match a single character in the range of zero to nine. You can find more information about character classes in the section called "Example: Character Classes" later in this chapter.
    Errata Note
    The printed version of this book says: m/0123456789/;. The square brackets were missing and the semi-colon is extraneous since this is an example of an expression, not a statement. Randal Schwartz was kind enough to point out this problem.
  • Symbolic Character Classes: There are several character classes that are used so frequently that they have a symbolic representation. The period meta-character stands for a special character class that matches all characters except for the newline. The rest are \d, \D, \s, \S, \w, and \W. These are mentioned in Table 10.5 earlier and are discussed in the section "Example: Character Classes" later in this chapter.
  • Anchors: The caret (^) and the dollar sign ($) meta-characters are used to anchor a pattern to the beginning and the end of the searched string. The caret is always the first character in the pattern when used as an anchor. For example, m/^one/; will only match if the searched string starts with the character sequence one. The dollar sign is always the last character in the pattern when used as an anchor. For example, m/(last|end)$/; will match only if the searched string ends with either the character sequence last or the character sequence end. The \A and \Z meta-sequences are also used as pattern anchors for the beginning and end of strings.
    Errata Note
    The printed version of this book states "The caret is always the first character in the pattern when used as an anchor". However, this is not strictly true when the alternation meta-character is used. For example, /Jack$|^John/ will match when "Jack" is at the end of a string or when "John" is at the beginning of a string. Randal Schwartz was kind enough to mention that this concept needs clarification.
  • Quantifiers: There are several meta-characters that are devoted to controlling how many characters are matched. For example, m/a{5}/; means that five a characters must be found before a true result can be returned. The *, +, and ? meta-characters and the curly braces are all used as quantifiers. See the section "Example: Quantifiers" later in this chapter for more information.
  • Pattern Memory: Parentheses are used to store matched values into buffers for later recall. I like to think of this as a form of pattern memory. Some programmers call them back-references. After you use m/(fish|fowl)/; to match a string and a match is found, the variable $1 will hold either fish or fowl depending on which sequence was matched. See the section "Example: Pattern Memory" later in this chapter for more information.
  • Word Boundaries: The \b meta-sequence will match the spot between a space and the first character of a word or between the last character of a word and the space. The \b will match at the beginning or end of a string if there are no leading or trailing spaces. For example, m/\bfoo/; will match foo even without spaces surrounding the word. It will also match $foo because the dollar sign is not considered a word character. The statement m/foo\b/; will match foo but not foobar, and the statement m/\bwiz/; will match wizard but not geewiz. See the section "Example: Character Classes" later in this chapter for more information about word boundaries. The \B meta-sequence will match everywhere except at a word boundary.
  • Quoting Meta-Characters: You can match meta-characters literally by enclosing them in a \Q..\E sequence. This will let you avoid using the backslash character to escape all meta-characters, and your code will be easier to read.
  • Extended Syntax: The (?...) sequence lets you use an extended version of the regular expression syntax. The different options are discussed in the section "Example: Extension Syntax" later in this chapter.
  • Combinations: Any of the preceding components can be combined with any other to create simple or complex patterns.
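Several of the components in the preceding list can be tried out in a few lines. The following is a minimal sketch that reuses patterns quoted in the list; the sample strings, the %matched hash, and its key names are invented for the illustration.

```perl
use strict;
use warnings;

# Each value is 1 if the pattern matched the sample string and 0 otherwise.
my %matched = (
    anchor_start   => ("one two"      =~ m/^one/)        ? 1 : 0,
    anchor_end     => ("the very end" =~ m/(last|end)$/) ? 1 : 0,
    alternation    => ("hot dog"      =~ m/dog|cat/)     ? 1 : 0,
    quantifier     => ("aaaa"         =~ m/a{5}/)        ? 1 : 0,  # only four a's
    boundary_start => ("wizard"       =~ m/\bwiz/)       ? 1 : 0,
    boundary_inner => ("geewiz"       =~ m/\bwiz/)       ? 1 : 0,  # wiz is mid-word
    quoted_literal => ("a.b"          =~ m/^\Qa.b\E$/)   ? 1 : 0,
    quoted_other   => ("axb"          =~ m/^\Qa.b\E$/)   ? 1 : 0,  # the . is literal
);

foreach my $name (sort keys %matched) {
    print "$name: $matched{$name}\n";
}
```

The pairs that differ only in the sample string (the two boundary entries and the two quoted entries) show what each component rules out as well as what it accepts.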
The power of patterns is that you don't always know in advance the value of the string that you will be searching. If you need to match the first word in a string that was read in from a file, you probably have no idea how long it might be; therefore, you need to build a pattern. You might start with the \w symbolic character class, which will match any single alphanumeric or underscore character. So, assuming that the string is in the $_ variable, you can match a one-character word like this:
m/\w/;
If you need to match both a one-character word and a two-character word, you can do this:
m/\w|\w\w/;
This pattern says to match a single word character or two consecutive word characters. You could continue to add alternation components to match the different lengths of words that you might expect to see, but there is a better way. You can use the + quantifier to say that the match should succeed only if the component is matched one or more times. It is used this way:
m/\w+/;
If the value of $_ was "AAA BBB", then m/\w+/; would match the "AAA" in the string. If $_ was blank, full of whitespace, or full of other non-word characters, a false value would be returned. The preceding pattern will let you determine if $_ contains a word but does not let you know what the word is. In order to accomplish that, you need to enclose the matching components inside parentheses. For example:
m/(\w+)/;
By doing this, you force Perl to store the matched string into the $1 variable. The $1 variable can be considered as pattern memory. This introduction to pattern components describes most of the details you need to know in order to create your own patterns or regular expressions. However, some of the components deserve a bit more study. The next few sections look at character classes, quantifiers, pattern memory, pattern precedence, and the extension syntax. Then the rest of the chapter is devoted to showing specific examples of when to use the different components.
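Putting the pieces of this walk-through together gives the following sketch. The variable names are invented for the illustration, and the /g list form is included only to show that the same parentheses can collect every word.

```perl
use strict;
use warnings;

$_ = "AAA BBB";

# The parentheses store the first run of word characters in $1.
my $first_word = "";
if (m/(\w+)/) {
    $first_word = $1;
}

# With /g in list context, every captured word is returned.
my @words = m/(\w+)/g;

# A string with no word characters makes the match fail.
$_ = "   ";
my $found = m/\w+/ ? 1 : 0;

print "first word: $first_word\n";
print "all words: @words\n";
print "found in blank string: $found\n";
```

Checking the result of the match before reading $1 matters, because $1 keeps its value from the last successful match.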

Example: Character Classes

A character class defines a type of character. The character class [0123456789] defines the class of decimal digits, and [0-9a-f] defines the class of hexadecimal digits. Notice that you can use a dash to define a range of consecutive characters. Character classes let you match any of a range of characters; you don't know in advance which character will be matched. This capability to match non-specific characters is what meta-characters are all about. You can use variable interpolation inside the character class, but you must be careful when doing so. For example,
$_ = "AAABBBCCC";
$charList = "ADE";
print "matched" if m/[$charList]/;
will display
matched
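This is because the variable interpolation results in a character class of [ADE]. If you use the variable as one-half of a character range, you need to ensure that the resulting range is valid: the starting character must come before the ending character in the character set, so a range can't run from a letter to a digit. For example,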
$_ = "AAABBBCCC";
$charList = "ADE";
print "matched" if m/[$charList-9]/;
will result in the following error message when executed:
/[ADE-9]/: invalid [] range in regexp at test.pl line 4.
At times, it's necessary to match on any character except for a given character list. This is done by complementing the character class with the caret. For example,
$_ = "AAABBBCCC";
print "matched" if m/[^ABC]/;
will display nothing. This match returns true only if a character besides A, B, or C is in the searched string. If you complement a list with just the letter A,
$_ = "AAABBBCCC";
print "matched" if m/[^A]/;
then the string "matched" will be displayed because B and C are part of the string - in other words, a character besides the letter A. Perl has shortcuts for some character classes that are frequently used. Here is a list of what I call symbolic character classes:
  • \w - This symbol matches any alphanumeric character or the underscore character. For ASCII text, it is equivalent to the character class [a-zA-Z0-9_].
  • \W - This symbol matches every character that the \w symbol does not. In other words, it is the complement of \w. It is equivalent to [^a-zA-Z0-9_].
  • \s - This symbol matches any whitespace character: space, tab, newline, carriage return, or form feed. It is equivalent to [ \t\n\r\f].
  • \S - This symbol matches any non-whitespace character. It is equivalent to [^ \t\n\r\f].
  • \d - This symbol matches any digit. It is equivalent to [0-9].
  • \D - This symbol matches any non-digit character. It is equivalent to [^0-9].
You can use these symbols inside other character classes but not as endpoints of a range. For example, you can do the following:
$_ = "\tAAA";
print "matched" if m/[\d\s]/;
which will display
matched
because the value of $_ includes the tab character.
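As a small sketch of how character classes and complemented classes behave in practice, the following program tests a few strings against the hexadecimal class. The sample strings and the variable names are my own inventions for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample strings; the names are for illustration only.
my @samples = ("deadbeef", "cafe42", "xyz", "");

my @results;
foreach my $text (@samples) {
    # ^ and $ anchor the match so that every character must be in the class.
    my $isHex = ($text =~ m/^[0-9a-f]+$/) ? "hex" : "not hex";
    push(@results, $isHex);
    print("'$text' is $isHex\n");
}

# A complemented class finds the first character that is NOT a hex digit.
my $firstBad = "";
$firstBad = $& if "cafe4z2" =~ m/[^0-9a-f]/;
print("first non-hex character: $firstBad\n");
```

The empty string is rejected because the + quantifier requires at least one character from the class.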
Tip
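Most meta-characters that appear inside the square brackets that define a character class are used in their literal sense. They lose their meta-meaning. The exceptions are the caret at the start of the class, the dash between two characters, the closing bracket, and the backslash. This may be a little confusing at first. In fact, I have a tendency to forget this when evaluating patterns.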
Note
I think that most of the confusion regarding regular expressions lies in the fact that each character of a pattern might have several possible meanings. The caret could be an anchor, it could be a literal caret, or it could be used to complement a character class. Therefore, it is vital that you decide which context any given pattern character or symbol is in before assigning a meaning to it.

Example: Quantifiers

Perl provides several different quantifiers that let you specify how many times a given component must be present before the match is true. They are used when you don't know in advance how many characters need to be matched. Table 10.6 lists the different quantifiers that can be used.
Table 10.6 - The Six Types of Quantifiers
Quantifier Description
* The component must be present zero or more times.
+ The component must be present one or more times.
? The component must be present zero or one times.
{n} The component must be present n times.
{n,} The component must be present at least n times.
{n,m} The component must be present at least n times and no more than m times.
If you need to match a word whose length is unknown, you need to use the + quantifier. You can't use an * because a zero length word makes no sense. So, the match statement might look like this:
m/^\w+/;
This pattern will match "QQQ" and "AAAAA" but not "" or " BBB ". In order to account for the leading whitespace, which may or may not be at the beginning of a string, you need to use the asterisk (*) quantifier in conjunction with the \s symbolic character class in the following way:
m/^\s*\w+/;
Tip
Be careful when using the * quantifier because it can match an empty string, which might not be your intention. The pattern /b*/ will match any string - even one without any b characters.

Errata Note
The printed version of this book has the first match statement as
m/\w+/;
Notice that the pattern anchor was left out.
At times, you may need to match an exact number of components. The following match statement will be true only if five words are present in the $_ variable:
$_ = "AA AB AC AD AE";
m/^(\w+(\W+|$)){5}$/;
In this example, we are matching at least one word character followed by either one or more non-word characters or the end of the string. The end of the string is not itself a character, so \W cannot match it; the |$ alternative is what lets the last word match when nothing follows it. The {5} quantifier is used to ensure that that combination of components is present five times.
Errata Note
The printed version of the book used the pattern m/(\w+\s*){5}/; in order to match the five words. This is incorrect because \s* can match zero characters, so \w+\s* can match a single word character (remember that * matches zero or more instances of a character). Therefore m/(\w+\s*){5}/; matches "AAAAA" as well as "A A A A A".
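Another way to test for an exact number of words is to match one word and then a fixed number of "separator plus word" groups. The following is only a sketch; the countIs helper is a hypothetical name, and it assumes that leading and trailing non-word characters should be tolerated.

```perl
use strict;
use warnings;

# Returns 1 if $text consists of exactly $count words, 0 otherwise.
# The helper name countIs is hypothetical.
sub countIs {
    my ($text, $count) = @_;
    my $extra = $count - 1;

    # One word, then exactly $extra repetitions of "separator + word".
    # Leading and trailing non-word characters are allowed.
    return $text =~ m/^\W*\w+(?:\W+\w+){$extra}\W*$/ ? 1 : 0;
}

print(countIs("AA AB AC AD AE", 5), "\n");   # five words
print(countIs("AAAAA", 5), "\n");            # one long word
print(countIs("A A A A", 5), "\n");          # only four words
```

Because every repetition must begin with at least one non-word character, a single long word can no longer be counted as several words.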
The * and + quantifiers are greedy. They match as many characters as possible. This may not always be the behavior that you need. You can create non-greedy components by following the quantifier with a ?.
Use the following file specification in order to look at the * and + quantifiers more closely:
$_ = '/user/Jackie/temp/names.dat';
The regular expression .* will match the entire file specification. This can be seen in the following small program:
$_ = '/user/Jackie/temp/names.dat';
m/.*/;
print $&;
This program displays
/user/Jackie/temp/names.dat
You can see that the * quantifier is greedy. It matched the whole string. If you add the ? modifier to make the .* component non-greedy, what do you think the program would display?
$_ = '/user/Jackie/temp/names.dat';
m/.*?/;
print $&;
This program displays nothing because the smallest number of characters that the * can match is zero. If we change the * to a +, then the program will display
/
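The difference between greedy and non-greedy matching is easier to see when something follows the quantified component. Here is a short sketch using the same file specification; the variable names are mine.

```perl
use strict;
use warnings;

my $path = '/user/Jackie/temp/names.dat';

# Greedy: .* grabs as much as it can, so the capture runs
# up to the LAST slash.
my ($greedy) = $path =~ m!^/(.*)/!;

# Non-greedy: .*? grabs as little as it can, so the capture
# stops at the FIRST slash after the leading one.
my ($lazy) = $path =~ m!^/(.*?)/!;

print("greedy: $greedy\n");
print("lazy:   $lazy\n");
```

The greedy version captures user/Jackie/temp, while the non-greedy version captures only user.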
Next, let's look at the concept of pattern memory, which lets you keep bits of matched string around after the match is complete.

Example: Pattern Memory

Matching arbitrary numbers of characters is fine, but without the capability to find out what was matched, patterns would not be very useful. Perl lets you enclose pattern components inside parentheses in order to store the string that matched the components into pattern memory. You might also hear pattern memory referred to as pattern buffers. This memory persists after the match statement is finished executing so that you can assign the matched values to other variables. You saw a simple example of this earlier right after the component descriptions. That example looked for the first word in a string and stored it into the first buffer, $1. The following small program
$_ =  "AAA BBB CCC";
m/(\w+)/;
print("$1\n");
will display
AAA
You can use as many buffers as you need. Each time you add a set of parentheses, another buffer is used. The pattern matched by the first set is placed into $1. The pattern matched by the second set is placed into $2. And so on.
If you want to find all the words in the string, you need to use the /g match option. In order to find all the words, you can use a loop statement that loops until the match operator returns false.
$_ =  "AAA BBB CCC";

while (m/(\w+)/g) {
    print("$1\n");
}
The program will display
AAA
BBB
CCC
If looping through the matches is not the right approach for your needs, perhaps you need to create an array consisting of the matches.
$_ =  "AAA BBB CCC";
@matches = m/(\w+)/g;
print("@matches\n");
The program will display
AAA BBB CCC
Perl also has a few special variables to help you know what matched and what did not. These variables will occasionally save you from having to add parentheses to find information.
  • $+ - This variable is assigned the value matched by the last set of parentheses that took part in the match.
  • $& - This variable is assigned the value of the entire matched string. If the match is not successful, then $& retains its value from the last successful match.
  • $` - This variable is assigned everything in the searched string that is before the matched string.
  • $' - This variable is assigned everything in the search string that is after the matched string.
Tip
If you need to save the value of the matched strings stored in the pattern memory, make sure to assign them to other variables. Pattern memory is local to the enclosing block and lasts only until another successful match is done.
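A convenient way to save the pattern memory is to perform the match in list context, which returns the captured values directly. This sketch (with variable names of my own choosing) also shows a later match overwriting $1.

```perl
use strict;
use warnings;

my $text = "AAA BBB CCC";

# In list context a successful match returns the captured values,
# so they can be copied into named variables in one step.
my ($first, $second) = $text =~ m/(\w+)\s+(\w+)/;

# A later successful match overwrites $1, but the copies are safe.
"ZZZ" =~ m/(\w+)/;
my $nowInBuffer = $1;

print("first=$first second=$second buffer=$nowInBuffer\n");
```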

Example: Pattern Precedence

Pattern components have an order of precedence just as operators do. If you see the following pattern:
m/a|b+/
it's hard to tell if the pattern should be
 m/(a|b)+/  # match any sequence of  "a" and "b" characters
             # in any order.
or
m/a|(b+)/   # match either the "a" character or the "b" character
            # repeated one or more times.
The order of precedence shown in Table 10.7 is designed to solve problems like this. By looking at the table, you can see that quantifiers have a higher precedence than alternation. Therefore, the second interpretation is correct.
Table 10.7 - The Pattern Component Order of Precedence
Precedence Level Component
1 Parentheses
2 Quantifiers
3 Sequences and Anchors
4 Alternation

Tip
You can use parentheses to affect the order that components are evaluated because they have the highest precedence. However, unless you use the extended syntax, you will be affecting the pattern memory.
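To see the precedence rules at work, you can compare what the two forms actually match. This is a minimal sketch; the test string and variable names are assumptions of mine.

```perl
use strict;
use warnings;

my $text = "abab";

# Quantifiers bind tighter than alternation, so this is a|(b+):
# the match stops after the single "a".
$text =~ m/a|b+/;
my $ungrouped = $&;

# Non-capturing parentheses force (a|b)+ without using $1.
$text =~ m/(?:a|b)+/;
my $grouped = $&;

print("ungrouped: $ungrouped\n");
print("grouped:   $grouped\n");
```

The first match returns only a, while the grouped version returns abab.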

Example: Extension Syntax

The regular expression extensions are a way to significantly add to the power of patterns without adding a lot of meta-characters to the proliferation that already exists. By using the basic (?...) notation, the regular expression capabilities can be greatly extended. This section covers five extensions; later versions of Perl recognize more, such as look-behind assertions. These vary widely in functionality - from adding comments to setting options. Table 10.8 lists the extensions and gives a short description of each.
Table 10.8 - Five Extension Components
Extension Description
(?# TEXT) This extension lets you add comments to your regular expression. The TEXT value is ignored.
(?:...) This extension lets you add parentheses to your regular expression without causing a pattern memory position to be used.
(?=...) This extension lets you match values without including them in the $& variable.
(?!...) This extension lets you specify what should not follow your pattern. For instance, /blue(?!bird)/ means that "bluebox" and "bluesy" will be matched but not "bluebird".
(?sxi) This extension lets you specify an embedded option in the pattern rather than adding it after the last delimiter. This is useful if you are storing patterns in variables and using variable interpolation to do the matching.
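The embedded-option extension is easiest to appreciate with a pattern stored in a variable. In this sketch, the pattern text and the sample strings are hypothetical.

```perl
use strict;
use warnings;

# (?i) travels with the pattern text, so whoever uses the
# pattern does not need to remember to add the /i option.
my $pattern = '(?i)^perl';

my @hits = grep { m/$pattern/ } ("Perl rocks", "PERL 5", "I like perl");

print(join(", ", @hits), "\n");
```

Only the first two strings are selected; the third contains perl, but not at the start of the string.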
By far the most useful feature of extended mode, in my opinion, is the ability to add comments directly inside your patterns. For example, would you rather see a pattern that looks like this:
# Match a string with two words. $1 will be the
# first word. $2 will be the second word.
m/^\s*(\w+)\W+(\w+)\s*$/;
or one that looks like this:
m/
    (?# This pattern will match any string with two)
    (?# and only two words in it. The matched words)
    (?# will be available in $1 and $2 if the match)
    (?# is successful.)

    ^      (?# Anchor this match to the beginning)
           (?# of the string)

    \s*    (?# skip over any whitespace characters)
           (?# use the * because there may be none)

    (\w+)  (?# Match the first word, we know it's)
           (?# the first word because of the anchor)
           (?# above. Place the matched word into)
           (?# pattern memory.)

    \W+    (?# Match at least one non-word)
           (?# character, there may be more than one)

    (\w+)  (?# Match another word, put into pattern)
           (?# memory also.)

    \s*    (?# skip over any whitespace characters)
           (?# use the * because there may be none)

    $      (?# Anchor this match to the end of the)
           (?# string. Because both ^ and $ anchors)
           (?# are present, the entire string will)
           (?# need to match the pattern. A)
           (?# sub-string that fits the pattern will)
           (?# not match.)
/x;
Of course, the commented pattern is much longer, but they take the same amount of time to execute. In addition, it will be much easier to maintain the commented pattern because each component is explained. When you know what each component is doing in relation to the rest of the pattern, it becomes easy to modify its behavior when the need arises. Extensions also let you change the order of evaluation without affecting pattern memory. For example,
m/(?:a|b)+/;
matches the a or b characters repeated one or more times in any order. The pattern memory will not be affected. At times, you might like to include a pattern component in your pattern without including it in the $& variable that holds the matched string. The technical term for this is a zero-width positive look-ahead assertion. You can use this to ensure that the string following the matched component is correct without affecting the matched value. For example, if you have some data that looks like this:
David    Veterinarian 56
Jackie  Orthopedist 34
Karen Veterinarian 28
and you want to find all veterinarians and store the value of the first column, you can use a look-ahead assertion. This will do both tasks in one step. For example:
while (<>) {
    push(@array, $&) if m/^\w+(?=\s+Vet)/;
}

print("@array\n");
This program will display:
David Karen
Let's look at the pattern with comments added using the extended mode. In this case, it doesn't make sense to add comments directly to the pattern because the pattern is part of the if statement modifier. Adding comments in that location would make the comments hard to format. So let's use a different tactic.
$pattern = '^\w+     (?# Match the first word in the string)

            (?=\s+   (?# Use a look-ahead assertion to match)
                     (?# one or more whitespace characters)

               Vet)  (?# In addition to the whitespace, make)
                     (?# sure that the next column starts)
                     (?# with the character sequence "Vet")
           ';

while (<>) {
    push(@array, $&) if m/$pattern/x;
}

print("@array\n");
Here we used a variable to hold the pattern and then used variable interpolation in the pattern with the match operator. You might want to pick a more descriptive variable name than $pattern, however.
Tip
A look-ahead assertion does not have to be the last pattern component, and a pattern can contain more than one. Placing a single assertion at the end of the pattern, as in this example, is simply the most common use.
The last extension that we'll discuss is the zero-width negative look-ahead assertion. This type of component is used to specify values that shouldn't follow the matched string. For example, using the same data as in the previous example, you can look for everyone who is not a veterinarian. Your first inclination might be to simply replace the (?=...) with the (?!...) in the previous example.
while (<>) {
    push(@array, $&) if m/^\w+(?!\s+Vet)/;
}

print("@array\n");
Unfortunately, this program displays
Davi Jackie Kare
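which is not what you need. The problem is backtracking. For the first record, \w+ initially matches David, but the negative assertion fails because whitespace and Vet follow. Perl then backs up one character so that \w+ matches only Davi; at that position the next character is d, not whitespace followed by Vet, so the assertion succeeds and the truncated name is returned. In order to correctly match the first word, you need to explicitly tell Perl that the first word ends at a word boundary, like this: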
while (<>) {
    push(@array, $&) if m/^\w+\b(?!\s+Vet)/;
}

print("@array\n");
This program displays
Jackie
which is correct.
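If you would like to experiment with both assertions without creating a data file, the following sketch keeps the same three records in an array. The array names are my own.

```perl
use strict;
use warnings;

# The same three records, held in an array instead of read from a file.
my @records = (
    "David    Veterinarian 56",
    "Jackie  Orthopedist 34",
    "Karen Veterinarian 28",
);

my (@vets, @others);
foreach my $record (@records) {
    # Positive look-ahead: the name must be followed by "Vet".
    push(@vets, $&) if $record =~ m/^\w+(?=\s+Vet)/;

    # Negative look-ahead: \b stops \w+ from backtracking
    # into a shorter name.
    push(@others, $&) if $record =~ m/^\w+\b(?!\s+Vet)/;
}

print("vets:   @vets\n");
print("others: @others\n");
```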
Tip
There are many ways of matching any value. If the first method you try doesn't work, try breaking the value into smaller components and match each boundary. If all else fails, you can always ask for help on the comp.lang.perl.misc newsgroup.

Pattern Examples

In order to demonstrate many different patterns, I will depart from the standard example format in this section. Instead, I will explain a matching situation in italicized text and then a possible resolution will immediately follow. After the resolution, I'll add some comments to explain how the match is done. In all of these examples, the string to search will be in the $_ variable.

Example: Using the Match Operator

  • If you need to find repeated characters in a string like the AA in "ABC AA ABC", then do this:
    m/(.)\1/;
    This pattern uses pattern memory to store a single character. Then a back-reference (\1) is used to repeat the first character. The back-reference is used to reference the pattern memory while still inside the pattern. Anywhere else in the program, use the $1 variable. After this statement, $1 will hold the repeated character. This pattern will match any non-newline character that appears twice in a row.
  • If you need to find the first word in a string, then do this:
    m/^\s*(\w+)/;
    After this statement, $1 will hold the first word in the string. Any whitespace at the beginning of the string will be skipped by the \s* meta-character sequence. Then the \w+ meta-character sequence will match the next word. Note that the * - which matches zero or more - is used to match the whitespace because there may not be any. The + - which matches one or more - is used for the word.
  • If you need to find the last word in a string, then do this:
    m/
        (\w+)      (?# Match a word, store its value into pattern memory)
    
        [.!?]?     (?# Some strings might hold a sentence. If so, this)
                   (?# component will match zero or one punctuation)
                   (?# characters)
    
        \s*        (?# Match trailing whitespace using the * because there)
                   (?# might not be any)
    
        $          (?# Anchor the match to the end of the string)
    /x;
    After this statement, $1 will hold the last word in the string. You may need to expand the character class, [.!?], by adding more punctuation.
  • If you need to know that there are only two words in a string, you can do this:
    m/^(\w+)\W+(\w+)$/x;
    After this statement, $1 will hold the first word and $2 will hold the second word, assuming that the pattern matches. The pattern starts with a caret and ends with a dollar sign, which means that the entire string must match the pattern. The \w+ meta-character sequence matches one word. The \W+ meta-character sequence matches the non-word characters, such as whitespace, between words. You can test for additional words by adding one \W+(\w+) meta-character sequence for each additional word to match.
  • If you need to know that there are only two words in a string while ignoring leading or trailing spaces, you can do this:
    m/^\s*(\w+)\W+(\w+)\s*$/;
    After this statement, $1 will hold the first word and $2 will hold the second word, assuming that the pattern matches. The \s* meta-character sequence will match any leading or trailing whitespace.
  • If you need to assign the first two words in a string to $one and $two and the rest of the string to $rest, you can do this:
    $_ = "This is the way to San Jose.";
    
    $word   = '\w+';    # match a whole word.
    
    $space  = '\W+';    # match at least one non-word character
    
    $string = '.*';     # match any number of anything except
                        # for the newline character.
    
    ($one, $two, $rest) = (m/^($word) $space ($word) $space ($string)/x);
    After this statement, $one will hold the first word, $two will hold the second word, and $rest will hold everything else in the $_ variable. This example uses variable interpolation to, hopefully, make the match pattern easier to read. This technique also emphasizes which meta-sequence is used to match words and whitespace. It lets the reader focus on the whole of the pattern rather than the individual pattern components by adding a level of abstraction.
  • If you need to see if $_ contains a legal Perl variable name, you can do this:
    $result = m/
                ^          (?# Anchor the pattern to the start of the string)
    
                [\$\@\%]   (?# Use a character class to match the first)
                           (?# character of a variable name)
    
                [a-z_]     (?# Use a character class to ensure that the first)
                           (?# character of the name is a letter or underscore)
    
                \w*        (?# Use a character class to ensure that the)
                           (?# rest of the variable name is either an)
                           (?# alphanumeric or an underscore character)
    
                $          (?# Anchor the pattern to the end of the)
                           (?# string. This means that for the pattern to)
                           (?# match, the variable name must be the only)
                           (?# value in $_.)
    
              /ix;         # Use the /i option so that the search is
                           # case-insensitive and use the /x option to
                           # allow extensions.
    After this statement, $result will be true if $_ contains a legal variable name and false if it does not.
  • If you need to see if $_ contains a legal integer literal, you can do this:
    $result = m/
                (?# First check for just numbers in $_)
    
                ^         (?# Anchor to the start of the string)
                \d+       (?# Match one or more digits)
                $         (?# Anchor to the end of the string)
    
                |         (?# or)
    
               (?# Now check for hexadecimal numbers)
    
                ^         (?# Anchor to the start of the string)
                0x        (?# The "0x" sequence starts a hexadecimal number)
                [\da-f]+  (?# Match one or more hexadecimal characters)
                $         (?# Anchor to the end of the string)
              /ix;
    
    After this statement, $result will be true if $_ contains an integer literal and false if it does not.
  • If you need to match all legal integers in $_, you can do this:
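    @results = m/\b(?:0x[\da-f]+|\d+)\b/gi;
    After this statement, @results will contain a list of all integer literals in $_. @results will contain an empty list if no literals were found. The anchors used in the previous example are replaced by word boundaries (\b) so that literals can be found anywhere in the string, and the hexadecimal alternative is listed first so that the leading 0 of a hexadecimal number is not tried as a decimal number on its own.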
  • If you need to match the end of the first word in a string, you can do this:
    m/\w\W/;
    After this statement is executed, $& will hold the last character of the first word and the character that follows it. If you want only the last character, use pattern memory, m/(\w)\W/;. Then $1 will be equal to the last character of the first word. If you use the global option with pattern memory, @array = m/(\w)\W/g;, then you can create an array that holds the last character of each word that is followed by a non-word character. A word at the very end of the string has nothing after it and is not matched.
  • If you need to match the start of the second word in a string, you can do this:
    m/\W\w/;
    After this statement, $& will hold the first character of the second word and the whitespace character that immediately precedes it. While this pattern is the opposite of the pattern that matches the end of words, it will not match the beginning of the first word! This is because of the \W meta-character. Simply adding a * meta-character to the pattern after the \W does not help, because then it would match on zero non-word characters and therefore match every word character in the string.
  • If you need to match the file name in a file specification, you can do this:
    $_ = '/user/Jackie/temp/names.dat';
    m!^.*/(.*)!;
    After this match statement, $1 will equal names.dat. The match is anchored to the beginning of the string, and the .* component matches everything up to the last slash because regular expressions are greedy. Then the next (.*) matches the file name and stores it into pattern memory. You can store the file path into pattern memory by placing parentheses around the first .* component.
  • If you need to match two prefixes and one root word, like "rockfish" and "monkfish," you can do this:
    m/(?:rock|monk)fish/x;
    The alternation meta-character (|) is used to say that either rock or monk followed by fish needs to be found. If you need to know which alternative was found, then use regular parentheses in the pattern. After the match, $1 will be equal to either rock or monk.
  • If you want to search a file for a string and print some of the surrounding lines, you can do this:
    # read the whole file into memory.
    open(FILE, "<fndstr.dat") or die("Cannot open fndstr.dat: $!");
    @array = <FILE>;
    close(FILE);
    
    # specify which string to find.
    $stringToFind = "A";
    
    # iterate over the array looking for the
    # string. The $#array notation is used to
    # determine the number of elements in the
    # array.
    for ($index = 0; $index <= $#array; $index++) {
        last if $array[$index] =~ /$stringToFind/;
    }
    
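    # Use $index to print up to two lines before
    # and two lines after the line that contains
    # the match. The range is clamped so that it
    # stays inside the array, and each line is
    # labeled with its own index.
    $first = $index - 2 < 0       ? 0       : $index - 2;
    $last  = $index + 2 > $#array ? $#array : $index + 2;
    foreach $lineNum ($first..$last) {
        print("$lineNum: $array[$lineNum]");
    }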
    There are many ways to perform this type of search, and this is just one of them. This technique is only good for relatively small files because the entire file is read into memory at once. In addition, the program assumes that the input file always contains the string that you are looking for.
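Two of the match examples above can be tried out together in a short program. This is only a sketch; the variable names are assumptions, and the file specification is the same one used earlier.

```perl
use strict;
use warnings;

# Split a file specification into directory and file name.
# The first .* is greedy, so it runs up to the last slash.
my $spec = '/user/Jackie/temp/names.dat';
my ($dir, $file) = $spec =~ m!^(.*)/(.*)!;
print("dir=$dir file=$file\n");

# Find the first doubled character with a back-reference.
my $doubled = "";
$doubled = $1 if "ABC AA ABC" =~ m/(.)\1/;
print("doubled character: $doubled\n");
```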

Example: Using the Substitution Operator

  • If you need to remove whitespace from the beginning of a string, you can do this:
    s/^\s+//;
    This pattern uses the \s predefined character class to match any whitespace character. The plus sign means to match one or more whitespace characters, and the caret means match only at the beginning of the string.
  • If you need to remove whitespace from the end of a string, you can do this:
    s/\s+$//;
    This pattern uses the \s predefined character class to match any whitespace character. The plus sign means to match one or more whitespace characters, and the dollar sign means match only at the end of the string.
  • If you need to add a prefix to a string, you can do this:
    $prefix = "A";
    s/^(.*)/$prefix$1/;
    When the substitution is done, the value in the $prefix variable will be added to the beginning of the $_ variable. This is done by using variable interpolation and pattern memory. Of course, you might also consider using the string concatenation operator; for instance, $_ = "A" . $_;, which is probably faster.
  • If you need to add a suffix to a string, you can do this:
    $suffix = "Z";
    s/^(.*)/$1$suffix/;
    When the substitution is done, the value in the $suffix variable will be added to the end of the $_ variable. This is done by using variable interpolation and pattern memory. Of course, you might also consider using the string concatenation operator; for instance, $_ .= "Z";, which is probably faster.
  • If you need to reverse the first two words in a string, you can do this:
    s/^\s*(\w+)\W+(\w+)/$2 $1/;
    This substitution statement uses the pattern memory variables $1 and $2 to reverse the first two words in a string. You can use a similar technique to manipulate columns of information, the last two words, or even to change the order of more than two matches.
  • If you need to duplicate each character in a string, you can do this:
    s/\w/$& x 2/eg;
    When the substitution is done, each word character in $_ will be repeated. If the original string was "123abc", the new string would be "112233aabbcc". The e option is used to force evaluation of the replacement string. The $& special variable is used in the replacement pattern to reference the matched string, which is then repeated by the string repetition operator.
  • If you need to capitalize all the words in a sentence, you can do this:
    s/(\w+)/\u$1/g;
    When the substitution is done, each word in $_ will have its first letter capitalized. The /g option means that each word - the \w+ meta-sequence - will be matched and placed in $1. Then it will be replaced by \u$1. The \u will capitalize the first character of whatever follows it; in this case, it's the matched word.
  • If you need to insert a string between two repeated characters, you can do this:
    $_      = "!!!!";
    $char   = "!";
    $insert = "AAA";
    
    s{
        ($char)             # look for the specified character.
    
        (?=$char)           # look for it again, but don't include
                            # it in the matched string, so the next
    }                       # search will also find it.
    {
        $char . $insert     # concatenate the specified character
                            # with the string to insert.
    
    }xeg;                   # use extended mode, evaluate the
                            # replacement pattern, and match all
                            # possible strings.
    
    print("$_\n");
    This example uses the extended mode to add comments directly inside the regular expression. This makes it easy to relate the comment directly to a specific pattern element. The match pattern does not directly reflect the originally stated goal of inserting a string between two repeated characters. Instead, the example was quietly restated. The new goal is to substitute all instances of $char with $char . $insert, if $char is followed by $char. As you can see, the end result is the same. Remember that sometimes you need to think outside the box.
  • If you need to do a second level of variable interpolation in the replacement pattern, you can do this:
    s/(\$\w+)/$1/eeg;
    This is a simple example of secondary variable interpolation. If $firstVar = "AAA" and $_ = '$firstVar', then $_ would be equal to "AAA" after the substitution was made. The key is that the replacement pattern is evaluated twice. This technique is very powerful. It can be used to develop error messages used with variable interpolation.
    $errMsg = "File too large";
    $fileName = "DATA.OUT";
    $_ = 'Error: $errMsg for the file named $fileName';
    s/(\$\w+)/$1/eeg;
    print;
    When this program is run, it will display
    Error: File too large for the file named DATA.OUT
    The values of the $errMsg and $fileName variables were interpolated into the replacement pattern as needed.
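Several of the substitutions above are often chained together to tidy up a string. The following sketch assumes a made-up input string and shows that s/// returns the number of substitutions it made.

```perl
use strict;
use warnings;

my $text = "   hello regular   expressions  ";

$text =~ s/^\s+//;                      # remove leading whitespace
$text =~ s/\s+$//;                      # remove trailing whitespace
my $count = ($text =~ s/\s+/ /g);       # collapse runs of whitespace
$text =~ s/(\w+)/\u$1/g;                # capitalize each word

print("'$text' ($count runs collapsed)\n");

# Copy a string and duplicate each word character in the copy.
(my $doubled = "123abc") =~ s/\w/$& x 2/eg;
print("$doubled\n");
```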

Example: Using the Translation Operator

  • If you need to count the number of times a given letter appears in a string, you can do this:
    $cnt = tr/Aa//;
    After this statement executes, $cnt will hold the number of times the letter a appears in $_. The tr operator does not have an option to ignore the case of the string, so both upper- and lowercase need to be specified.
  • If you need to turn the high bit off for every character in $_, you can do this:
    tr [\200-\377] [\000-\177];
    This statement uses the square brackets to delimit the character lists. Notice that spaces can be used between the pairs of brackets to enhance readability of the lists. The octal values are used to specify the character ranges. The translation operator is more efficient - in this instance - than using logical operators and a loop statement. This is because the translation can be done by creating a simple lookup table.
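The counting idiom works because tr returns the number of characters it matched, and an empty replacement list leaves the string unchanged. Here is a brief sketch with a sample sentence of my own.

```perl
use strict;
use warnings;

my $text = "A man, a plan, a canal: Panama";

# With an empty replacement list, tr only counts; $text is not modified.
my $count = ($text =~ tr/Aa//);

# Copy the string, then translate the copy to uppercase.
(my $upper = $text) =~ tr/a-z/A-Z/;

print("count of a/A: $count\n");
print("$upper\n");
```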

Example: Using the Split() Function

  • If you need to split a string into words, you can do this:
    s/^\s+//;
    @array = split;
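    After this statement executes, @array will be an array of words. When split is called with no arguments, it splits $_ on whitespace and skips any leading whitespace, so the s/^\s+//; statement is a precaution rather than a requirement here. It does matter if you split on an explicit pattern such as /\s+/: leading whitespace would then produce an empty string as the first element in the array, and this is probably not what you want.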
  • If you need to split a string contained in $line instead of $_ into words, you can do this:
    $line =~ s/^\s+//;
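    @array = split(/\W+/, $line);
    After these statements execute, @array will be an array of words. The \W+ pattern treats any run of non-word characters (spaces, commas, and so on) as a single separator; a plain \W would create an empty element between adjacent separators such as a comma followed by a space.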
  • If you need to split a string into characters, you can do this:
    @array = split(//);
    After this statement executes, @array will be an array of characters. split recognizes the empty pattern as a request to make every character into a separate array element.
  • If you need to split a string into fields based on a delimiter sequence of characters, you can do this:
    @array = split(/:/);
    @array will be an array of strings consisting of the values between the delimiters. If there are repeated delimiters - :: in this example - then an empty array element will be created. Use /:+/ as the delimiter to match in order to eliminate the empty array elements.
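The split variations in this list can be compared side by side. This is a minimal sketch; the sample strings and array names are assumptions made for illustration.

```perl
use strict;
use warnings;

my $line = "  the quick, brown fox";

# The special pattern ' ' (which a bare split also uses) skips leading whitespace.
my @words = split ' ', $line;              # ("the", "quick,", "brown", "fox")

# An explicit /\s+/ pattern keeps an empty leading field instead.
my @fields = split /\s+/, $line;           # ("", "the", "quick,", "brown", "fox")

# /\W+/ treats each run of non-word characters as one separator.
(my $trimmed = $line) =~ s/^\s+//;
my @plain = split /\W+/, $trimmed;         # ("the", "quick", "brown", "fox")

# The empty pattern splits between every character.
my @chars = split //, "abc";               # ("a", "b", "c")

# /:/ keeps the empty field between "::", while /:+/ drops it.
my @with_empty    = split /:/,  "a::b:c";  # ("a", "", "b", "c")
my @without_empty = split /:+/, "a::b:c";  # ("a", "b", "c")

print scalar(@words), " ", scalar(@fields), " ", scalar(@plain), "\n";   # 4 5 4
```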

Summary

This chapter introduced you to regular expressions or patterns, regular expression operators, and the binding operators. There are three regular expression operators - m//, s///, and tr/// - which match, substitute, and translate, respectively. All three use the $_ variable as the default operand. The binding operators, =~ and !~, are used to bind the regular expression operators to a variable other than $_. While the slash character is the default pattern delimiter, you can use almost any other non-alphanumeric, non-whitespace character in its place. This feature is useful if the pattern contains the slash character. If you use an opening bracket or parenthesis as the beginning delimiter, use the closing bracket or parenthesis as the ending delimiter. Using the single-quote as the delimiter will turn off variable interpolation for the pattern.
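As a brief illustration of the delimiter rules just summarized, here is a minimal sketch. The $path and $name variables are assumptions made for illustration.

```perl
use strict;
use warnings;

my $path = "/usr/local/bin";
my $name = "bin";

# Braces as delimiters avoid having to escape the slashes in the pattern.
my $has_local = ($path =~ m{/local/}) ? 1 : 0;

# With slashes, $name is interpolated, so the pattern is "bin" anchored at the end.
my $interpolated = ($path =~ m/$name$/) ? 1 : 0;

# With single quotes, nothing is interpolated. The regex engine sees the raw
# text $name$, where each $ is an end-of-line anchor, so this cannot match.
my $raw = ($path =~ m'$name$') ? 1 : 0;

print "$has_local $interpolated $raw\n";    # 1 1 0
```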
The matching operator has six options: /g, /i, /m, /o, /s, and /x. These options were described in Table 10.2. I've found that the /x option is very helpful for creating maintainable, commented programs. The /g option, used to find all matches in a string, is also very useful. And, of course, the capability to create case-insensitive patterns using the /i option is crucial in many cases.
The substitution operator has the same options as the matching operator and one more - the /e option. The /e option lets you evaluate the replacement pattern as a Perl expression and use the resulting value as the replacement string. Note that Perl 5 treats back-quotes as ordinary delimiters: unlike Perl 4, it does not execute the replacement pattern as a DOS or UNIX command. If you want the output of a command to become the replacement string, use a back-quoted command inside the replacement together with the /e option.
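A minimal sketch of the /e option follows. The sample string and the $prices and $doubled names are assumptions made for illustration.

```perl
use strict;
use warnings;

my $prices = "apple 3, pear 5";

# /e evaluates the replacement as a Perl expression;
# /g applies it to every match rather than just the first.
(my $doubled = $prices) =~ s/(\d+)/$1 * 2/eg;

print "$doubled\n";    # apple 6, pear 10
```

Copying into $doubled before substituting leaves the original $prices string untouched.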
The translation operator has three options: /c, /d, and /s. These options are used to complement the match character list, delete matched characters that have no corresponding character in the replacement list, and squeeze runs of the same translated character down to a single character. The operator always returns the number of characters matched. If no replacement list is specified and /d is not used, the string is left unchanged, which is handy if you need to know how many times a given character appears in a string.
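The three translation options can be seen in a short sketch. The sample strings and variable names are assumptions made for illustration.

```perl
use strict;
use warnings;

my $greeting = "hello, world!";

# /c complements the list: count every character that is NOT a letter.
my $non_letters = ($greeting =~ tr/a-zA-Z//c);           # 3

# /d deletes matched characters that have no replacement.
(my $no_vowels = $greeting) =~ tr/aeiou//d;              # "hll, wrld!"

# /c and /d together delete everything outside the list.
(my $letters_only = $greeting) =~ tr/a-zA-Z//cd;         # "helloworld"

# /s squeezes each run of a translated character into one.
(my $squeezed = "hello    world") =~ tr/ //s;            # "hello world"

print "$non_letters|$no_vowels|$letters_only|$squeezed\n";
```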
The binding operators are used to force the matching, substitution, and translation operators to search a variable other than $_. The =~ operator can be used with all three of the regular expression operators, while the !~ operator, which negates the result, is normally used only with the matching operator.
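A minimal sketch of the binding operators follows, assuming a hypothetical $phone variable in place of $_.

```perl
use strict;
use warnings;

my $phone = "555-1234";

# =~ binds the match to $phone instead of $_.
my $looks_valid = ($phone =~ m/^\d{3}-\d{4}$/) ? 1 : 0;   # 1

# !~ is true when the pattern does NOT match.
my $no_letters = ($phone !~ m/[a-z]/i) ? 1 : 0;           # 1

# =~ also works with substitution and translation.
(my $dotted = $phone) =~ s/-/./;                          # "555.1234"
(my $digits = $phone) =~ tr/-//d;                         # "5551234"

print "$looks_valid $no_letters $dotted $digits\n";
```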
Quite a bit of space was devoted to creating patterns, and the topic deserves even more space. This is easily one of the more involved features of the Perl language. One key concept is that a character can have multiple meanings. For example, the plus sign can mean a plus sign in one instance (its literal meaning), and in another it means match something one or more times (its meta-meaning).
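The two meanings of the plus sign can be contrasted in a short sketch. The $expr string and variable names are assumptions made for illustration.

```perl
use strict;
use warnings;

my $expr = "1+1=2";

# Escaped, the plus sign is a literal character: matches the text "1+1".
my $literal = ($expr =~ /1\+1/) ? 1 : 0;

# Unescaped, it is a quantifier: "one or more 1s followed by a 1",
# which needs at least "11" at the start of the string, so there is no match here.
my $quantifier = ($expr =~ /^1+1/) ? 1 : 0;

print "$literal $quantifier\n";    # 1 0
```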
You learned about regular expression components and that they can be combined in an infinite number of ways. Table 10.5 listed most of the meta-meanings for different characters. You read about character classes, alternation, quantifiers, anchors, pattern memory, word boundaries, and extended components.
The last section of the chapter was devoted to presenting numerous examples of how to use regular expressions to accomplish specific goals. Each situation was described, and a pattern that matched that situation was shown. Some commentary was given for each example.
In the next chapter, you'll read about how to present information by using formats. Formats are used to help relieve some of the programming burden from the task of creating reports.

Review Questions

  1. Can you use variable interpolation with the translation operator?
  2. What happens if the pattern is empty?
  3. What variable does the substitution operator use as its default?
  4. Will the following line of code work?
     m{.*];
  5. What is the /g option of the substitution operator used for?
  6. What does the \d meta-character sequence mean?
  7. What is the meaning of the dollar sign in the following pattern?
    /AA[.<]$]ER/
  8. What is a word boundary?
  9. What will be displayed by the following program?
    $_ = 'AB AB AC';
    print m/c$/i;

Review Exercises

  1. Write a pattern that matches either "top" or "topgun".
  2. Write a program that accepts input from STDIN and changes all instances of the letter a into the letter b.
  3. Write a pattern that stores the first character to follow a tab into pattern memory.
  4. Write a pattern that matches the letter g between 3 and 7 times.
  5. Write a program that finds repeated words in an input file and prints the repeated word and the line number on which it was found.
  6. Create a character class for octal numbers.
  7. Write a program that uses the translation operator to remove repeated instances of the tab character and then replaces the tab character with a space character.
  8. Write a pattern that matches either "top" or "topgun" using a zero-width positive look-ahead assertion.
Reference: http://affy.blogspot.com/p5be/ch10.htm#The%20Matching%20Operator%20%28m//%29
