You can use a
regular expression to find patterns in strings: for
example, to look for a specific name in a phone list or all of the names that
start with the letter
a. Pattern matching is one of Perl's most powerful
and probably least understood features. But after you read this chapter, you'll
be able to handle regular expressions almost as well as a Perl guru. With a
little practice, you'll be able to do some incredibly handy things.
There are three main uses for regular expressions in Perl: matching,
substitution, and translation. The matching operation uses the
m//
operator, which evaluates to a true or false value. The substitution operation
substitutes one expression for another; it uses the
s/// operator. The
translation operation translates one set
of characters to another and uses the
tr/// operator. These operators are summarized in Table 10.1.
Table 10.1 - Perl's Regular Expression Operators
Operator | Description
m/PATTERN/ | This operator returns true if PATTERN is found in $_.
s/PATTERN/REPLACEMENT/ | This operator replaces the substring matched by PATTERN with REPLACEMENT.
tr/CHARACTERS/REPLACEMENTS/ | This operator replaces characters specified by CHARACTERS with the characters in REPLACEMENTS.
All three regular expression operators work with $_ as the string to
search. You can use the binding operators (see the section "The Binding
Operators (=~ and !~)" later in this chapter) to search a variable other
than $_.
Both the matching (
m//) and the substitution (
s///)
operators perform variable interpolation on the
PATTERN and
REPLACEMENT strings. This comes in handy if you need to read the
pattern from the keyboard or a file.
If the match pattern evaluates to the empty string, the last successfully
matched pattern is used instead. So, if you see a statement like
print if //; in a Perl program, look for the previous regular expression
operator that matched to see what the pattern really is. The substitution
operator also uses this interpretation of the empty pattern.
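As a minimal sketch of the empty-pattern rule, the following reuses the pattern from an earlier successful match. The sample strings and the @hits array are made up for illustration and do not come from the chapter.

```perl
use strict;
use warnings;

# Hypothetical sample data, used only for this sketch.
my @lines = ("apple pie", "banana split", "apple tart");
my @hits;

# A successful match; Perl remembers this pattern.
$_ = "apple";
my $primed = m/apple/;

for (@lines) {
    # The empty pattern reuses the last successfully matched pattern (apple).
    push @hits, $_ if m//;
}

print "$_\n" for @hits;
```

Because the empty pattern silently depends on earlier code, spelling the pattern out is usually easier to maintain.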
In this chapter, you learn about pattern delimiters and then about each type
of regular expression operator. After that, you learn how to create patterns in
the section "How to Create Patterns". Then, the "Pattern Examples" section
shows you some situations and how regular expressions can be used to resolve
them.
Every regular
expression operator allows the use of alternative
pattern delimiters. A
delimiter marks the beginning and end of a given pattern. In the
following statement,
m//;
you see two of the standard delimiters - the slashes (//). However, you can
use almost any character that is not a letter, a digit, or whitespace as the
delimiter. This feature is useful if you want to use the slash character
inside your pattern. For instance, to match a file name you would normally
use:
m/\/root\/home\/random.dat/
This match statement is hard to read
because all of the slashes seem to run together (some programmers say they look
like teepees). If you use an alternate delimiter, it might look like this:
m!/root/home/random.dat!
or
m{/root/home/random.dat}
You can see that these examples are a
little clearer. The last example also shows that if a left bracket is used as
the starting delimiter, then the ending delimiter must be the right bracket.
Errata
Note |
The printed version of this book shows the above
examples as m!\/root\/home\/random.dat! and as
m{\/root\/home\/random.dat}. While I was writing the book it did
not occur to me that the / character was not a metacharacter and only
needed to be escaped because of the delimiters. Obviously, if the /
character is the delimiter, it needs to be escaped in order to use it
inside the pattern. However, if an alternative delimiter is used, it no
longer needs to be escaped. - this fact was pointed out to me by Garen
Deve. |
Both the match and substitution operators let you use variable interpolation.
You can take advantage of this to use a single-quoted string that does not
require the slash to be escaped. For instance:
$file = '/root/home/random.dat';
m/$file/;
You might find that this technique yields clearer code than
simply changing the delimiters.
If you choose the single quote as your delimiter character, then no variable
interpolation is performed on the pattern. However, you still need to use the
backslash character to escape any of the meta-characters discussed in the
"How to Create Patterns" section later in this chapter.
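The difference between slash and single-quote delimiters can be sketched as follows. The variable names and sample strings here are assumptions chosen for the illustration.

```perl
use strict;
use warnings;

my $word = "price";    # hypothetical variable used only for this sketch

# Slash delimiters: $word is interpolated, so the pattern is "price".
$_ = "the price is right";
my $interpolated = (m/$word/) ? 1 : 0;

# Single-quote delimiters: no interpolation. The pattern is the text
# $word itself, in which $ is still the end-of-string meta-character,
# so nothing can match here.
my $not_interpolated = (m'$word') ? 1 : 0;

# Escaping the dollar sign matches a literal "$word" in the string.
$_ = 'cost: $word';
my $literal = (m'\$word') ? 1 : 0;

print "$interpolated $not_interpolated $literal\n";
```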
Tip |
I tend to avoid delimiters that might be confused
with characters in the pattern. For example, using the plus sign as a
delimiter (m+abc+) does not help program readability. A casual
reader might think that you intend to add two expressions instead of
matching them. |
Caution |
The ? has a special meaning when used as a
match pattern delimiter. It works like the / delimiter except
that it matches only once between calls to the reset() function.
This feature may be removed in future versions of Perl, so avoid using
it. |
The next few sections look at the matching, substitution, and translation
operators in more detail.
The matching operator (
m//) is used to find
patterns in strings. One of its more common uses is to look for a specific
string inside a data file. For instance, you might look for all customers whose
last name is "Johnson" or you might need a list of all names starting with the
letter
s.
By default, the matching operator searches the $_ variable. This makes the
match statement shorter because you don't need to specify where to search. Here
is a quick example:
$_ = "AAA bbb AAA";
print "Found bbb\n" if m/bbb/;
The print statement is executed only if
the
bbb character sequence is found in the
$_ variable. In
this particular case,
bbb will be found, so the program will display
the following:
Found bbb
The matching operator allows you to use variable
interpolation in order to create the pattern. For example:
$needToFind = "bbb";
$_ = "AAA bbb AAA";
print "Found bbb\n" if m/$needToFind/;
Using the matching operator is
so commonplace that Perl allows you to leave off the
m from the
matching operator as long as slashes are used as delimiters:
$_ = "AAA bbb AAA";
print "Found bbb\n" if /bbb/;
Using the matching operator to find a
string inside a file is very easy because the defaults are designed to
facilitate this activity. For example:
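$target = "M";
# Stop with the system error message if the file cannot be opened.
open(INPUT, "<findstr.dat") or die "Cannot open findstr.dat: $!";
while (<INPUT>) {
    if (/$target/) {
        print "Found $target on line $.\n";
    }
}
close(INPUT);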
Note |
The $. special variable keeps track of the
record number. Every time the diamond operators read a line, this variable
is incremented. |
This example reads every line in the input file, searching for the letter M.
When an M is found, the print statement is executed. The print statement
prints the letter that was found and the number of the line it was found on.
The matching operator has several options that enhance its utility. The most
useful options are probably the capability to ignore case and the capability
to create an array of all matches in a string. Table 10.2 shows the options
you can use with the matching operator.
Table 10.2 - Options for the Matching Operator
Option | Description
g | This option finds all occurrences of the pattern in the string. A list of matches is returned or you can iterate over the matches using a loop statement.
i | This option ignores the case of characters in the string.
m | This option treats the string as multiple lines: the ^ and $ anchors match at the beginning and end of each line in the string, not just at the beginning and end of the whole string.
o | This option compiles the pattern only once. You can achieve some small performance gains with this option. It should be used with variable interpolation only when the value of the variable will not change during the lifetime of the program.
s | This option treats the string as a single line: the period meta-character also matches the newline character.
x | This option lets you use extended regular expressions. Basically, this means that Perl will ignore whitespace that's not escaped with a backslash or within a character class. I highly recommend this option so you can use spaces to make your regular expressions more readable. See the section "Example: Extension Syntax" later in this chapter for more information.
All options are specified after the last pattern delimiter. For instance, if
you want the match to ignore the case of the characters in the string, you can
do this:
$_ = "AAA BBB AAA";
print "Found bbb\n" if m/bbb/i;
This program finds a match even though
the pattern uses lowercase and the string uses uppercase because the
/i
option was used, telling Perl to ignore the case.
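The m, s, and x options from Table 10.2 can be sketched in the same way. This is only an illustration with made-up strings; it uses the =~ binding operator, covered later in this chapter, to test a named variable instead of $_.

```perl
use strict;
use warnings;

my $text = "first\nsecond";    # hypothetical two-line string

# Without /m, the caret only matches at the very start of the string.
my $plain_caret = ($text =~ m/^second/)  ? 1 : 0;
my $multi_caret = ($text =~ m/^second/m) ? 1 : 0;

# Without /s, the period does not match the newline character.
my $plain_dot  = ($text =~ m/first.second/)  ? 1 : 0;
my $single_dot = ($text =~ m/first.second/s) ? 1 : 0;

# With /x, unescaped whitespace in the pattern is ignored, so the
# pattern below is just "bbb" and matches with or without spaces.
my $spaced = ("AAAbbbAAA" =~ m/ bbb /x) ? 1 : 0;

print "$plain_caret $multi_caret $plain_dot $single_dot $spaced\n";
```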
The result from a global pattern match can be assigned to an array variable
or used inside a loop. This feature comes in handy after you learn about
meta-characters in the section called "How to Create Patterns" later in this
chapter.
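Both uses of a global match can be sketched with the sample string from the earlier examples. The array names are assumptions made for this illustration.

```perl
use strict;
use warnings;

$_ = "AAA bbb AAA";

# In list context, /g returns every match at once.
my @matches = m/AAA/g;

# In scalar context, /g finds one match per call, so it can drive a loop.
my @ends;
while (m/AAA/g) {
    push @ends, pos($_);    # pos() reports where the last match ended
}

print scalar(@matches), " matches, ending at offsets @ends\n";
```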
The substitution operator (
s///) is used to
change strings. It requires two operands, like this:
s/a/z/;
This statement changes the first a in $_ into a z. Not too complicated, huh?
Things won't get complicated until we start talking about regular expressions
in earnest in the section "How to Create Patterns" later in the chapter.
You can use variable interpolation with the substitution operator just as you
can with the matching operator. For instance:
$needToReplace = "bbb";
$replacementText = "1234567890";
$_ = "AAA bbb AAA";
$result = s/$needToReplace/$replacementText/;
Note |
You can use variable interpolation in the
replacement pattern as shown here, but none of the meta-characters
described later in the chapter can be used in the replacement
pattern. |
This program changes the
$_ variable to hold
"AAA 1234567890
AAA" instead of its original value, and the
$result variable will
be equal to 1 - the number of substitutions made.
Frequently, the substitution operator is used to remove substrings. For
instance, if you want to remove the
"bbb" sequence of characters from
the
$_ variable, you could do this:
s/bbb//;
By replacing the matched string with nothing, you have
effectively deleted it.
If brackets of any type are used as delimiters for the search pattern, you
need to use a second set of brackets to enclose the replacement pattern. For
instance:
$_ = "AAA bbb AAA";
$result = s{bbb}{1234567890};
Like the
matching operator, the substitution operator has several options. One
interesting option is the capability to evaluate the replacement pattern as an
expression instead of a string. You could use this capability to find all
numbers in a file and multiply them by a given percentage, for instance. Or you
could repeat matched strings by using the string repetition operator. Table 10.3
shows all of the options you can use with the substitution operator.
Table 10.3 - Options for the Substitution Operator
Option | Description
e | This option forces Perl to evaluate the replacement pattern as an expression.
g | This option replaces all occurrences of the pattern in the string.
i | This option ignores the case of characters in the string.
m | This option treats the string as multiple lines: the ^ and $ anchors match at the beginning and end of each line in the string, not just at the beginning and end of the whole string.
o | This option compiles the pattern only once. You can achieve some small performance gains with this option. It should be used with variable interpolation only when the value of the variable will not change during the lifetime of the program.
s | This option treats the string as a single line: the period meta-character also matches the newline character.
x | This option lets you use extended regular expressions. Basically, this means that Perl ignores whitespace that is not escaped with a backslash or within a character class. I highly recommend this option so you can use spaces to make your regular expressions more readable. See the section "Example: Extension Syntax" later in this chapter for more information.
The /e option changes how the replacement is interpreted: it is evaluated as
a Perl expression, and the result becomes the replacement text. This happens
even if single quotes are used as delimiters, which would otherwise turn off
variable interpolation. Note that back quotes have no special meaning as
delimiters in Perl 5; the replacement is not executed as a DOS or UNIX command.
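Here is a small sketch of the /e option, assuming made-up sample strings. It relies on parentheses and the $1 variable (pattern memory) and on the \d and \w symbols, all of which are described later in this chapter.

```perl
use strict;
use warnings;

# Double every number. (\d+) stores a run of digits in $1, and the
# replacement "$1 * 2" is evaluated as an expression because of /e.
$_ = "apples 10, pears 25";
s/(\d+)/$1 * 2/ge;
my $doubled = $_;

# Repeat each matched character three times with the repetition operator.
$_ = "ab";
s/(\w)/$1 x 3/ge;
my $repeated = $_;

print "$doubled\n$repeated\n";
```

The /g option is combined with /e here so that every match is replaced, not just the first one.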
The translation operator (
tr///) is used to
change individual characters in the
$_ variable. It requires two
operands, like this:
tr/a/z/;
This statement translates all occurrences of
a
into
z. If you specify more than one character in the match character
list, you can translate multiple characters at a time. For instance:
tr/ab/z/;
translates all
a and all
b characters
into the
z character. If the replacement list of characters is shorter
than the target list of characters, the last character in the replacement list
is repeated as often as needed. However, if more than one replacement character
is given for a matched character, only the first is used. For instance:
tr/WWW/ABC/;
results in all
W characters being converted
to an
A character. The rest of the replacement list is ignored.
Unlike the matching and substitution operators, the translation operator
doesn't perform variable interpolation.
Note |
The tr operator gets its name from the UNIX
tr utility. If you are familiar with the tr utility, then you already know
how to use the tr operator.
The UNIX sed utility uses a y to indicate translations. To make
learning Perl easier for sed users, y is supported as a synonym for
tr. |
The
translation operator has options different from the matching and substitution
operators. You can delete matched characters, replace repeated characters with a
single character, and translate only characters that don't match the character
list. Table 10.4 shows the translation options.
Table 10.4 - Options for the Translation Operator
Option | Description
c | This option complements the match character list. In other words, the translation is done for every character that does not match the character list.
d | This option deletes any character in the match list that does not have a corresponding character in the replacement list.
s | This option reduces repeated instances of matched characters to a single instance of that character.
Normally, if the match list is longer than the replacement list, the last
character in the replacement list is used as the replacement for the extra
characters. However, when the
d option is used, the matched characters
are simply deleted.
If the replacement list is empty and the d option is not used, each matched
character is replaced by itself, so the string is left unchanged. The operator
will still return the number of characters that matched, though. This is useful
when you need to know how often a given letter appears in a string. Combined
with the s option, an empty replacement list also lets you compress repeated
characters.
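The counting trick and the three options from Table 10.4 can be sketched as follows. The sample string is hypothetical, and the a-z range in the last statement is standard tr notation for a run of consecutive characters.

```perl
use strict;
use warnings;

my $original = "hello   world";    # hypothetical sample string

# Empty replacement list: nothing changes, but the count is returned.
$_ = $original;
my $l_count = tr/l//;

# s option: squeeze each run of spaces down to a single space.
$_ = $original;
tr/ //s;
my $squeezed = $_;

# d option: delete matched characters that have no replacement.
$_ = $original;
tr/lo//d;
my $deleted = $_;

# c option: translate every character that is NOT a lowercase letter.
$_ = $original;
tr/a-z/*/c;
my $masked = $_;

print "$l_count\n$squeezed\n$deleted\n$masked\n";
```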
Tip |
UNIX programmers may be familiar with using the tr
utility to convert lowercase characters to uppercase characters, or vice
versa. Perl now has the lc() and uc() functions that can
do this much quicker. |
The matching, substitution, and translation operations work on the $_
variable by default. What if the string to be searched is in some other
variable? That's where the binding operators come into play. They let you bind
the regular expression operators to a variable other than $_.
There are two forms of the binding operator: the regular
=~ and its
complement
!~. The following small program shows the syntax of the
=~ operator:
$scalar = "The root has many leaves";
$match = $scalar =~ m/root/;
$substitution = $scalar =~ s/root/tree/;
$translate = $scalar =~ tr/h/H/;
print("\$match = $match\n");
print("\$substitution = $substitution\n");
print("\$translate = $translate\n");
print("\$scalar = $scalar\n");
This program displays the
following:
$match = 1
$substitution = 1
$translate = 2
$scalar = THe tree Has many leaves
This example uses all three of
the regular expression operators with the regular binding operator. Each of the
regular expression operators was bound to the
$scalar variable instead
of
$_. This example also shows the return values of the regular
expression operators. If you don't need the return values, you could do this:
$scalar = "The root has many leaves";
print("String has root.\n") if $scalar =~ m/root/;
$scalar =~ s/root/tree/;
$scalar =~ tr/h/H/;
print("\$scalar = $scalar\n");
This program displays the following:
String has root.
$scalar = THe tree Has many leaves
The left operand of the binding
operator is the string to be searched, modified, or transformed; the right
operand is the regular expression operator to be evaluated.
The complementary binding operator is valid only when used with the matching
regular expression operator. If you use it with the substitution or translation
operator, you get the following message if you're using the
-w
command-line option to run Perl:
Useless use of not in void context at test.pl line 4.
You can see
that the
!~ is the opposite of
=~ by replacing the
=~
in the previous example:
$scalar = "The root has many leaves";
print("String has root.\n") if $scalar !~ m/root/;
$scalar =~ s/root/tree/;
$scalar =~ tr/h/H/;
print("\$scalar = $scalar\n");
This program displays the following:
$scalar = THe tree Has many leaves
The first print line does not
get executed because the complementary binding operator returns false.
So far in
this chapter, you've read about the different operators used with regular
expressions, and you've seen how to match simple sequences of characters. Now
we'll look at the wide array of meta-characters that are used to harness the
full power of regular expressions.
Meta-characters are characters that
have an additional meaning above and beyond their literal meaning. For example,
the period character can have two meanings in a pattern. First, it can be used
to match a period character in the searched string - this is its
literal
meaning. And second, it can be used to match
any character in the
searched string except for the newline character - this is its
meta-meaning.
When creating patterns, the meta-meaning will always be the default. If you
really intend to match the literal character, you need to prefix the
meta-character with a backslash. You might recall that the backslash is used to
create an escape sequence.
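The two meanings of the period can be sketched like this, using made-up strings:

```perl
use strict;
use warnings;

# Unescaped, the period is a meta-character: it matches any single
# character except the newline, including a literal period.
my $meta_on_letter = ("abc" =~ m/a.c/) ? 1 : 0;
my $meta_on_period = ("a.c" =~ m/a.c/) ? 1 : 0;

# Escaped with a backslash, it matches only a literal period.
my $literal_on_letter = ("abc" =~ m/a\.c/) ? 1 : 0;
my $literal_on_period = ("a.c" =~ m/a\.c/) ? 1 : 0;

print "$meta_on_letter $meta_on_period $literal_on_letter $literal_on_period\n";
```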
Patterns can have many different components. These components all combine to
provide you with the power to match any type of string. The following list of
components will give you a good idea of the variety of ways that patterns can be
created. The section "
Pattern
Examples" later in this chapter shows many examples of these rules in
action.
- Variable Interpolation: Any variable is interpolated, and the
essentially new pattern is then evaluated as a regular expression. Remember
that only one level of interpolation is done. This means that if the value of
the variable includes, for example, $scalar as a string value, then
$scalar will not be interpolated. In addition, back-quotes do not
interpolate within double-quotes, and single-quotes do not stop interpolation
of variables when used within double-quotes.
- Self-Matching Characters: Any character will match itself unless it
is a meta-character or one of $, @, %,
&. The meta-characters are listed in Table 10.5, and the other
characters are used to begin variable names and function calls. You can use
the backslash character to force Perl to match the literal meaning of any
character. For example, m/a/; will return true if the letter
a is in the $_ variable. And m/\$/; will return
true if the character $ is in the $_ variable.
Table 10.5 - Regular Expression Meta-Characters, Meta-Brackets, and Meta-Sequences
Meta-Character | Description
^ | This meta-character - the caret - will match the beginning of a string or, if the /m option is used, the beginning of a line. It is one of two pattern anchors - the other anchor is the $.
. | This meta-character will match any character except for the newline unless the /s option is specified. If the /s option is specified, then the newline will also be matched.
$ | This meta-character will match the end of a string or, if the /m option is used, the end of a line. It is one of two pattern anchors - the other anchor is the ^.
| | This meta-character - called alternation - lets you specify two values that can cause the match to succeed. For instance, m/a|b/ means that the $_ variable must contain the "a" or "b" character for the match to succeed.
* | This meta-character indicates that the "thing" immediately to the left should be matched 0 or more times in order to be evaluated as true.
+ | This meta-character indicates that the "thing" immediately to the left should be matched 1 or more times in order to be evaluated as true.
? | This meta-character indicates that the "thing" immediately to the left should be matched 0 or 1 times in order to be evaluated as true. When used in conjunction with the +, *, ?, or {n,m} meta-characters and brackets, it means that the regular expression should be non-greedy and match the smallest possible string.
Meta-Brackets | Description
() | The parentheses let you affect the order of pattern evaluation and act as a form of pattern memory. See the section "Pattern Memory" later in this chapter for more information.
(?...) | If a question mark immediately follows the left parenthesis, it indicates that an extended mode component is being specified. See the section "Example: Extension Syntax" later in this chapter for more information.
{n,m} | The curly braces let you specify how many times the "thing" immediately to the left should be matched. {n} means that it should be matched exactly n times. {n,} means it must be matched at least n times. {n,m} means that it must be matched at least n times and not more than m times.
[] | The square brackets let you create a character class. For instance, m/[abc]/ will evaluate to true if any of "a", "b", or "c" is contained in $_. The square brackets are a more readable alternative to the alternation meta-character.
Meta-Sequences | Description
\ | This meta-character "escapes" the following character. This means that any special meaning normally attached to that character is ignored. For instance, if you need to include a dollar sign in a pattern, you must use \$ to avoid Perl's variable interpolation. Use \\ to specify the backslash character in your pattern.
\nnn | Any Octal byte. Use zero padding for values from \000 to \077 inclusively. For larger values simply use the three-digit number (like \100 or \323).
\a | Alarm.
\A | This meta-sequence represents the beginning of the string. Its meaning is not affected by the /m option.
\b | This meta-sequence represents the backspace character inside a character class; otherwise, it represents a word boundary. A word boundary is the spot between word (\w) and non-word (\W) characters. Perl thinks that the \W meta-sequence matches the imaginary characters off the ends of the string.
\B | Match a non-word boundary.
\cn | Any control character.
\d | Match a single digit character.
\D | Match a single non-digit character.
\e | Escape.
\E | Terminate the \L or \U sequence.
\f | Form Feed.
\G | Match only where the previous m//g left off.
\l | Change the next character to lowercase.
\L | Change the following characters to lowercase until a \E sequence is encountered.
\n | Newline.
\Q | Quote Regular Expression meta-characters literally until the \E sequence is encountered.
\r | Carriage Return.
\s | Match a single whitespace character.
\S | Match a single non-whitespace character.
\t | Tab.
\u | Change the next character to uppercase.
\U | Change the following characters to uppercase until a \E sequence is encountered.
\v | Vertical Tab.
\w | Match a single word character. Word characters are the alphanumeric and underscore characters.
\W | Match a single non-word character.
\xnn | Any Hexadecimal byte.
\Z | This meta-sequence represents the end of the string. Its meaning is not affected by the /m option.
\$ | Dollar Sign.
\@ | At Sign.
\% | Percent Sign.
Errata
Note |
The \% is not a valid escape sequence for
Perl. It was included erroneously. Please ignore this entry. If
you need to use the % character in double-quoted strings, go ahead
and use it. Later in the book, you'll read about the printf()
function. If you want to actually use the % character in the
printf() format string, use the %% sequence. - Randal Schwartz was
kind enough to identify this
error. |
- Character Sequences: A sequence of characters will match the
identical sequence in the searched string. The characters need to be in the
same order in both the pattern and the searched string for the match to be
true. For example, m/abc/; will match "abc" but not
"cab" or "bca". If any character in the sequence is a
meta-character, you need to use the backslash to match its literal value.
- Alternation: The alternation meta-character (|)
will let you match more than one possible string. For example,
m/a|b/; will match if either the "a" character or the
"b" character is in the searched string. You can use sequences of
more than one character with alternation. For example, m/dog|cat/;
will match if either of the strings "dog" or "cat" is in the
searched string.
Tip |
Some programmers like to enclose the alternation
sequence inside parentheses to help indicate where the sequence begins
and ends.
m/(dog|cat)/;
However, this will affect something called
pattern memory, which you'll be learning about in the section "Example:
Pattern Memory" later in the chapter. |
- Character Classes: The square brackets are used to create character
classes. A character class is used to match a specific type of
character. For example, you can match any decimal digit using
m/[0123456789]/. This will match a single character in the range of
zero to nine. You can find more information about character classes in the
section called "Example:
Character Classes" later in this chapter.
Errata
Note |
The printed version of this book says:
m/0123456789/;. The square brackets were missing and the
semi-colon is extraneous since this is an example of an expression, not
a statement. Randal Schwartz was kind enough to point out this
problem. |
- Symbolic Character Classes: There are several character classes
that are used so frequently that they have a symbolic representation. The
period meta-character stands for a special character class that matches all
characters except for the newline. The rest are \d, \D,
\s, \S, \w, and \W. These are mentioned in
Table 10.5 earlier and are discussed in the section "Example:
Character Classes" later in this chapter.
- Anchors: The caret (^) and the dollar sign ($) meta-characters
are used to anchor a pattern to the beginning and the end of the searched
string. The caret is normally the first character in the pattern when used as
an anchor. For example, m/^one/; will only match if the searched string
starts with the sequence of characters one. The dollar sign is normally
the last character in the pattern when used as an anchor. For example,
m/(last|end)$/; will match only if the searched string ends with
either the character sequence last or the character sequence
end. The \A and \Z meta-sequences are also used as
pattern anchors for the beginning and end of strings.
Errata
Note |
The printed version of this book states "The caret
is always the first character in the pattern when used as an anchor".
However, this is not strictly true when the alternation meta-character
is used. For example, /Jack$|^John/ will match when "Jack" is at the end
of a string or when "John" is at the beginning of a string. Randal
Schwartz was kind enough to mention that this concept needs
clarification. |
- Quantifiers: There are several meta-characters that are devoted to
controlling how many characters are matched. For example, m/a{5}/;
means that five a characters must be found before a true result can
be returned. The *, +, and ? meta-characters and
the curly braces are all used as quantifiers. See the section "Example:
Quantifiers" later in this chapter for more information.
- Pattern Memory: Parentheses are used to store matched values into
buffers for later recall. I like to think of this as a form of pattern memory.
Some programmers call them back-references. After you use
m/(fish|fowl)/; to match a string and a match is found, the variable
$1 will hold either fish or fowl depending on which
sequence was matched. See the section "Example:
Pattern Memory" later in this chapter for more information.
- Word Boundaries: The \b meta-sequence will match the spot
between a space and the first character of a word or between the last
character of a word and the space. The \b will match at the beginning
or end of a string if there are no leading or trailing spaces. For example,
m/\bfoo/; will match foo even without spaces surrounding the
word. It will also match $foo because the dollar sign is not
considered a word character. The statement m/foo\b/; will match
foo but not foobar, and the statement m/\bwiz/;
will match wizard but not geewiz. See the section "Example:
Character Classes" later in this chapter for more information about word
boundaries.
The \B meta-sequence will match everywhere except at a word
boundary.
- Quoting Meta-Characters: You can match meta-characters literally by
enclosing them in a \Q..\E sequence. This will let you avoid using
the backslash character to escape all meta-characters, and your code will be
easier to read.
- Extended Syntax: The (?...) sequence lets you use an extended
version of the regular expression syntax. The different options are discussed
in the section "Example:
Extension Syntax" later in this chapter.
- Combinations: Any of the preceding components can be combined with
any other to create simple or complex patterns.
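Several of the components in this list can be sketched together. The strings below are made up for illustration (apart from the foo and wiz examples, which follow the text above), and each result is stored as 1 or 0 so that it is easy to print.

```perl
use strict;
use warnings;

# Anchors: ^ ties the pattern to the start, $ to the end.
my $starts_with_one = ("one two" =~ m/^one/) ? 1 : 0;
my $one_not_first   = ("two one" =~ m/^one/) ? 1 : 0;
my $ends_with_end   = ("the very end" =~ m/(last|end)$/) ? 1 : 0;

# Quantifier: exactly five consecutive a characters are required.
my $five_a = ("aaaaa" =~ m/a{5}/) ? 1 : 0;
my $four_a = ("aaaa"  =~ m/a{5}/) ? 1 : 0;

# Word boundaries: the dollar sign is not a word character.
my $dollar_foo = ('$foo'   =~ m/\bfoo/) ? 1 : 0;
my $foobar     = ("foobar" =~ m/foo\b/) ? 1 : 0;
my $wizard     = ("wizard" =~ m/\bwiz/) ? 1 : 0;
my $geewiz     = ("geewiz" =~ m/\bwiz/) ? 1 : 0;

# Quoting: inside \Q..\E the period in $needle loses its meta-meaning.
my $needle        = "a.c";    # hypothetical search text
my $quoted_letter = ("abc" =~ m/\Q$needle\E/) ? 1 : 0;
my $quoted_period = ("a.c" =~ m/\Q$needle\E/) ? 1 : 0;

print "anchors: $starts_with_one $one_not_first $ends_with_end\n";
print "quantifier: $five_a $four_a\n";
print "boundaries: $dollar_foo $foobar $wizard $geewiz\n";
print "quoting: $quoted_letter $quoted_period\n";
```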
The power of patterns is that you don't always know in advance the value of
the string that you will be searching. If you need to match the first word in a
string that was read in from a file, you probably have no idea how long it might
be; therefore, you need to build a pattern. You might start with the
\w
symbolic character class, which will match any single alphanumeric or underscore
character. So, assuming that the string is in the
$_ variable, you can
match a one-character word like this:
m/\w/;
If you need to match both a one-character word and a
two-character word, you can do this:
m/\w|\w\w/;
This pattern says to match a single word character or
two consecutive word characters. You could continue to add alternation
components to match the different lengths of words that you might expect to see,
but there is a better way.
You can use the
+ quantifier to say that the match should succeed
only if the component is matched one or more times. It is used this way:
m/\w+/;
If the value of $_ was "AAA BBB", then m/\w+/; would match the "AAA" in the
string. If $_ was blank, full of whitespace, or full of other non-word
characters, the match would fail and a false value would be returned.
The preceding pattern will let you determine if
$_ contains a word
but does not let you know what the word is. In order to accomplish that, you
need to enclose the matching components inside parentheses. For example:
m/(\w+)/;
By doing this, you force Perl to store the matched
string into the
$1 variable. The
$1 variable can be considered
as pattern memory.
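A minimal sketch of this use of pattern memory follows. The variable names are assumptions, and the second string is made up to show a failed match.

```perl
use strict;
use warnings;

$_ = "AAA BBB";
my $first_word = "";
if (m/(\w+)/) {
    $first_word = $1;    # the text matched inside the parentheses
}

# A string with no word characters fails to match, so test the result
# of the match before trusting $1.
$_ = "   ...   ";
my $found = (m/(\w+)/) ? 1 : 0;

print "first word: $first_word, second match found: $found\n";
```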
This introduction to pattern components describes most of the details you
need to know in order to create your own patterns or regular expressions.
However, some of the components deserve a bit more study. The next few sections
look at character classes, quantifiers, pattern memory, pattern precedence, and
the extension syntax. Then the rest of the chapter is devoted to showing
specific examples of when to use the different components.
A
character class defines a type of character. The character class [0123456789]
defines the class of decimal digits, and [0-9a-f] defines the class of
hexadecimal digits. Notice that you can use a dash to define a range of
consecutive characters. Character classes let you match any of a range of
characters; you don't know in advance which character will be matched. This
capability to match non-specific characters is what meta-characters are all
about.
You can use variable interpolation inside the character class, but you must
be careful when doing so. For example,
$_ = "AAABBBCCC";
$charList = "ADE";
print "matched" if m/[$charList]/;
will display
matched
This is because the variable interpolation results in a
character class of [ADE]. If you use the variable as one-half of a character
range, you need to ensure that you don't mix letters and digits, because the
resulting range may run backward. For example,
$_ = "AAABBBCCC";
$charList = "ADE";
print "matched" if m/[$charList-9]/;
will result in the following error
message when executed:
/[ADE-9]/: invalid [] range in regexp at test.pl line 4.
At times,
it's necessary to match on any character except for a given character list. This
is done by complementing the character class with the caret. For example,
$_ = "AAABBBCCC";
print "matched" if m/[^ABC]/;
will display nothing. This match returns
true only if a character besides
A,
B, or
C is in the
searched string. If you complement a list with just the letter
A,
$_ = "AAABBBCCC";
print "matched" if m/[^A]/;
then the string
"matched" will be
displayed because
B and
C are part of the string - in other
words, a character besides the letter
A.
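A complemented class is handy for validation. The sketch below, which uses made-up input values, reports whether a string consists only of hexadecimal digits by looking for any character that is not one.

```perl
use strict;
use warnings;

# Hypothetical inputs, chosen only to exercise the class.
my @candidates = ("CAFE01", "cafe-01", "12G4");

my %is_hex;
foreach my $value (@candidates) {
    # [^0-9a-fA-F] matches any character that is NOT a hex digit,
    # so a successful match means the value is not pure hex.
    $is_hex{$value} = ($value =~ m/[^0-9a-fA-F]/) ? 0 : 1;
    print "$value: ", ($is_hex{$value} ? "hex" : "not hex"), "\n";
}
```

Note that an empty string would also pass this test, because it contains no offending character.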
Perl has shortcuts for some character classes that are frequently used. Here
is a list of what I call symbolic character classes:
- \w - This symbol matches any alphanumeric character or the
underscore character. It is equivalent to the character class
[a-zA-Z0-9_].
- \W - This symbol matches every character that the
\w symbol does not. In other words, it is the complement of
\w. It is equivalent to [^a-zA-Z0-9_].
- \s - This symbol matches any whitespace character: space, tab,
newline, carriage return, or form feed. It is equivalent to [ \t\n\r\f].
- \S - This symbol matches any non-whitespace character. It
is equivalent to [^ \t\n\r\f].
- \d - This symbol matches any digit. It is equivalent to
[0-9].
- \D - This symbol matches any non-digit character. It is
equivalent to [^0-9].
You can use these symbols inside other character classes but not as endpoints
of a range. For example, you can do the following:
$_ = "\tAAA";
print "matched" if m/[\d\s]/;
which will display
matched
because the value of
$_ includes the tab
character.
Tip |
Most meta-characters that appear inside the square
brackets that define a character class are used in their literal sense.
They lose their meta-meaning. The exceptions are the backslash, a caret in
the first position, a dash between two characters, and the closing bracket.
This may be a little confusing at first. In
fact, I have a tendency to forget this when evaluating
patterns. |
Note |
I think that most of the confusion regarding regular
expressions lies in the fact that each character of a pattern might have
several possible meanings. The caret could be an anchor, it could be a
caret, or it could be used to complement a character class. Therefore, it
is vital that you decide which context any given pattern character or
symbol is in before assigning a meaning to it. |
Perl provides
several different quantifiers that let you specify how many times a given
component must be present before the match is true. They are used when you don't
know in advance how many characters need to be matched. Table 10.6 lists the
different quantifiers that can be used.
Table 10.6 - The Six Types of Quantifiers
Quantifier | Description
* | The component must be present zero or more times.
+ | The component must be present one or more times.
? | The component must be present zero or one times.
{n} | The component must be present n times.
{n,} | The component must be present at least n times.
{n,m} | The component must be present at least n times and no more than m times.
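To see the six quantifiers side by side, here is a small sketch that tests a few strings of repeated a characters against each one. The test strings and the %result hash are made up for this example; each pattern is anchored at both ends so that the quantifier has to account for the whole string.

```perl
use strict;
use warnings;

# Hypothetical test strings made of repeated "a" characters.
my %result;
foreach my $text ("", "a", "aaa", "aaaaa") {
    $result{$text} = join(" ",
        ($text =~ m/^a*$/     ? "*"     : "-"),   # zero or more
        ($text =~ m/^a+$/     ? "+"     : "-"),   # one or more
        ($text =~ m/^a?$/     ? "?"     : "-"),   # zero or one
        ($text =~ m/^a{3}$/   ? "{3}"   : "-"),   # exactly three
        ($text =~ m/^a{3,}$/  ? "{3,}"  : "-"),   # three or more
        ($text =~ m/^a{1,3}$/ ? "{1,3}" : "-"),   # one to three
    );
    print "'$text': $result{$text}\n";
}
```

Each output line lists the quantifiers that accepted the string, with a dash in place of those that rejected it.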
If you need to match a word whose length is unknown, you need to use the
+ quantifier. You can't use an * because a zero length word
makes no sense. So, the match statement might look like this:
m/^\w+/;
This pattern will match "QQQ" and "AAAAA" but not "" or " BBB ". In order to
account for the leading whitespace, which may or may not be at the beginning
of a string, you need to use the asterisk (*) quantifier in conjunction with the
\s symbolic character class in the following way:
m/^\s*\w+/;
Tip |
Be careful when using the * quantifier
because it can match an empty string, which might not be your intention.
The pattern /b*/ will match any string - even one without any
b characters. |
Errata
Note |
The printed version of this book has the first match
statement as m/\w+/;. Notice that the pattern anchor was left
out. |
At times, you may need to match an exact number of components. The following
match statement will be true only if five words are present in the
$_
variable:
$_ = "AA AB AC AD AE";
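m/^(\w+\b\W*){5}$/;
In this example, we are matching one or more word
characters, then a word boundary, then zero or more non-word characters. The
\b keeps a single word from being split into several pieces. Notice that the
end of a string counts as a word boundary when it follows a word character, so
the last word does not need a trailing character. The
{5}
quantifier is used to ensure that that combination of components is present five
times.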
Errata
Note |
The printed version of the book used the pattern
m/(\w+\s*){5}/; in order to match the five words. This is
incorrect since the pattern \w+\s* can match a single character
(remember that * matches zero or more instances of a character).
Therefore m/(\w+\s*){5}/; matches "AAAAA" as well as "A A A A
A". |
The
* and
+ quantifiers are greedy. They match as many
characters as possible. This may not always be the behavior that you need. You
can create non-greedy components by following the quantifier with a
?.
Use the following file specification in order to look at the
* and
+ quantifiers more closely:
$_ = '/user/Jackie/temp/names.dat';
The regular expression
.* will match the entire file specification. This can be seen in the
following small program:
$_ = '/user/Jackie/temp/names.dat';
m/.*/;
print $&;
This program displays
/user/Jackie/temp/names.dat
You can see that the
*
quantifier is greedy. It matched the whole string. If you add the
?
modifier to make the
.* component non-greedy, what do you think the
program would display?
$_ = '/user/Jackie/temp/names.dat';
m/.*?/;
print $&;
This program displays nothing because the least amount of
characters that the
* matches is zero. If we change the
* to a
+, then the program will display
/
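The difference becomes more visible when the quantified component sits between two literal characters. The following sketch, which reuses the file specification from above, captures the text between two slashes once with a greedy quantifier and once with a non-greedy one. The variable names are invented for the example.

```perl
use strict;
use warnings;

my $path = '/user/Jackie/temp/names.dat';

# Greedy: .+ grabs as much as it can, so the match runs to the LAST slash.
my ($greedy) = $path =~ m!/(.+)/!;

# Non-greedy: .+? stops at the first point where the rest of the
# pattern can match, so the match ends at the SECOND slash.
my ($lazy) = $path =~ m!/(.+?)/!;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

The exclamation point is used as the delimiter so that the slashes in the pattern do not need to be escaped.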
Next, let's look at the concept of pattern memory, which lets
you keep bits of matched string around after the match is complete.
Matching
arbitrary numbers of characters is fine, but without the capability to find out
what was matched, patterns would not be very useful. Perl lets you enclose
pattern components inside parentheses in order to store the string that matched
the components into pattern memory. You might also hear
pattern memory
referred to as
pattern buffers. This memory persists after the match
statement is finished executing so that you can assign the matched values to
other variables.
You saw a simple example of this earlier right after the component
descriptions. That example looked for the first word in a string and stored it
into the first buffer,
$1. The following small program
$_ = "AAA BBB CCC";
m/(\w+)/;
print("$1\n");
will display
AAA
You can use as many buffers as you need. Each time you add a set of
parentheses, another buffer is used. The pattern matched by the first set is
placed into $1. The pattern matched by the second set is placed into $2. And so
on.
If you want to find all the words in the string, you need to use the
/g match option. In order to find all the words, you can use a loop
statement that loops until the match operator returns false.
$_ = "AAA BBB CCC";
while (m/(\w+)/g) {
print("$1\n");
}
The program will display
AAA
BBB
CCC
If looping through the matches is not the right approach for your
needs, perhaps you need to create an array consisting of the matches.
$_ = "AAA BBB CCC";
@matches = m/(\w+)/g;
print("@matches\n");
The program will display
AAA BBB CCC
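When a pattern has more than one set of parentheses, the /g option in list context returns every captured value in order. As a sketch, assuming some invented name=value data, the flat list can be assigned straight to a hash:

```perl
use strict;
use warnings;

# Hypothetical "name=value" data, invented for this example.
my $settings = "width=80 height=24 depth=8";

# In list context, /g returns every captured substring in order:
# ("width", "80", "height", "24", "depth", "8").
my @pairs = $settings =~ m/(\w+)=(\d+)/g;

# Assigning that flat list to a hash pairs each name with its value.
my %value_of = @pairs;

foreach my $name (sort keys %value_of) {
    print "$name = $value_of{$name}\n";
}
```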
Perl also has a few special variables to help you know
what matched and what did not. These variables will occasionally save you from
having to add parentheses to find information.
- $+ - This variable is assigned the value that the last
bracket match matched.
- $& - This variable is assigned the value of the entire
matched string. If the match is not successful, then $& retains
its value from the last successful match.
- $` - This variable is assigned everything in the searched
string that is before the matched string.
- $' - This variable is assigned everything in the search
string that is after the matched string.
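Here is a short sketch that shows all four special variables after a single match. The sample string and the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $text = "red green blue";

my ($before, $matched, $after, $last_group);
if ($text =~ m/(gr)(e+)n/) {
    $before     = $`;   # everything before the match: "red "
    $matched    = $&;   # the whole match: "green"
    $after      = $';   # everything after the match: " blue"
    $last_group = $+;   # the last parenthesized group that matched: "ee"
}
print "[$before] [$matched] [$after] [$last_group]\n";
```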
Tip |
If you need to save the value of the matched strings
stored in the pattern memory, make sure to assign them to other variables.
Pattern memory is local to the enclosing block and lasts only until
another successful match is done. |
Pattern components have an order of precedence just as
operators do. If you see the following pattern:
m/a|b+/
it's hard to tell if the pattern should be
m/(a|b)+/ # match any sequence of "a" and "b" characters
# in any order.
or
m/a|(b+)/ # match either the "a" character or the "b" character
# repeated one or more times.
The order of precedence shown
in Table 10.7 is designed to solve problems like this. By looking at the table,
you can see that quantifiers have a higher precedence than alternation.
Therefore, the second interpretation is correct.
Table 10.7 - The Pattern Component Order of Precedence
Precedence Level | Component
1 | Parentheses
2 | Quantifiers
3 | Sequences and Anchors
4 | Alternation
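You can check the precedence rule yourself. In this sketch the sample string "abab" and the variable names are invented; an extra set of parentheses is wrapped around each pattern only so that the matched text can be captured and printed.

```perl
use strict;
use warnings;

my $text = "abab";

# Quantifiers bind tighter than alternation, so a|b+ means a|(b+).
# Against "abab" the first alternative matches a single "a".
my ($plain) = $text =~ m/^(a|b+)/;

# Parentheses override the precedence: (a|b)+ repeats the whole
# alternation, so the match covers the entire string.
my ($grouped) = $text =~ m/^((a|b)+)/;

print "a|b+   matched '$plain'\n";
print "(a|b)+ matched '$grouped'\n";
```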
Tip |
You can use parentheses to affect the order that
components are evaluated because they have the highest precedence.
However, unless you use the extended syntax, you will be affecting the
pattern memory. |
The
regular expression extensions are a way to significantly add to the power of
patterns without adding a lot of meta-characters to the proliferation that
already exists. By using the basic (?...) notation, the regular expression
capabilities can be greatly extended.
At this time, Perl recognizes five extensions. These vary widely in
functionality - from adding comments to setting options. Table 10.8 lists the
extensions and gives a short description of each.
Table 10.8 - Five Extension Components
Extension | Description
(?# TEXT) | This extension lets you add comments to your regular expression. The TEXT value is ignored.
(?:...) | This extension lets you add parentheses to your regular expression without causing a pattern memory position to be used.
(?=...) | This extension lets you match values without including them in the $& variable.
(?!...) | This extension lets you specify what should not follow your pattern. For instance, /blue(?!bird)/ means that "bluebox" and "bluesy" will be matched but not "bluebird".
(?sxi) | This extension lets you specify an embedded option in the pattern rather than adding it after the last delimiter. This is useful if you are storing patterns in variables and using variable interpolation to do the matching.
By far the most useful feature of extended mode, in my opinion, is the
ability to add comments directly inside your patterns. For example, would you
rather see a pattern that looks like this:
# Match a string with two words. $1 will be the
# first word. $2 will be the second word.
m/^\s*(\w+)\W+(\w+)\s*$/;
or one that looks like this:
m/
(?# This pattern will match any string with two)
(?# and only two words in it. The matched words)
(?# will be stored in pattern memory if the match)
(?# is successful.)
^ (?# Anchor this match to the beginning)
(?# of the string)
\s* (?# skip over any whitespace characters)
(?# use the * because there may be none)
(\w+) (?# Match the first word, we know it's)
(?# the first word because of the anchor)
(?# above. Place the matched word into)
(?# pattern memory.)
\W+ (?# Match at least one non-word)
(?# character, there may be more than one)
(\w+) (?# Match another word, put into pattern)
(?# memory also.)
\s* (?# skip over any whitespace characters)
(?# use the * because there may be none)
$ (?# Anchor this match to the end of the)
(?# string. Because both ^ and $ anchors)
(?# are present, the entire string will)
(?# need to match the pattern. A)
(?# sub-string that fits the pattern will)
(?# not match.)
/x;
Of course, the commented pattern is much longer, but both versions take about the
same amount of time to execute. In addition, it will be much easier to maintain
the commented pattern because each component is explained. When you know what
each component is doing in relation to the rest of the pattern, it becomes easy
to modify its behavior when the need arises.
Extensions also let you change the order of evaluation without affecting
pattern memory. For example,
m/(?:a|b)+/;
matches the
a or
b characters
repeated one or more times in any order. The pattern memory will not be
affected.
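The effect on pattern memory is easy to demonstrate. In the sketch below, which uses an invented sample string, the same text is matched twice: once with ordinary parentheses and once with the (?:...) extension.

```perl
use strict;
use warnings;

my $text = "abba 42";

# With ordinary parentheses the repeated group takes $1,
# and the digits end up in $2.
my @with_capture = $text =~ m/(a|b)+\s+(\d+)/;

# With (?:...) the group only controls precedence,
# so the digits are the first (and only) captured value.
my @without_capture = $text =~ m/(?:a|b)+\s+(\d+)/;

print "capturing:     @with_capture\n";
print "non-capturing: @without_capture\n";
```

A repeated capturing group keeps only the text from its last repetition, which is why the first list starts with a single character.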
At times, you might like to include a pattern component in your pattern
without including it in the
$& variable that holds the matched
string. The technical term for this is a
zero-width positive look-ahead
assertion. You can use this to ensure that the string following the matched
component is correct without affecting the matched value. For example, if you
have some data that looks like this:
David Veterinarian 56
Jackie Orthopedist 34
Karen Veterinarian 28
and you want to find all veterinarians and store
the value of the first column, you can use a look-ahead assertion. This will do
both tasks in one step. For example:
while (<>) {
push(@array, $&) if m/^\w+(?=\s+Vet)/;
}
print("@array\n");
This program will display:
David Karen
Let's look at the pattern with comments added using
the extended mode. In this case, it doesn't make sense to add comments directly
to the pattern because the pattern is part of the
if statement
modifier. Adding comments in that location would make the comments hard to
format. So let's use a different tactic.
$pattern = '^\w+ (?# Match the first word in the string)
(?=\s+ (?# Use a look-ahead assertion to match)
(?# one or more whitespace characters)
Vet) (?# In addition to the whitespace, make)
(?# sure that the next column starts)
(?# with the character sequence "Vet")
';
while (<>) {
push(@array, $&) if m/$pattern/x;
}
print("@array\n");
Here we used a variable to hold the pattern and then
used variable interpolation in the pattern with the match operator. You might
want to pick a more descriptive variable name than
$pattern, however.
Tip |
A look-ahead assertion does not have to be the last pattern
component, and a pattern can contain more than one. Because an assertion
consumes no characters, whatever follows it in the pattern is matched
starting at the same position as the assertion. |
The last extension that we'll discuss is the
zero-width negative look-ahead
assertion. This type of component is used to specify values that shouldn't
follow the matched string. For example, using the same data as in the previous
example, you can look for everyone who is not a veterinarian. Your first
inclination might be to simply replace the
(?=...) with the
(?!...) in the previous example.
while (<>) {
push(@array, $&) if m/^\w+(?!\s+Vet)/;
}
print("@array\n");
Unfortunately, this program displays
Davi Jackie Kare
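which is not what you need. The problem is backtracking. When the assertion
fails after David, Perl makes \w+ give up its last character and tries again
with Davi. The next character is then d, which is not whitespace followed by the
Vet character sequence, so the negative assertion succeeds. In order to
correctly match the first word, you
need to explicitly tell Perl that the first word ends at a word boundary, like
this: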
while (<>) {
push(@array, $&) if m/^\w+\b(?!\s+Vet)/;
}
print("@array\n");
This program displays
Jackie
which is correct.
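If you want to try both kinds of look-ahead without creating an input file, the following sketch holds the sample records in an array. The array names are invented for this example; the two patterns are the ones shown above.

```perl
use strict;
use warnings;

# The sample records from the text, held in an array instead of
# being read from a file.
my @records = (
    "David Veterinarian 56",
    "Jackie Orthopedist 34",
    "Karen Veterinarian 28",
);

my (@vets, @others);
foreach my $record (@records) {
    # Positive look-ahead: keep the name only if "Vet" follows.
    push(@vets, $&) if $record =~ m/^\w+(?=\s+Vet)/;
    # Negative look-ahead: \b stops \w+ from giving back characters.
    push(@others, $&) if $record =~ m/^\w+\b(?!\s+Vet)/;
}
print "vets:   @vets\n";
print "others: @others\n";
```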
Tip |
There are many ways of matching any value. If the
first method you try doesn't work, try breaking the value into smaller
components and match each boundary. If all else fails, you can always ask
for help on the comp.lang.perl.misc
newsgroup. |
In order to demonstrate
many different patterns, I will depart from the standard example format in this
section. Instead, I will explain a matching situation in italicized text and
then a possible resolution will immediately follow. After the resolution, I'll
add some comments to explain how the match is done. In all of these examples,
the string to search will be in the
$_ variable.
- If you need to find repeated characters in a string like the AA in "ABC AA
ABC", then do this:
m/(.)\1/;
This pattern uses pattern memory to store a single
character. Then a back-reference (\1) is used to repeat the first
character. The back-reference is used to reference the pattern memory while
still inside the pattern. Anywhere else in the program, use the $1
variable. After this statement, $1 will hold the repeated character.
This pattern will match any non-newline character that appears twice in a row.
- If you need to find the first word in a string, then do this:
m/^\s*(\w+)/;
After this statement, $1 will hold the
first word in the string. Any whitespace at the beginning of the string will
be skipped by the \s* meta-character sequence. Then the \w+
meta-character sequence will match the next word. Note that the * -
which matches zero or more - is used to match the whitespace because there may
not be any. The + - which matches one or more - is used for the word.
- If you need to find the last word in a string, then do this:
m/
(\w+) (?# Match a word, store its value into pattern memory)
[.!?]? (?# Some strings might hold a sentence. If so, this)
(?# component will match zero or one punctuation)
(?# characters)
\s* (?# Match trailing whitespace using the * because there)
(?# might not be any)
$ (?# Anchor the match to the end of the string)
/x;
After this statement, $1 will hold the last word in the
string. You need to expand the character class, [.!?], by adding more
punctuation.
- If you need to know that there are only two words in a string, you can do
this:
m/^(\w+)\W+(\w+)$/x;
After this statement, $1 will hold
the first word and $2 will hold the second word, assuming that the
pattern matches. The pattern starts with a caret and ends with a dollar sign,
which means that the entire string must match the pattern. The \w+
meta-character sequence matches one word. The \W+ meta-character
sequence matches the non-word characters between words, such as whitespace or punctuation. You can test for additional
words by adding one \W+(\w+) meta-character sequence for each
additional word to match.
- If you need to know that there are only two words in a string while
ignoring leading or trailing spaces, you can do this:
m/^\s*(\w+)\W+(\w+)\s*$/;
After this statement, $1 will
hold the first word and $2 will hold the second word, assuming that
the pattern matches. The \s* meta-character sequence will match any
leading or trailing whitespace.
- If you need to assign the first two words in a string to $one and
$two and the rest of the string to $rest, you can do this:
$_ = "This is the way to San Jose.";
$word = '\w+'; # match a whole word.
$space = '\W+'; # match at least one non-word character
$string = '.*'; # match any number of anything except
# for the newline character.
($one, $two, $rest) = (m/^($word) $space ($word) $space ($string)/x);
After
this statement, $one will hold the first word, $two will
hold the second word, and $rest will hold everything else in the
$_ variable. This example uses variable interpolation to, hopefully,
make the match pattern easier to read. This technique also emphasizes which
meta-sequence is used to match words and whitespace. It lets the reader focus
on the whole of the pattern rather than the individual pattern components by
adding a level of abstraction.
- If you need to see if $_ contains a legal Perl variable name, you
can do this:
$result = m/
^ (?# Anchor the pattern to the start of the string)
[\$\@\%] (?# Use a character class to match the first)
(?# character of a variable name)
[a-z] (?# Use a character class to ensure that the)
(?# first character of the name is a letter)
\w* (?# Use a character class to ensure that the)
(?# rest of the variable name is either an)
(?# alphanumeric or an underscore character)
$ (?# Anchor the pattern to the end of the)
(?# string. This means that for the pattern to)
(?# match, the variable name must be the only)
(?# value in the searched string.)
/ix; # Use the /i option so that the search is
# case-insensitive and use the /x option to
# allow extensions.
After this statement,
$result will be true if $_ contains a legal variable name
and false if it does not.
- If you need to see if $_ contains a legal integer literal, you
can do this:
$result = m/
(?# First check for a plain decimal number)
^ (?# Anchor to the start of the string)
\d+ (?# Match one or more digits)
$ (?# Anchor to the end of the string)
| (?# or)
(?# Now check for hexadecimal numbers)
^ (?# Anchor to the start of the string)
0x (?# The "0x" sequence starts a hexadecimal number)
[\da-f]+ (?# Match one or more hexadecimal characters)
$ (?# Anchor to the end of the string)
/ix;
After this statement, $result will be true if $_
contains an integer literal and false if it does not.
- If you need to match all legal integers in $_, you can do this:
@results = m/\b(?:0x[\da-f]+|\d+)\b/gi;
After this statement,
@results will contain a list of all integer literals in $_.
@results will contain an empty list if no literals were found.
- If you need to match the end of the first word in a string, you can do
this:
m/\w\W/;
After this statement is executed, $& will
hold the last character of the first word and the next character that follows
it. If you want only the last character, use pattern memory,
m/(\w)\W/;. Then $1 will be equal to the last character of
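the first word. If you use the global option with pattern memory,
@array = m/(\w)\W/g;, then you can create an array that holds the last
character of each word that is followed by a non-word character. A word at the
very end of the string is not included because nothing follows it.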
- If you need to match the start of the second word in a string, you can do
this:
m/\W\w/;
After this statement, $& will hold the
first character of the second word and the whitespace character that
immediately precedes it. While this pattern is the opposite of the pattern
that matches the end of words, it will not match the beginning of the first
word! This is because of the \W meta-character. Simply adding a
* meta-character to the pattern after the \W does not help,
because then it would match on zero non-word characters and therefore match
every word character in the string.
- If you need to match the file name in a file specification, you can do
this:
$_ = '/user/Jackie/temp/names.dat';
m!^.*/(.*)!;
After this match statement, $1 will equal
names.dat. The match is anchored to the beginning of the string, and
the .* component matches everything up to the last slash because
regular expressions are greedy. Then the next (.*) matches the file
name and stores it into pattern memory. You can store the file path into
pattern memory by placing parentheses around the first .* component.
- If you need to match two prefixes and one root word, like "rockfish" and
"monkfish," you can do this:
m/(?:rock|monk)fish/x;
The alternation meta-character is used to
say that either rock or monk followed by fish needs
to be found. If you need to know which alternative was found, then use regular
parentheses in the pattern. After the match, $1 will be equal to
either rock or monk.
- If you want to search a file for a string and print some of the
surrounding lines, you can do this:
# read the whole file into memory.
open(FILE, "<fndstr.dat") or die("Cannot open fndstr.dat: $!");
@array = <FILE>;
close(FILE);
# specify which string to find.
$stringToFind = "A";
# iterate over the array looking for the
# string. The $#array notation gives the
# index of the last element in the array.
for ($index = 0; $index <= $#array; $index++) {
    last if $array[$index] =~ /$stringToFind/;
}
# Use $index to work out the range of lines
# to print: two lines before and two lines
# after the line that contains the match,
# without running off either end of the array.
$first = $index - 2;
$first = 0 if $first < 0;
$last = $index + 2;
$last = $#array if $last > $#array;
foreach $line ($first..$last) {
    print("$line: $array[$line]");
}
There are many ways to perform this type of search, and this is
just one of them. This technique is only good for relatively small files
because the entire file is read into memory at once. In addition, the program
assumes that the input file always contains the string that you are looking
for.
- If you need to remove whitespace from the beginning of a string, you can
do this:
s/^\s+//;
This pattern uses the \s predefined character
class to match any whitespace character. The plus sign means to match one or
more whitespace characters, and the caret means match only at the beginning of
the string.
- If you need to remove whitespace from the end of a string, you can do
this:
s/\s+$//;
This pattern uses the \s predefined character
class to match any whitespace character. The plus sign means to match one or
more whitespace characters, and the dollar sign means match only at the end of
the string.
- If you need to add a prefix to a string, you can do this:
$prefix = "A";
s/^(.*)/$prefix$1/;
When the substitution is done, the value in the
$prefix variable will be added to the beginning of the $_
variable. This is done by using variable interpolation and pattern memory. Of
course, you might also consider using the string concatenation operator; for
instance, $_ = "A" . $_;, which is probably faster.
- If you need to add a suffix to a string, you can do this:
$suffix = "Z";
s/^(.*)/$1$suffix/;
When the substitution is done, the value in the
$suffix variable will be added to the end of the $_
variable. This is done by using variable interpolation and pattern memory. Of
course, you might also consider using the string concatenation operator; for
instance, $_ .= "Z";, which is probably faster.
- If you need to reverse the first two words in a string, you can do this:
s/^\s*(\w+)\W+(\w+)/$2 $1/;
This substitution statement uses the
pattern memory variables $1 and $2 to reverse the first two
words in a string. You can use a similar technique to manipulate columns of
information, the last two words, or even to change the order of more than two
matches.
- If you need to duplicate each character in a string, you can do this:
s/\w/$& x 2/eg;
When the substitution is done, each word
character in $_ will be repeated. If the original string was
"123abc", the new string would be "112233aabbcc". The
e option is used to force evaluation of the replacement string. The
$& special variable is used in the replacement pattern to
reference the matched string, which is then repeated by the string repetition
operator.
- If you need to capitalize all the words in a sentence, you can do this:
s/(\w+)/\u$1/g;
When the substitution is done, each word in
$_ will have its first letter capitalized. The /g option
means that each word - the \w+ meta-sequence - will be matched and
placed in $1. Then it will be replaced by \u$1. The
\u will capitalize whatever follows it; in this case, it's the
matched word.
- If you need to insert a string between two repeated characters, you can do
this:
$_ = "!!!!";
$char = "!";
$insert = "AAA";
s{
($char) # look for the specified character.
(?=$char) # look for it again, but don't include
# it in the matched string, so the next
} # search will also find it.
{
$char . $insert # concatenate the specified character
# with the string to insert.
}xeg; # use extended mode, evaluate the
# replacement pattern, and match all
# possible strings.
print("$_\n");
This example uses the extended mode to add comments
directly inside the regular expression. This makes it easy to relate the
comment directly to a specific pattern element. The match pattern does not
directly reflect the originally stated goal of inserting a string between two
repeated characters. Instead, the example was quietly restated. The new goal
is to substitute all instances of $char with $char .
$insert, if $char is followed by $char. As you can see, the
end result is the same. Remember that sometimes you need to think outside the
box.
- If you need to do a second level of variable interpolation in the
replacement pattern, you can do this:
s/(\$\w+)/$1/eeg;
This is a simple example of secondary variable
interpolation. If $firstVar = "AAA" and $_ = '$firstVar',
then $_ would be equal to "AAA" after the substitution was
made. The key is that the replacement pattern is evaluated twice. This
technique is very powerful. It can be used to develop error messages used with
variable interpolation.
$errMsg = "File too large";
$fileName = "DATA.OUT";
$_ = 'Error: $errMsg for the file named $fileName';
s/(\$\w+)/$1/eeg;
print;
When this program is run, it will display
Error: File too large for the file named DATA.OUT
The values of
the $errMsg and $fileName variables were interpolated into
the replacement pattern as needed.
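A few of the patterns from this section can be checked together in one short script. This is only a sketch: the sample strings other than the file specification are made up, and the binding operator is used in place of $_ so that each test is independent of the others.

```perl
use strict;
use warnings;

# Repeated character: (.)\1 finds the "AA" in the sample string.
my ($repeated) = "ABC AA ABC" =~ m/(.)\1/;

# File name: the greedy .* runs to the last slash.
my ($file) = '/user/Jackie/temp/names.dat' =~ m!^.*/(.*)!;

# Swap the first two words.
my $swapped = "hello big world";
$swapped =~ s/^\s*(\w+)\W+(\w+)/$2 $1/;

# Capitalize the first letter of every word.
my $title = "the quick brown fox";
$title =~ s/(\w+)/\u$1/g;

print "$repeated | $file | $swapped | $title\n";
```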
This chapter introduced you to regular
expressions or patterns, regular expression operators, and the binding
operators. There are three regular expression operators -
m//,
s///, and
tr/// - which are used to match, substitute, and
translate and use the
$_ variable as the default operand. The binding
operators,
=~ and
!~, are used to bind the regular expression
operators to a variable other than
$_.
While the slash character is the default pattern delimiter, you can use any
character in its place. This feature is useful if the pattern contains the slash
character. If you use an opening bracket or parenthesis as the beginning
delimiter, use the closing bracket or parenthesis as the ending delimiter. Using
the single-quote as the delimiter will turn off variable interpolation for the
pattern.
The matching operator has six options:
/g,
/i,
/m,
/o,
/s, and
/x. These options were described in Table
10.2. I've found that the
/x option is very helpful for creating
maintainable, commented programs. The
/g option, used to find all
matches in a string, is also very useful. And, of course, the capability to
create case-insensitive patterns using the
/i option is crucial in many
cases.
The substitution operator has the same options as the matching operator and
one more - the
/e option. The
/e option lets you evaluate the
replacement pattern and use the new value as the replacement string. In Perl 5,
back-quotes used as delimiters are treated like any other delimiter; the
replacement is not executed as a command.
The translation operator has three options:
/c,
/d, and
/s. These options are used to complement the match character list,
delete matched characters that have no corresponding replacement character,
and squeeze runs of a repeated translated character into a single character.
The operator returns the number of characters matched. If no replacement list
is specified, the string is left unchanged, which is handy if you need to know
how many times a given character appears in a string.
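As a quick sketch of the counting idiom and the /s option, consider the following script. The sample strings and variable names are invented for the example.

```perl
use strict;
use warnings;

my $text = "banana bandana";

# With an empty replacement list, tr/// changes nothing and
# simply returns how many characters it matched.
my $count_a = ($text =~ tr/a//);

# The /s option squeezes each run of a repeated translated
# character down to a single character.
my $squeezed = "aaa   bbb";
(my $copy = $squeezed) =~ tr/a-z //s;

print "a appears $count_a times\n";
print "squeezed: '$copy'\n";
```

The parenthesized assignment copies the string first, so the translation changes $copy and leaves $squeezed alone.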
The binding operators are used to force the matching, substitution, and
translation operators to search a variable other than
$_. The
=~ operator can be used with all three of the regular expression
operators, while the
!~ operator, which negates the result, is normally used only with the matching
operator.
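The following sketch binds all three operators to variables other than $_. The variable names and the sample message are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical variable names; none of these operators touch $_.
my $line = "Error: disk full";

my $is_error   = ($line =~ m/^Error/)   ? 1 : 0;  # match against $line
my $no_warning = ($line !~ m/^Warning/) ? 1 : 0;  # true when there is NO match

(my $quiet = $line) =~ s/^Error:\s*//;            # substitute in a copy
(my $upper = $line) =~ tr/a-z/A-Z/;               # translate in a copy

print "$is_error $no_warning '$quiet' '$upper'\n";
```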
Quite a bit of space was devoted to creating patterns, and the topic deserves
even more space. This is easily one of the more involved features of the Perl
language. One key concept is that a character can have multiple meanings. For
example, the plus sign can mean a plus sign in one instance (its literal
meaning), and in another it means match something one or more times (its
meta-meaning).
You learned about regular expression components and that they can be combined
in an infinite number of ways. Table 10.5 listed most of the meta-meanings for
different characters. You read about character classes, alternation,
quantifiers, anchors, pattern memory, word boundaries, and extended components.
The last section of the chapter was devoted to presenting numerous examples
of how to use regular expressions to accomplish specific goals. Each situation
was described, and a pattern that matched that situation was shown. Some
commentary was given for each example.
In the next chapter, you'll read about how to present information by using
formats. Formats are used to help relieve some of the programming burden from
the task of creating reports.
- Can you use variable interpolation with the translation operator?
- What happens if the pattern is empty?
- What variable does the substitution operator use as its default?
- Will the following line of code work?
m{.*];
- What is the /g option of the substitution operator used for?
- What does the \d meta-character sequence mean?
- What is the meaning of the dollar sign in the following pattern?
/AA[.<]$]ER/
- What is a word boundary?
- What will be displayed by the following program?
$_ = 'AB AB AC';
print m/c$/i;
- Write a pattern that matches either "top" or "topgun".
- Write a program that accepts input from STDIN and changes all instances of
the letter a into the letter b.
- Write a pattern that stores the first character to follow a tab into
pattern memory.
- Write a pattern that matches the letter g between 3 and 7
times.
- Write a program that finds repeated words in an input file and prints the
repeated word and the line number on which it was found.
- Create a character class for octal numbers.
- Write a program that uses the translation operator to remove repeated
instances of the tab character and then replaces the tab character with a
space character.
- Write a pattern that matches either "top" or "topgun"
using a zero-width positive look-ahead assertion.
ref : http://affy.blogspot.com/p5be/ch10.htm#The%20Matching%20Operator%20%28m//%29