Regular expressions are complex enough that you could write a whole book on them (Mastering Regular Expressions by Jeffrey Friedl).
Simple matching
The simplest regular expressions are matching expressions. They perform tests using keywords like if, while and unless. If you want to be really clever, you can combine them with the and and or operators. A matching regexp will return a true value if whatever you try to match occurs inside a string. To match a regular expression against a string, use the special =~ operator:

    use 5.010;

    my $user_location = "I see thirteen black cats under a ladder.";
    say "Eek, bad luck!" if $user_location =~ /thirteen/;

Notice the syntax of a regular expression: a string within a pair of slashes. The code $user_location =~ /thirteen/ asks whether the literal string thirteen occurs anywhere inside $user_location. If it does, then the test evaluates true; otherwise, it evaluates false.

Metacharacters
A metacharacter is a character or sequence of characters that has special meaning. You may remember metacharacters in the context of double-quoted strings, where the sequence \n means the newline character, not a backslash and the character n, and where \t means the tab character.

Regular expressions have a rich vocabulary of metacharacters that let you ask interesting questions such as, "Does this expression occur at the end of a string?" or "Does this string contain a series of numbers?"
The two simplest metacharacters are ^ and $. These indicate "beginning of string" and "end of string," respectively. For example, the regexp /^Bob/ will match "Bob was here," "Bob", and "Bobby." It won't match "It's Bob and David," because Bob doesn't occur at the beginning of the string. The $ character, on the other hand, matches at the end of a string. The regexp /David$/ will match "Bob and David," but not "David and Bob." Here's a simple routine that will take lines from a file and only print URLs that seem to indicate HTML files:

    for my $line (<$urllist>) {
        # "If the line starts with http: and ends with html...."
        print $line if $line =~ /^http:/ and $line =~ /html$/;
    }

Another useful set of metacharacters is called wildcards. If you've ever used a Unix shell or the Windows DOS prompt, you're familiar with wildcard characters such as * and ?. For example, when you type ls a*.txt, you see all filenames that begin with the letter a and end with .txt. Perl is a bit more complex, but works on the same general principle.

In Perl, the generic wildcard character is the period (.). A period inside a regular expression will match any character, except a newline. For example, the regexp /a.b/ will match anything that contains a, another character that's not a newline, followed by b -- "aab," "a3b," "a b," and so forth.

To match a literal metacharacter, escape it with a backslash. The regex /Mr./ matches anything that contains "Mr" followed by another character. If you only want to match a string that actually contains "Mr.," use /Mr\./.

On its own, the . metacharacter isn't very useful, which is why Perl provides three wildcard quantifiers: +, ? and *. Each quantifier means something different.

The + quantifier is the easiest to understand: It means to match the immediately preceding character or metacharacter one or more times. The regular expression /ab+c/ will match "abc," "abbc," "abbbc", and so on.

The * quantifier matches the immediately preceding character or metacharacter zero or more times. This is different from the + quantifier! /ab*c/ will match "abc," "abbc," and so on, just like /ab+c/ did, but it'll also match "ac," because there are zero occurrences of b in that string.

Finally, the ? quantifier will match the preceding character zero or one times. The regex /ab?c/ will match "ac" (zero occurrences of b) and "abc" (one occurrence of b). It won't match "abbc," "abbbc", and so on.

The URL-matching code can be more concise with these metacharacters. Instead of using two separate regular expressions (/^http:/ and /html$/), combine them into one regular expression: /^http:.+html$/. To understand what this does, read from left to right: This regex will match any string that starts with "http:" followed by one or more occurrences of any character, and ends with "html". Now the routine is:

    for my $line (<$urllist>) {
        print $line if $line =~ /^http:.+html$/;
    }

Remember the /^something$/ construction -- it's very useful!

Character classes
The special metacharacter . matches any character except a newline. It's common to want to match only specific types of characters. Perl provides several metacharacters for this. \d matches a single digit, \w will match any single "word" character (a letter, digit or underscore), and \s matches a whitespace character (space and tab, as well as the \n and \r characters).

These metacharacters work like any other character: You can match against them, or you can use quantifiers like + and *. The regex /^\s+/ will match any string that begins with whitespace, and /\w+/ will match a string that contains at least one word. (Though remember that Perl's definition of "word" characters includes digits and the underscore, so whether or not you think _ or 25 are words, Perl does!)

One good use for \d is testing strings to see whether they contain numbers. For example, you might need to verify that a string contains an American-style phone number, which has the form 555-1212. You could use code like this:

    use 5.010;

    say "Not a phone number!" unless $phone =~ /\d\d\d-\d\d\d\d/;

All those \d metacharacters make the regex hard to read. Fortunately, Perl can do better. Use numbers inside curly braces to indicate a quantity you want to match:

    use 5.010;

    say "Not a phone number!" unless $phone =~ /\d{3}-\d{4}/;

The string \d{3} means to match exactly three digits, and \d{4} matches exactly four digits. To use a range of numbers, you can separate them with a comma; leaving out the second number makes the range open-ended. \d{2,5} will match two to five digits, and \w{3,} will match a word that's at least three characters long.

You can also invert the \d, \s and \w metacharacters to refer to anything but that type of character. \D matches nondigits; \W matches any character that isn't a letter, digit, or underscore; and \S matches anything that isn't whitespace.

If these metacharacters won't do what you want, you can define your own. You define a character class by enclosing a list of the allowable characters in square brackets. For example, a class containing only the lowercase vowels is [aeiou]. /b[aeiou]g/ will match any string that contains "bag," "beg," "big," "bog", or "bug". Use dashes to indicate a range of characters, like [a-f]. (If Perl didn't give us the \d metacharacter, we could do the same thing with [0-9].) You can combine character classes with quantifiers:

    use 5.010;

    say "This string contains at least two vowels in a row." if $string =~ /[aeiou]{2}/;

You can also invert character classes by beginning them with the ^ character. An inverted character class will match anything you don't list. [^aeiou] matches every character except the lowercase vowels. (Yes, ^ can also mean "beginning of string," so be careful.)

Flags
By default, regular expression matches are case-sensitive (that is, /bob/ doesn't match "Bob"). You can place flags after a regexp to modify its behaviour. The most commonly used flag is i, which makes a match case-insensitive:

    use 5.010;

    my $greet = "Hey everybody, it's Bob and David!";
    say "Hi, Bob!" if $greet =~ /bob/i;
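The pieces covered so far (anchors, character classes, quantifiers and the i flag) combine naturally. The following is only a minimal sketch: the helper names looks_like_phone and mentions_bob are made up for illustration and do not come from the article. It shows how anchoring the phone-number pattern with ^ and $ rejects strings that merely contain a phone number somewhere inside them.

```perl
use 5.010;
use strict;
use warnings;

# Hypothetical helper: true if the string is nothing but a seven-digit
# American-style phone number such as "555-1212". The ^ and $ anchors
# keep other text from surrounding the number (a single trailing
# newline is still tolerated by $).
sub looks_like_phone {
    my ($text) = @_;
    return $text =~ /^\d{3}-\d{4}$/ ? 1 : 0;
}

# Hypothetical helper: true if the string mentions "bob" in any letter case.
sub mentions_bob {
    my ($text) = @_;
    return $text =~ /bob/i ? 1 : 0;
}

for my $candidate ("555-1212", "call 555-1212 now", "5551212") {
    my $verdict = looks_like_phone($candidate) ? "a phone number" : "not a phone number";
    say "'$candidate' is $verdict";
}

say "Hi, Bob!" if mentions_bob("Hey everybody, it's BOB and David!");
```

Without the anchors, the unanchored /\d{3}-\d{4}/ from earlier would also accept "call 555-1212 now", which may or may not be what you want.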
Subexpressions
You might want to check for more than one thing at a time. For example, you're writing a "mood meter" that you use to scan outgoing e-mail for potentially damaging phrases. Use the pipe character | to separate the different things you are looking for:

    use 5.010;

    # In reality, @email_lines would come from your email text,
    # but here we'll just provide some convenient filler.
    my @email_lines = (
        "Dear idiot:",
        "I hate you, you twit. You're a dope.",
        "I bet you mistreat your llama.",
        "Signed, Doug",
    );

    for my $check_line (@email_lines) {
        if ($check_line =~ /idiot|dope|twit|llama/) {
            say "Be careful! This line might contain something offensive:\n$check_line";
        }
    }

The matching expression /idiot|dope|twit|llama/ will be true if "idiot," "dope," "twit" or "llama" shows up anywhere in the string.

One of the more interesting things you can do with regular expressions is subexpression matching, or grouping. A subexpression is another, smaller regex buried inside your larger regexp within matching parentheses. The string that caused the subexpression to match will be stored in the special variable $1. This can make your mood meter more explicit about the problems with your e-mail:

    for my $check_line (@email_lines) {
        if ($check_line =~ /(idiot|dope|twit|llama)/) {
            say "Be careful! This line contains the offensive word '$1':\n$check_line";
        }
    }

Of course, you can put matching expressions in your subexpression. Your mood watch program can be extended to prevent you from sending e-mail that contains three or more exclamation points in a row. The special {3,} quantifier will make sure to get all the exclamation points.

    for my $check_line (@email_lines) {
        if ($check_line =~ /(!{3,})/) {
            say "Using punctuation like '$1' is the sign of a sick mind:\n$check_line";
        }
    }

If your regex contains more than one subexpression, the results will be stored in variables named $1, $2, $3 and so on. Here's some code that will change names in "lastname, firstname" format back to normal:

    my $name = 'Wall, Larry';
    $name =~ /(\w+), (\w+)/;
    # $1 contains last name, $2 contains first name

    $name = "$2 $1";
    # $name now contains "Larry Wall"

You can even nest subexpressions inside one another -- they're ordered as they open, from left to right. Here's an example of how to retrieve the full time, hours, minutes and seconds separately from a string that contains a timestamp in hh:mm:ss format. (Notice the use of the {1,2} quantifier to match a timestamp like "9:30:50".)

    my $string = "The time is 12:25:30 and I'm hungry.";
    if ($string =~ /((\d{1,2}):(\d{2}):(\d{2}))/) {
        my @time = ($1, $2, $3, $4);
    }

Here's a hint that you might find useful: You can assign to a list of scalar values whenever you're assigning from a list. If you prefer to have readable variable names instead of an array, try using this line instead:
    my ($time, $hours, $minutes, $seconds) = ($1, $2, $3, $4);

Assigning to a list of variables when you're using subexpressions happens often enough that Perl gives you a handy shortcut. In list context, a successful regular expression match returns its captured variables in the order in which they appear within the regexp:

    my ($time, $hours, $minutes, $seconds) = $string =~ /((\d{1,2}):(\d{2}):(\d{2}))/;

Counting parentheses to see where one group begins and another group ends is troublesome though. Perl 5.10 added a new feature, lovingly borrowed from other languages, where you can give names to capture groups and access the captured values through the special hash %+. This is most obvious by example:

    my $name = 'Wall, Larry';
    $name =~ /(?<last>\w+), (?<first>\w+)/;
    # %+ contains all named captures

    $name = "$+{first} $+{last}";
    # $name now contains "Larry Wall"

There's a common mistake related to captures, namely assuming that $1 and %+ et al will hold meaningful values if the match failed:

    my $name = "Damian Conway";
    # no comma, so the match will fail!
    $name =~ /(?<last>\w+), (?<first>\w+)/;

    # and there's nothing in the capture buffers
    $name = "$+{first} $+{last}";
    # $name now contains a blank space

Always check the success or failure of your regular expression when working with captures!

    my $name = "Damian Conway";
    $name = "$+{first} $+{last}" if $name =~ /(?<last>\w+), (?<first>\w+)/;
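One way to make the "always check" habit hard to forget is to hide the match behind a small sub. The sketch below is illustrative only: normalize_name and extract_time are hypothetical names, and the extra ^ and $ anchors in the first pattern are an assumption about wanting the whole string to be a name. It relies on the list-context behaviour described above, where a failed match returns an empty list.

```perl
use 5.010;
use strict;
use warnings;

# Hypothetical helper: turn "Last, First" into "First Last".
# Returns the input unchanged when the pattern does not match,
# so the captures are never read after a failed match.
sub normalize_name {
    my ($name) = @_;
    if ($name =~ /^(?<last>\w+), (?<first>\w+)$/) {
        return "$+{first} $+{last}";
    }
    return $name;
}

# Hypothetical helper: pull hours, minutes and seconds out of the first
# hh:mm:ss timestamp in the text. Returns an empty list on failure.
sub extract_time {
    my ($text) = @_;
    my ($hours, $minutes, $seconds) = $text =~ /(\d{1,2}):(\d{2}):(\d{2})/;
    return defined $hours ? ($hours, $minutes, $seconds) : ();
}

say normalize_name('Wall, Larry');
say normalize_name('Damian Conway');

my @time = extract_time("The time is 12:25:30 and I'm hungry.");
say "hours=$time[0] minutes=$time[1] seconds=$time[2]" if @time;
```

The caller only ever sees either a usable result or an unchanged string or empty list, never leftovers from the capture variables.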
Watch out!
Regular expressions have two other traps that can generate bugs in your Perl programs: a match always starts at the beginning of the string, and quantifiers always match as much of the string as possible.

Here's some simple code for counting all the numbers in a string and showing them to the user. It uses while to loop over the string, matching over and over until it has counted all the numbers.

    use 5.010;

    my $number = "Look, 200 5-sided, 4-colored pentagon maps.";
    my $number_count = 0;

    while ($number =~ /(\d+)/) {
        say "I found the number $1.";
        $number_count++;
    }

    say "There are $number_count numbers here.";

This code is actually so simple it doesn't work! When you run it, Perl will print I found the number 200. over and over again. Perl always begins matching at the beginning of the string, so it will always find the 200, and never get to the following numbers.

You can avoid this by using the g flag with your regex. This flag will tell Perl to remember where it was in the string when it returns to it (due to a while loop). When you insert the g flag, the code becomes:

    use 5.010;

    my $number = "Look, 200 5-sided, 4-colored pentagon maps.";
    my $number_count = 0;

    while ($number =~ /(\d+)/g) {
        say "I found the number $1.";
        $number_count++;
    }

    say "There are $number_count numbers here.";

Now you get the expected results:
    I found the number 200.
    I found the number 5.
    I found the number 4.
    There are 3 numbers here.

The second trap is that a quantifier will always match as many characters as it can. Look at this example code, but don't run it yet:

    use 5.010;

    my $book_pref = "The cat in the hat is where it's at.\n";
    say $+{match} if $book_pref =~ /(?<match>cat.*at)/;

Take a guess: What's in $+{match} right now? Now run the code. Does this seem counterintuitive?

The matching expression cat.*at is greedy. $+{match} contains cat in the hat is where it's at because that's the longest string that matches. Remember, read left to right: "cat," followed by any number of characters, followed by "at." If you want to match the string cat in the hat, you have to rewrite your regexp so it isn't as greedy. There are two ways to do this:

- Make the match more precise (try /(?<match>cat.*hat)/ instead). Of course, this still might not work -- try using this regexp against The cat in the hat is who I hate.
- Use a ? character after a quantifier to specify non-greedy matching. .*? instead of .* means that Perl will try to match the smallest string possible instead of the largest:

    # Now we get "cat in the hat" in $+{match}.
    say $+{match} if $book_pref =~ /(?<match>cat.*?at)/;
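To see the difference side by side, here is a small sketch that runs both patterns against the same string. It uses plain numbered captures in list context instead of named captures, purely to keep it short; the variable names $greedy and $lazy are illustrative.

```perl
use 5.010;
use strict;
use warnings;

my $book_pref = "The cat in the hat is where it's at.";

# Greedy: .* grabs as much as it can while still leaving an "at" to finish on.
my ($greedy) = $book_pref =~ /(cat.*at)/;

# Non-greedy: .*? stops at the first place where "at" can follow.
my ($lazy) = $book_pref =~ /(cat.*?at)/;

say "greedy: $greedy";   # cat in the hat is where it's at
say "lazy:   $lazy";     # cat in the hat
```

Both patterns start matching at the same place; only the amount of text swallowed by the quantifier changes.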
Search and replace
Regular expressions can do something else for you: replacing.

If you've ever used a text editor or word processor, you've probably used its search-and-replace function. Perl's regexp facilities include something similar, the s/// operator: s/regex/replacement string/. If the string you're testing matches regex, then whatever matched is replaced with the contents of replacement string. For instance, this code will change a cat into a dog:

    use 5.010;

    my $pet = "I love my cat.";
    $pet =~ s/cat/dog/;
    say $pet;

You can also use subexpressions in your matching expression, and use the variables $1, $2 and so on, that they create. The replacement string will substitute these, or any other variables, as if it were a double-quoted string. Remember the code for changing Wall, Larry into Larry Wall? It makes a fine single s/// statement!

    my $name = 'Wall, Larry';
    $name =~ s/(\w+), (\w+)/$2 $1/; # "Larry Wall"

You don't have to worry about using captures if the match fails; the substitution won't take place. Of course, named captures work equally well:

    my $name = 'Wall, Larry';
    $name =~ s/(?<last>\w+), (?<first>\w+)/$+{first} $+{last}/; # "Larry Wall"
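The claim that a failed match leaves the string alone is easy to check. The sketch below runs the same substitution over a few sample names (the list itself is made up for illustration) and uses the fact that s/// returns a true value only when it replaced something.

```perl
use 5.010;
use strict;
use warnings;

# Sample data for illustration; only the first and last entries
# are in "lastname, firstname" format.
my @names = ('Wall, Larry', 'Damian Conway', 'Christiansen, Tom');

for my $name (@names) {
    # $name is an alias for the array element, so s/// edits @names in place.
    if ($name =~ s/(?<last>\w+), (?<first>\w+)/$+{first} $+{last}/) {
        say "Rewrote to: $name";
    }
    else {
        say "Left alone: $name";
    }
}
```

Testing the return value of s/// this way gives you the success check for free, with no separate match needed.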
s/// can take flags, just like matching expressions. The two most important flags are g (global) and i (case-insensitive). Normally, a substitution will only happen once, but specifying the g flag makes it replace every match in the string. Try this code with and without the g flag:

    use 5.010;

    my $pet = "I love my cat Sylvester, and my other cat Bill.\n";
    $pet =~ s/cat/dog/g;
    say $pet;

Notice that without the g flag, Bill avoids substitution-related polymorphism.

The i flag works just as it does in matching expressions: It forces your matching search to be case-insensitive.

Maintainability
Once you start to see how patterns describe text, everything so far is reasonably simple. Regexps may start simple, but often they grow into larger beasts. There are two good techniques for making regexps more readable: adding comments and factoring them into smaller pieces.

The x flag allows you to use whitespace and comments within regexps, without them being significant to the pattern:

    my ($time, $hours, $minutes, $seconds) =
        $string =~ /(                # capture entire match
            (\d{1,2})                # one or two digits for the hour
            :
            (\d{2})                  # two digits for the minutes
            :
            (\d{2})                  # two digits for the seconds
        )
        /x;

That may be a slight improvement on the previous version of this regexp, but this technique works even better for complex regexps. Be aware that if you do need to match whitespace within the pattern, you must use \s or an equivalent.

Adding comments is helpful, but sometimes giving a name to a particular piece of code is sufficient clarification. The qr// operator compiles but does not execute a regexp, producing a regexp object that you can use inside a match or substitution:

    my $two_digits = qr/\d{2}/;

    my ($time, $hours, $minutes, $seconds) =
        $string =~ /(                # capture entire match
            (\d{1,2})                # one or two digits for the hour
            :
            ($two_digits)            # minutes
            :
            ($two_digits)            # seconds
        )
        /x;

Of course, you can use all of the previous techniques as well:

    use 5.010;

    my $two_digits        = qr/\d{2}/;
    my $one_or_two_digits = qr/\d{1,2}/;

    my ($time, $hours, $minutes, $seconds) =
        $string =~ /(?<time>
            (?<hours> $one_or_two_digits)
            :
            (?<minutes> $two_digits)
            :
            (?<seconds> $two_digits)
        )
        /x;

Note that the captures are available through %+ as well as in the list of values returned from the match.

Putting it all together
Regular expressions have many practical uses. Consider a httpd log analyzer as an example. One of the play-around items in the previous article was to write a simple log analyzer. You can make it more interesting; how about a log analyzer that will break down your log results by file type and give you a list of total requests by hour?

Here's a sample line from a httpd log:

    127.12.20.59 - - [01/Nov/2000:00:00:37 -0500] "GET /gfx2/page/home.gif HTTP/1.1" 200 2285

The first task is to split this into fields. Remember that the split() function takes a regular expression as its first argument. Use /\s/ to split the line at each whitespace character:

    my @fields = split /\s/, $line;

This gives 10 fields. The interesting fields are the fourth field (time and date of request), the seventh (the URL), and the ninth and 10th (HTTP status code and size in bytes of the server response).
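To check the field numbering, here is a short sketch that splits the sample line from above and prints the interesting fields. Keep in mind that Perl arrays count from zero, so the fourth field is $fields[3], the seventh is $fields[6], and so on.

```perl
use 5.010;
use strict;
use warnings;

# The sample log line quoted in the text.
my $line = '127.12.20.59 - - [01/Nov/2000:00:00:37 -0500] "GET /gfx2/page/home.gif HTTP/1.1" 200 2285';

my @fields = split /\s/, $line;

say "Number of fields: ", scalar @fields;
say "Time and date:    $fields[3]";
say "URL:              $fields[6]";
say "Status code:      $fields[8]";
say "Bytes sent:       $fields[9]";
```

Notice that the timestamp is split across $fields[3] and $fields[4] because it contains a space; that is fine here, since the hour lives in the first half.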
Step one is canonicalization: turning any request for a URL that ends in a slash (like /about/) into a request for the index page from that directory (/about/index.html). Remember to escape the slashes so that Perl doesn't consider them the terminating characters of the match or substitution:

    $fields[6] =~ s/\/$/\/index.html/;

This line is difficult to read; it suffers from leaning-toothpick syndrome. Here's a useful trick for avoiding it: replace the slashes that mark regular expressions and s/// statements with any other matching pair of characters, such as { and }. This allows you to write a more legible regex where you don't need to escape the slashes:

    $fields[6] =~ s{/$}{/index.html};

(To use this syntax with a matching expression, put an m in front of it. /foo/ becomes m{foo}.)

Step two is to assume that any URL request that returns a status code of 200 (a successful request) is a request for the file type of the URL's extension (a request for /gfx/page/home.gif returns a GIF image). Any URL request without an extension returns a plain-text file. Remember that the period is a metacharacter, so escape it!

    if ($fields[8] eq '200') {
        if ($fields[6] =~ /\.([a-z]+)$/i) {
            $type_requests{$1}++;
        }
        else {
            $type_requests{txt}++;
        }
    }

Next, retrieve the hour when each request took place. The hour is the first string in $fields[3] that will be two digits surrounded by colons, so all you need to do is look for that. Remember that Perl will stop when it finds the first match in a string, and remember to check that the match succeeded before using $1:

    # Log the hour of this request
    if ($fields[3] =~ /:(\d{2}):/) {
        $hour_requests{$1}++;
    }

Finally, rewrite the original report() sub. We're doing the same thing over and over (printing a section header and the contents of that section), so we'll break that out into a new sub. We'll call the new sub report_section():

    sub report {
        print "Total bytes requested: ", $bytes, "\n";
        print "\n";

        report_section("URL requests:", %url_requests);
        report_section("Status code results:", %status_requests);
        report_section("Requests by hour:", %hour_requests);
        report_section("Requests by file type:", %type_requests);
    }

The new report_section() sub is very simple:

    sub report_section {
        my ($header, %types) = @_;

        say $header;

        for my $type (sort keys %types) {
            say "$type: $types{$type}";
        }

        print "\n";
    }

The keys operator returns a list of the keys in the %types hash, and the sort operator puts them in alphabetic order. The next article will explain sort in more detail.

Play around!
As usual, here are some sample exercises.

- A rule of good writing is "avoid the passive voice." Instead of The report was read by Carl, say Carl read the report. Write a program that reads a file of sentences (one per line), detects and eliminates the passive voice, and prints the result. (Don't worry about irregular verbs or capitalization, though.)
- You have a list of phone numbers. The list is messy, and the only thing you know is that there are either seven or 10 digits in each number (the area code is optional), and if there's an extension, it will show up after an "x" somewhere on the line. "416 555-1212," "5551300X40" and "(306) 555.5000 ext 40" are all possible. Write a fix_phone() sub that will turn all of these numbers into the standard format "(123) 555-1234" or "(123) 555-1234 Ext 100," if there is an extension. Assume that the default area code is "123".