One of the most useful features of
Perl (if not the most useful feature) is its powerful string
manipulation facilities. At the heart of this is the regular expression
(RE) which is shared by many other UNIX utilities.
A regular expression is contained in slashes, and matching occurs with the =~ operator. The following expression is true if the string the appears in the variable $sentence.
$sentence =~ /the/
The RE is case sensitive, so if
$sentence = "The quick brown fox";
then the above match will be false.
The operator !~ is used for spotting a non-match. In the above example
$sentence !~ /the/
is true because the string the does not appear in $sentence.
We could use a conditional as
if ($sentence =~ /under/)
{
    print "We're talking about rugby\n";
}
which would print out a message if we had either of the following
$sentence = "Up and under";
$sentence = "Best winkles in Sunderland";
But it's often much easier if we assign the sentence to the special variable $_ which is of course a scalar. If we do this then we can avoid using the match and non-match operators and the above can be written simply as
if (/under/)
{
    print "We're talking about rugby\n";
}
The $_ variable is the
default for many Perl operations and tends to be used very heavily.
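As a quick check of the points above, here is a minimal sketch that applies =~, !~ and the implicit $_ match to the sample sentences. The names @sentences, @hits, $no_the and $implicit are only illustrative and do not come from the tutorial.

```perl
use strict;
use warnings;

# Sample sentences from the text above.
my @sentences = ("Up and under", "Best winkles in Sunderland", "The quick brown fox");

# Explicit match with =~ : collect the sentences containing "under".
my @hits = grep { $_ =~ /under/ } @sentences;
print "Matched: $_\n" for @hits;

# Matching is case sensitive: "The quick brown fox" has no lower-case "the".
my $sentence = "The quick brown fox";
my $no_the = ($sentence !~ /the/) ? 1 : 0;
print "No lower-case 'the' found\n" if $no_the;

# Implicit match: with the text in $_ the =~ operator can be dropped.
my $implicit = 0;
for (@sentences) {
    $implicit++ if /under/;
}
print "$implicit sentences mention 'under'\n";
```

Note that Sunderland counts as a match because the pattern is looked for anywhere in the string, not just as a whole word.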
In an RE there are plenty of special
characters, and it is these that both give them their power and make them
appear very complicated. It's best to build up your use of REs slowly; their
creation can be something of an art form.
Here are some special RE characters
and their meaning
. # Any single character except a newline
^ # The beginning of the line or string
$ # The end of the line or string
* # Zero or more of the last character
+ # One or more of the last character
? # Zero or one of the last character
and here are some example matches. Remember that they should be enclosed in /.../ slashes to be used.
t.e     # t followed by anything followed by e
        # This will match the
        #                 tre
        #                 tle
        # but not te
        #         tale
^f # f at the beginning of a line
^ftp # ftp at the beginning of a line
e$ # e at the end of a line
tle$ # tle at the end of a line
und* # un followed by zero or more d characters
# This will match un
# und
# undd
# unddd (etc)
.*      # Any string without a newline. This is because
        # the . matches anything except a newline and
        # the * means zero or more of these.
^$ # A line with nothing in it.
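The example matches above can be checked with a short script. This is only a sketch: the table @cases, the test strings (including the made-up host name ftp.example.org) and the counter $failures are assumptions added for illustration.

```perl
use strict;
use warnings;

# Each entry: [pattern, string, expected result (1 = match, 0 = no match)].
my @cases = (
    ['t.e',  'the',             1],
    ['t.e',  'te',              0],   # . needs exactly one character
    ['t.e',  'tale',            0],   # two characters between t and e
    ['^ftp', 'ftp.example.org', 1],
    ['^ftp', 'sftp',            0],   # ftp is not at the beginning
    ['tle$', 'bottle',          1],
    ['und*', 'run',             1],   # zero d characters is allowed
    ['^$',   '',                1],
);

my $failures = 0;
for my $case (@cases) {
    my ($pattern, $string, $expected) = @$case;
    my $got = ($string =~ /$pattern/) ? 1 : 0;
    $failures++ if $got != $expected;
    printf "%-6s against %-18s %s\n", "/$pattern/", "'$string'",
        $got ? "matches" : "does not match";
}
print "$failures unexpected results\n";
```

The patterns are kept in single-quoted strings so that Perl does not try to interpolate the $ in tle$ before the regular expression is built.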
There are even more options. Square
brackets are used to match any one of the characters inside them. Inside square
brackets a - indicates "between" and a ^ at the
beginning means "not":
[qjk] # Either q or j or k
[^qjk] # Neither q nor j nor k
[a-z] # Anything from a to z inclusive
[^a-z] # No lower case letters
[a-zA-Z] # Any letter
[a-z]+ # Any non-zero sequence of lower case letters
At this point you can probably skip
to the end and do at least most of the exercise. The rest is mostly just for
reference.
A vertical bar | represents
an "or" and parentheses (...) can be used to group
things together:
jelly|cream # Either jelly or cream
(eg|le)gs # Either eggs or legs
(da)+ # Either da or dada or dadada or...
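To see grouping, alternation and character classes working together, here is a small sketch. The word list and the array names are invented for the example, and the ^ and $ anchors are added so that each pattern has to account for the whole word.

```perl
use strict;
use warnings;

my @words = qw(eggs legs pegs dada jelly Cream xyz);

# (eg|le)gs: the group holds the alternatives, gs follows either one.
my @eggs_or_legs = grep { /^(eg|le)gs$/ } @words;

# (da)+: one or more repetitions of the whole group.
my @da_words = grep { /^(da)+$/ } @words;

# [a-z]+ anchored at both ends: lower case letters only.
my @lower_only = grep { /^[a-z]+$/ } @words;

print "eggs or legs: @eggs_or_legs\n";
print "da repeated:  @da_words\n";
print "lower case:   @lower_only\n";
```

Without the anchors, (eg|le)gs would also match inside a longer word such as pegs, since the pattern may match anywhere in the string.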
Here are some more special
characters:
\n # A newline
\t # A tab
\w # Any alphanumeric (word) character.
# The same as [a-zA-Z0-9_]
\W # Any non-word character.
# The same as [^a-zA-Z0-9_]
\d # Any digit. The same as [0-9]
\D # Any non-digit. The same as [^0-9]
\s # Any whitespace character: space,
# tab, newline, etc
\S # Any non-whitespace character
\b # A word boundary, outside [] only
\B # No word boundary
Clearly characters like $, |, [, ), \, / and so on are peculiar cases in regular expressions. If you want to match one of those literally then you have to precede it with a backslash. So:
\| # Vertical bar
\[ # An open square bracket
\) # A closing parenthesis
\* # An asterisk
\^ # A caret symbol
\/ # A slash
\\ # A backslash
and so on.
As was mentioned earlier, it's
probably best to build up your use of regular expressions slowly. Here are a
few examples. Remember that to use them for matching they should be put in /.../
slashes
[01]         # Either "0" or "1"
\/0          # A division by zero: "/0"
\/ 0         # A division by zero with a space: "/ 0"
\/\s0        # A division by zero with a whitespace:
             # "/ 0" where the space may be a tab etc.
\/ *0        # A division by zero with possibly some
             # spaces: "/0" or "/ 0" or "/  0" etc.
\/\s*0       # A division by zero with possibly some
             # whitespace.
\/\s*0\.0*   # As the previous one, but with decimal
             # point and maybe some 0s after it. Accepts
             # "/0." and "/0.0" and "/0.00" etc and
             # "/ 0." and "/ 0.0" and "/ 0.00" etc.
Exercise
Previously your program counted
non-empty lines. Alter it so that instead of counting non-empty lines it counts
only lines with
- the letter x
- the string the
- the string the which may or may not have a capital t
- the word the with or without a capital. Use \b to detect word boundaries.
In each case the program should
print out every line, but it should only number those specified. Try to use the
$_ variable to avoid using the =~ match operator explicitly.
Substitution and translation
As well as identifying regular expressions Perl can make substitutions based on those matches. The way to do this is to use the s function which is designed to mimic the way substitution is done in the vi text editor. Once again the match operator is used, and once again if it is omitted then the substitution is assumed to take place with the $_ variable.
To replace an occurrence of london by London in the string $sentence we use the expression
$sentence =~ s/london/London/
and to do the same thing with the $_ variable just
s/london/London/
Notice that the pattern (london) and the replacement (London) are delimited by a total of three slashes. The result of this expression is the number of substitutions made, so it is either 0 (false) or 1 (true) in this case.
Options
This example only replaces the first occurrence of the string, and it may be that there will be more than one such string we want to replace. To make a global substitution the last slash is followed by a g as follows:
s/london/London/g
which of course works on the $_ variable. Again the expression returns the number of substitutions made, which is 0 (false) or something greater than 0 (true).
If we want to also replace occurrences of lOndon, lonDON, LoNDoN and so on then we could use
s/[Ll][Oo][Nn][Dd][Oo][Nn]/London/g
but an easier way is to use the i option (for "ignore case"). The expression
s/london/London/gi
will make a global substitution ignoring case. The i option is also used in the basic /.../ regular expression match.
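The difference between the plain, g and gi forms is easiest to see side by side. The following is a minimal sketch; the sample sentence and the variable names are made up, and each substitution works on its own copy so that the original string stays untouched.

```perl
use strict;
use warnings;

my $sentence = "london to london via lonDON";

# Without options only the first match is replaced.
(my $first_only = $sentence) =~ s/london/London/;

# g replaces every match, but the match is still case sensitive.
(my $global = $sentence) =~ s/london/London/g;

# gi replaces every match regardless of case; the return value
# is the number of substitutions made.
my $all   = $sentence;
my $count = ($all =~ s/london/London/gi);

print "$first_only\n";
print "$global\n";
print "$all ($count substitutions)\n";
```

The (my $copy = $original) =~ s/.../.../ form copies the string first and then substitutes in the copy, which is handy when the original is still needed.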
Remembering patterns
It's often useful to remember patterns that have been matched so that they can be used again. It just so happens that anything matched in parentheses gets remembered in the variables $1,...,$9. These strings can also be used in the same regular expression (or substitution) by using the special RE codes \1,...,\9. For example
$_ = "Lord Whopper of Fibbing";
s/([A-Z])/:\1:/g;
print "$_\n";
will replace each upper case letter by that letter surrounded by colons. It will print :L:ord :W:hopper of :F:ibbing. (In the replacement part of a substitution, current versions of Perl prefer $1 to \1 and warn about \1 when warnings are enabled.) The variables $1,...,$9 are read-only variables; you cannot alter them yourself.
As another example, the test
if (/(\b.+\b) \1/)
{
    print "Found $1 repeated\n";
}
will identify any words repeated. Each \b represents a word boundary and the .+ matches any non-empty string, so \b.+\b matches anything between two word boundaries. This is then remembered by the parentheses and stored as \1 for regular expressions and as $1 for the rest of the program.
The following swaps the first and last characters of a line in the $_ variable:
s/^(.)(.*)(.)$/\3\2\1/
The ^ and $ match the beginning and end of the line. The \1 code stores the first character; the \2 code stores everything else up to the last character, which is stored in the \3 code. Then that whole line is replaced with \1 and \3 swapped round.
After a match, you can use the special read-only variables $` and $& and $' to find what was matched before, during and after the search. So after
$_ = "Lord Whopper of Fibbing";
/pp/;
all of the following are true. (Remember that eq is the string-equality test.)
$` eq "Lord Who";
$& eq "pp";
$' eq "er of Fibbing";
Finally on the subject of remembering patterns it's worth knowing that inside of the slashes of a match or a substitution variables are interpolated. So
$search = "the";
s/$search/xxx/g;
will replace every occurrence of the with xxx. If you want to replace every occurrence of there then you cannot do s/$searchre/xxx/ because this will be interpolated as the variable $searchre. Instead you should put the variable name in curly braces so that the code becomes
$search = "the";
s/${search}re/xxx/;
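The remembered-pattern features can be tried out together in one short script. This is a sketch under a few assumptions: the sample strings other than "Lord Whopper of Fibbing" are invented, the replacement side uses $1 rather than \1 so that it runs cleanly with warnings enabled, and the repeated-word test uses \w+ instead of .+ so that only single words are compared.

```perl
use strict;
use warnings;

# Surround each upper case letter with colons, using $1 in the replacement.
my $title = "Lord Whopper of Fibbing";
(my $marked = $title) =~ s/([A-Z])/:$1:/g;
print "$marked\n";

# \1 inside the pattern refers back to what group 1 matched.
my $line = "it was the the best of times";
my $repeated = ($line =~ /\b(\w+) \1\b/) ? $1 : "";
print "Found $repeated repeated\n" if $repeated ne "";

# Swap the first and last characters of a string.
(my $swapped = "abcde") =~ s/^(.)(.*)(.)$/$3$2$1/;
print "$swapped\n";

# After a successful match, $`, $& and $' hold the text before,
# inside and after the match.
my ($before, $matched, $after) = ("", "", "");
if ($title =~ /pp/) {
    ($before, $matched, $after) = ($`, $&, $');
}
print "<$before|$matched|$after>\n";

# Curly braces separate an interpolated variable name from following text.
my $search = "the";
(my $masked = "the theatre is there") =~ s/${search}re/xxx/g;
print "$masked\n";
```

The last substitution only touches there: theatre contains the but not there, so it is left alone.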
Translation
The tr function allows character-by-character translation. The following expression replaces each a with e, each b with d, and each c with f in the variable $sentence. The expression returns the number of substitutions made.
$sentence =~ tr/abc/edf/
Most of the special RE codes do not apply in the tr function. For example, the statement here counts the number of asterisks in the $sentence variable and stores that in the $count variable.
$count = ($sentence =~ tr/*/*/);
However, the dash is still used to mean "between". This statement converts $_ to upper case.
tr/a-z/A-Z/;
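The three uses of tr described above can be compared on one string. This is only an illustrative sketch; the sample string and variable names are assumptions.

```perl
use strict;
use warnings;

my $sentence = "a*b*c ** cab";

# Replacing * with itself leaves the string alone but returns the count.
my $stars = ($sentence =~ tr/*/*/);

# a becomes e, b becomes d, c becomes f, one character at a time.
(my $translated = $sentence) =~ tr/abc/edf/;

# Ranges work in tr: convert lower case to upper case.
(my $upper = $sentence) =~ tr/a-z/A-Z/;

print "$stars asterisks\n";
print "$translated\n";
print "$upper\n";
```

Because * has no special meaning inside tr, it does not need a backslash here, unlike in a regular expression.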
Exercise
Your current program should count lines of a file which contain a certain string. Modify it so that it counts lines with double letters (or any other double character). Modify it again so that these double letters appear also in parentheses. For example your program would produce a line like this among others:
023 Amp, James Wa(tt), Bob Transformer, etc. These pion(ee)rs conducted many
Try to get it so that all pairs of letters are in parentheses, not just the first pair on each line.
For a slightly more interesting program you might like to try the following. Suppose your program is called countlines. Then you would call it with
./countlines
However, if you call it with several arguments, as in
./countlines first second etc
then those arguments are stored in the array @ARGV. In the above example we have $ARGV[0] is first and $ARGV[1] is second and $ARGV[2] is etc. Modify your program so that it accepts one argument and counts only those lines with that string. It should also put occurrences of this string in parentheses. So
./countlines the
will output something like this line among others:
019 But (the) greatest Electrical Pioneer of (the)m all was Thomas Edison, who
Split
A very useful function in Perl is split, which splits up a string and places it into an array. The function uses a regular expression and as usual works on the $_ variable unless otherwise specified.
The split function is used like this:
$info = "Caine:Michael:Actor:14, Leafy Drive";
@personal = split(/:/, $info);
which has the same overall effect as
@personal = ("Caine", "Michael", "Actor", "14, Leafy Drive");
If we have the information stored in the $_ variable then we can just use this instead
@personal = split(/:/);
If the fields are divided by any number of colons then we can use the RE codes to get round this. The code
$_ = "Capes:Geoff::Shot putter:::Big Avenue";
@personal = split(/:+/);
is the same as
@personal = ("Capes", "Geoff", "Shot putter", "Big Avenue");
But this:
$_ = "Capes:Geoff::Shot putter:::Big Avenue";
@personal = split(/:/);
would be like
@personal = ("Capes", "Geoff", "", "Shot putter", "", "", "Big Avenue");
A word can be split into characters, a sentence split into words and a paragraph split into sentences:
@chars = split(//, $word);
@words = split(/ /, $sentence);
@sentences = split(/\./, $paragraph);
In the first case the null string is matched between each character, and that is why the @chars array is an array of characters, i.e. an array of strings of length 1.
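A short script makes the effect of the different delimiters visible by counting the fields each one produces. It is a sketch only: the record is the one from the text, while the array names and the word "Perl" are assumptions.

```perl
use strict;
use warnings;

my $record = "Capes:Geoff::Shot putter:::Big Avenue";

# A single colon as delimiter keeps the empty fields between repeated colons.
my @with_empties = split(/:/, $record);

# One or more colons as delimiter collapses each run into one separator.
my @collapsed = split(/:+/, $record);

# An empty pattern splits between every character.
my @chars = split(//, "Perl");

print scalar(@with_empties), " fields with /:/\n";
print scalar(@collapsed), " fields with /:+/\n";
print join("-", @chars), "\n";
```

Here the string is passed to split explicitly; leaving the second argument out would make split work on $_ instead, as described above.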
Exercise
A useful tool in natural language processing is concordance. This allows a specific string to be displayed in its immediate context wherever it appears in a text. For example, a concordance program identifying the target string the might produce some of the following output. Notice how the occurrences of the target string line up vertically.
discovered (this is the truth) that when he
t kinds of metal to the leg of a frog, an e
rrent developed and the frog's leg kicked,
 longer attached to the frog, which was dea
normous advances in the field of amphibian
ch it hop back into the pond -- almost.  Bu
ond -- almost.  But the greatest Electrical
ectrical Pioneer of them all was Thomas Edi
This exercise is to write such a program. Here are some tips:
- Read the entire file into an array (this obviously isn't useful in general because the file may be extremely large, but we won't worry about that here). Each item in the array will be a line of the file.
- When the chop function is used on an array it chops off the last character of every item in the array.
- Recall that you can join the whole array together with a statement like $text = "@lines";
- Use the target string as delimiter for splitting the text. (Ie, use the target string in place of the colon in our previous examples.) You should then have an array of all the strings between the target strings.
- For each array element in turn, print it out, print the target string, and then print the next array element.
- Recall that the last element of an array @food has index $#food.
Associative arrays
Ordinary list arrays allow us to access their elements by number. The first element of array @food is $food[0]. The second element is $food[1], and so on. But Perl also allows us to create arrays which are accessed by string. These are called associative arrays.
To define an associative array we use the usual parenthesis notation, but the array itself is prefixed by a % sign. Suppose we want to create an array of people and their ages. It would look like this:
%ages = ("Michael Caine", 39,
         "Dirty Den", 34,
         "Angie", 27,
         "Willy", "21 in dog years",
         "The Queen Mother", 108);
Now we can find the age of people with the following expressions
$ages{"Michael Caine"};    # Returns 39
$ages{"Dirty Den"};        # Returns 34
$ages{"Angie"};            # Returns 27
$ages{"Willy"};            # Returns "21 in dog years"
$ages{"The Queen Mother"}; # Returns 108
Notice that like list arrays each % sign has changed to a $ to access an individual element because that element is a scalar. Unlike list arrays the index (in this case the person's name) is enclosed in curly braces, the idea being that associative arrays are fancier than list arrays.
An associative array can be converted back into a list array just by assigning it to a list array variable. A list array can be converted into an associative array by assigning it to an associative array variable. Ideally the list array will have an even number of elements:
@info = %ages;      # @info is a list array. It
                    # now has 10 elements
$info[5];           # Returns 27 only if the pairs come out
                    # in the order they were entered, which
                    # Perl does not guarantee
%moreages = @info;  # %moreages is an associative
                    # array. It is the same as %ages
Operators
Associative arrays do not have any order to their elements (they are just like hash tables) but it is possible to access all the elements in turn using the keys function and the values function:
foreach $person (keys %ages)
{
    print "I know the age of $person\n";
}
foreach $age (values %ages)
{
    print "Somebody is $age\n";
}
When keys is called it returns a list of the keys (indices) of the associative array. When values is called it returns a list of the values of the array. These functions return their lists in the same order, but this order has nothing to do with the order in which the elements have been entered. When keys and values are called in a scalar context they return the number of key/value pairs in the associative array.
There is also a function each which returns a two element list of a key and its value. Every time each is called it returns another key/value pair:
while (($person, $age) = each(%ages)) { print "$person is $age\n"; }
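The lookups and operators above can be combined in a small script. This is a sketch, assuming a shortened version of the %ages array; the names $den, $pairs, @report and %copy are invented for the example, and the keys are sorted only to make the output predictable.

```perl
use strict;
use warnings;

my %ages = ("Michael Caine", 39, "Dirty Den", 34, "Angie", 27);

# Look up a single value: the % becomes $ and the key goes in curly braces.
my $den = $ages{"Dirty Den"};
print "Dirty Den is $den\n";

# In scalar context keys gives the number of key/value pairs.
my $pairs = keys %ages;
print "$pairs people known\n";

# The order of keys is not predictable, so sort them for stable output.
my @report;
for my $person (sort keys %ages) {
    push @report, "$person is $ages{$person}";
}
print "$_\n" for @report;

# Flattening to a list gives key, value, key, value, ... in some order;
# assigning that list to another associative array rebuilds the pairs.
my @info = %ages;
my %copy = @info;
print "The copy agrees about Angie\n" if $copy{"Angie"} == $ages{"Angie"};
```

Sorting the keys is a design choice for the example, not something the associative array does for you.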
Environment variables
When you run a perl program, or any script in UNIX, there will be certain environment variables set. These will be things like USER which contains your username and DISPLAY which specifies which screen your graphics will go to. When you run a perl CGI script on the World Wide Web there are environment variables which hold other useful information. All these variables and their values are stored in the associative %ENV array in which the keys are the variable names. Try the following in a perl program:
print "You are called $ENV{'USER'} and you are ";
print "using display $ENV{'DISPLAY'}\n";
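On some systems USER or DISPLAY may not be set, in which case printing them directly produces an empty value (and a warning if warnings are enabled). The following sketch guards against that; the fallback strings and the variable names $user, $display and @names are assumptions.

```perl
use strict;
use warnings;

# Fall back to a placeholder when a variable is not set.
my $user    = defined $ENV{'USER'}    ? $ENV{'USER'}    : "(unknown)";
my $display = defined $ENV{'DISPLAY'} ? $ENV{'DISPLAY'} : "(none)";
print "You are called $user and you are using display $display\n";

# %ENV is an ordinary associative array, so keys works on it too.
# Print up to five variable names in sorted order.
my @names = sort keys %ENV;
my $last  = $#names < 4 ? $#names : 4;
print "$_\n" for @names[0 .. $last];
```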
Subroutines
Like any good programming language Perl allows the user to define their own functions, called subroutines. They may be placed anywhere in your program but it's probably best to put them all at the beginning or all at the end. A subroutine has the form
sub mysubroutine
{
    print "Not a very interesting routine\n";
    print "This does the same thing every time\n";
}
regardless of any parameters that we may want to pass to it. All of the following will work to call this subroutine. Notice that a subroutine is called with an & character in front of the name:
&mysubroutine;            # Call the subroutine
&mysubroutine($_);        # Call it with a parameter
&mysubroutine(1+2, $_);   # Call it with two parameters