Pattern matching against strings
Regular expressions are a computer science concept in which simple patterns describe the format of text. Pattern matching is the process of applying these patterns to actual text to look for matches.
Most modern regular expression facilities are more powerful than traditional regular expressions due to the influence of languages such as Perl, but the shorthand term regex has stuck and continues to mean regular-expression-like pattern matching.
In Perl 6, although they are capable of much more than regular languages, we continue to call them regexes.
Lexical conventions
Perl 6 has special syntax for writing regexes:
m/abc/; # a regex that is immediately matched against $_
rx/abc/; # a Regex object
/abc/; # a Regex object
The first two forms also accept delimiters other than the slash:
m{abc};
rx{abc};
Note that neither the colon : nor round parentheses can be delimiters; the colon is forbidden because it clashes with adverbs, such as rx:i/abc/ (case-insensitive regexes), and round parentheses indicate a function call instead.
Whitespace in regexes is generally ignored (except with the :s or :sigspace adverb).
As in the rest of Perl 6, comments in regexes start with a hash character # and go to the end of the current line.
Literals
The simplest case of a regex is a constant string. Matching a string against that regex searches for that string:
if 'properly' ~~ m/ perl / {
    say "'properly' contains 'perl'";
}
Alphanumeric characters and the underscore _ are literal matches. All other characters must either be escaped with a backslash (for example \: to match a colon), or included in quotes:
/ 'two words' / # matches 'two words' including the blank
/ "a:b" / # matches 'a:b' including the colon
/ '#' / # matches a hash character
Strings are searched from left to right, so it is enough if a substring matches the regex:
if 'abcdef' ~~ / de / {
    say ~$/; # de
    say $/.prematch; # abc
    say $/.postmatch; # f
    say $/.from; # 3
    say $/.to; # 5
};
Match results are stored in the $/ variable and are also returned from the match. The result is of type Match if the match was successful; otherwise it is Nil.
Wildcards and character classes
Dot to match any character
An unescaped dot . in a regex matches any single character.
So these all match:
'perl' ~~ /per./; # matches the whole string
'perl' ~~ / per . /; # the same; whitespace is ignored
'perl' ~~ / pe.l /; # the . matches the r
'speller' ~~ / pe.l/; # the . matches the first l
This does not match:
'perl' ~~ /. per /
because there is no character before per in the target string.
Backslashed, predefined character classes
There are predefined character classes of the form \w. Its negation is written with an upper-case letter, \W.
- \d and \D
\d matches a single digit (Unicode property Nd, decimal number) and \D matches a single character that is not a digit.
'ab42' ~~ /\d/ and say ~$/; # 4
'ab42' ~~ /\D/ and say ~$/; # a
Note that \d matches not only the ASCII digits 0 to 9, but also digits from other scripts.
Examples for digits are:
U+0035 5 DIGIT FIVE
U+07C2 ߂ NKO DIGIT TWO
U+0E53 ๓ THAI DIGIT THREE
U+1B56 ᭖ BALINESE DIGIT SIX
- \h and \H
\h matches a single horizontal whitespace character. \H matches a single character that is not a horizontal whitespace character.
Examples for horizontal whitespace characters are:
U+0020 SPACE
U+00A0 NO-BREAK SPACE
U+0009 CHARACTER TABULATION
U+2001 EM QUAD
Vertical whitespace, such as newline characters, is excluded; it can be matched with \v, and \s matches any kind of whitespace.
- \n and \N
\n matches a single, logical newline character. \n is supposed to also match a Windows CR LF codepoint pair, though it is unclear whether the magic happens at the time that external data is read, or at regex match time. \N matches a single character that's not a logical newline.
- \s and \S
\s matches a single whitespace character. \S matches a single character that is not whitespace.
if 'contains a word starting with "w"' ~~ / w \S+ / {
    say ~$/; # word
}
- \t and \T
\t matches a single tab/tabulation character, U+0009. (Note that exotic tabs like the U+000B VERTICAL TABULATION character are not included here.) \T matches a single character that is not a tab.
- \v and \V
\v matches a single vertical whitespace character. \V matches a single character that is not vertical whitespace.
Examples for vertical whitespace characters:
U+000A LINE FEED
U+000B VERTICAL TABULATION
U+000D CARRIAGE RETURN
U+0085 NEXT LINE
U+2029 PARAGRAPH SEPARATOR
Use \s to match any kind of whitespace, not just vertical whitespace.
- \w and \W
\w matches a single word character, i.e. a letter (Unicode category L), a digit or an underscore. \W matches a single character that isn't a word character.
Examples of word characters:
U+0041 A LATIN CAPITAL LETTER A
U+0031 1 DIGIT ONE
U+03B4 δ GREEK SMALL LETTER DELTA
U+03F3 ϳ GREEK LETTER YOT
U+0409 Љ CYRILLIC CAPITAL LETTER LJE
Unicode properties
The character classes so far are mostly for convenience; a more systematic approach is the use of Unicode properties. They are called in the form <:property>, where property can be a short or long Unicode property name.
The following list is stolen from the Perl 5 perlunicode documentation:
Short | Long |
---|---|
L | Letter |
LC | Cased_Letter |
Lu | Uppercase_Letter |
Ll | Lowercase_Letter |
Lt | Titlecase_Letter |
Lm | Modifier_Letter |
Lo | Other_Letter |
M | Mark |
Mn | Nonspacing_Mark |
Mc | Spacing_Mark |
Me | Enclosing_Mark |
N | Number |
Nd | Decimal_Number (also Digit) |
Nl | Letter_Number |
No | Other_Number |
P | Punctuation (also Punct) |
Pc | Connector_Punctuation |
Pd | Dash_Punctuation |
Ps | Open_Punctuation |
Pe | Close_Punctuation |
Pi | Initial_Punctuation (may behave like Ps or Pe depending on usage) |
Pf | Final_Punctuation (may behave like Ps or Pe depending on usage) |
Po | Other_Punctuation |
S | Symbol |
Sm | Math_Symbol |
Sc | Currency_Symbol |
Sk | Modifier_Symbol |
So | Other_Symbol |
Z | Separator |
Zs | Space_Separator |
Zl | Line_Separator |
Zp | Paragraph_Separator |
C | Other |
Cc | Control (also Cntrl) |
Cf | Format |
Cs | Surrogate |
Co | Private_Use |
Cn | Unassigned |
For example, <:Lu> matches a single upper-case letter.
Negation works as <:!category>, so <:!Lu> matches a single character that isn't an upper-case letter.
Several categories can be combined with one of these infix operators:
Operator | Meaning |
---|---|
+ | set union |
| | set union |
& | set intersection |
- | set difference (first minus second) |
^ | symmetric set difference / XOR |
To match either a lower-case letter or a number, write <:Ll+:N> or <:Ll+:Number> or <+ :Lowercase_Letter + :Number>.
It is also possible to group categories and sets of categories with parentheses, e.g.:
'perl6' ~~ m{\w+(<:Ll+:N>)} # 0 => 「6」
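The property syntax described above can be tried out directly. The following is a minimal sketch; the sample strings are made up for illustration and are not part of the original text.

```perl6
# Sample strings are illustrative assumptions.
say so 'A' ~~ /<:Lu>/;                 # True: A is an upper-case letter
say so 'a' ~~ /<:Lu>/;                 # False
say ~('Hello' ~~ /<:!Lu>/);            # e (first character that is not upper-case)
say ~('perl6 rocks' ~~ /<:Ll+:N>+/);   # perl6 (lower-case letters or numbers)
```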
Enumerated character classes and ranges
Sometimes the pre-existing wildcards and character classes are not enough. Fortunately, defining your own is fairly simple. Between <[ ]>, you can put any number of single characters and ranges of characters (expressed with two dots between the end points), with or without whitespace.
"abacabadabacaba" ~~ / <[ a .. c 1 2 3 ]> /
Between the < > you can also use the same operators as for categories (+, |, &, -, ^) to combine multiple range definitions and even mix in some of the Unicode categories above. You are also allowed to write the backslashed forms for character classes between the [ ].
/ <[\d] - [13579]> /
# not quite the same as
/ <[02468]> /
# because the first one also contains "weird" unicodey digits
To negate a character class, put a - after the opening angle:
say 'no quotes' ~~ / <-[ " ]> + /; # matches characters except "
say '"in quotes"' ~~ / '"' <-[ " ]> * '"'/;
The quantifiers * and + in the examples above are explained in the section Quantifiers.
Just as you can use the - for both set difference and negation of a single value, you can also explicitly put a + in front:
/ <+[123]> / # same as <[123]>
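To see an enumerated class, its negation and a set difference side by side, here is a short sketch. The input string is an assumption chosen for illustration.

```perl6
my $s = 'abc123';                      # illustrative input
say ~($s ~~ / <[ a..c ]>+ /);          # abc
say ~($s ~~ / <-[ a..c ]>+ /);         # 123
say ~($s ~~ / <[\d] - [13579]>+ /);    # 2 (digits, minus the odd ASCII digits)
```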
Quantifiers
A quantifier makes a preceding atom match not exactly once, but rather a variable number of times. For example, a+ matches one or more a characters.
Quantifiers bind tighter than concatenation, so ab+ matches one a followed by one or more bs. This is different for quotes, so 'ab'+ matches the strings ab, abab, ababab etc.
One or more: +
The + quantifier makes the preceding atom match one or more times, with no upper limit.
For example, to match strings of the form key=value, you can write a regex like this:
/ \w+ '=' \w+ /
Zero or more: *
The * quantifier makes the preceding atom match zero or more times, with no upper limit.
For example, to allow optional whitespace between a and b you can write
/ a \s* b /
Zero or one match: ?
The ? quantifier makes the preceding atom match zero times or once.
General quantifier: ** min..max
To quantify an atom an arbitrary number of times, you can write e.g. a ** 2..5 to match the character a at least twice and at most 5 times.
say Bool('a' ~~ /a ** 2..5/); #-> False
say Bool('aaa' ~~ /a ** 2..5/); #-> True
If the minimal and maximal number of matches are the same, a single integer is enough: a ** 5 matches a exactly five times.
say Bool('aaaaa' ~~ /a ** 5/); #-> True
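The quantifiers above can be compared on small inputs. This is only a sketch; the strings are invented for illustration.

```perl6
say ~('aaaaaaa' ~~ / a ** 2..5 /);         # aaaaa (greedy, but at most five)
say ~('xab' ~~ / a? b /);                  # ab
say ~('xb' ~~ / a? b /);                   # b (the a is optional)
say ~('key=value' ~~ / \w+ '=' \w+ /);     # key=value
```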
Modified quantifier: %
To more easily match things like comma-separated values, you can tack on a % modifier to any of the above quantifiers to specify a separator that must occur between each of the matches. So, for example, a+ % ',' will match a or a,a or a,a,a and so on, but it will not match a, or a,a,. To match those as well, you may use %% instead of %.
Alternation
To match one of several possible alternatives, separate them by ||; the first matching alternative wins.
For example, ini files have the following form:
[section]
key = value
Hence, if you parse a single line of an ini file, it can be either a section or a key-value pair and the regex would be (to a first approximation):
/ '[' \w+ ']' || \S+ \s* '=' \s* \S* /
That is, either a word surrounded by square brackets, or a string of non-whitespace characters, followed by optional whitespace, followed by the equals sign =, followed again by optional whitespace, followed by another string of non-whitespace characters.
Anchors
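The regex engine tries to find a match inside a string by searching from left to right.
say so 'properly' ~~ / perl/; # True
#          ^^^^
Sometimes this is not what you want; you may want to match the whole string, a whole line, or one or several whole words instead. Anchors, also called assertions, help with that by limiting where a match may occur.
Anchors need to match successfully in order for the whole regex to match but they do not use up characters while matching.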
^, Start of String
The ^ assertion only matches at the start of the string.
say so 'properly' ~~ /perl/; # True
say so 'properly' ~~ /^ perl/; # False
say so 'perly' ~~ /^ perl/; # True
say so 'perl' ~~ /^ perl/; # True
^^, Start of Line and $$, End of Line
The ^^ assertion matches at the start of a logical line. That is, either at the start of the string, or after a newline character.
$$ matches only at the end of a logical line, that is, before a newline character, or at the end of the string when the last character is not a newline character.
(To understand the following example, it is important to know that the q:to/EOS/...EOS "heredoc" syntax removes leading indentation to the same level as the EOS marker, so that the first, second and last lines have no leading space and the third and fourth lines have two leading spaces each.)
my $str = q:to/EOS/;
There was a young man of Japan
Whose limericks never would scan.
  When asked why this was,
  He replied "It's because
I always try to fit as many syllables into the last line as ever I possibly can."
EOS
say so $str ~~ /^^ There/; # True (start of string)
say so $str ~~ /^^ limericks/; # False (not at the start of a line)
say so $str ~~ /^^ I/; # True (start of the last line)
say so $str ~~ /^^ When/; # False (there are blanks between
# start of line and the "When")
say so $str ~~ / Japan $$/; # True (end of first line)
say so $str ~~ / scan $$/; # False (there is a . between "scan"
# and the end of line)
say so $str ~~ / '."' $$/; # True (at the last line)
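The difference between the string anchor ^ and the line anchors ^^ and $$ shows up as soon as the input has more than one line. A minimal sketch, with a made-up two-line string:

```perl6
my $text = "one\ntwo";         # illustrative input: two lines, no trailing newline
say so $text ~~ /^ two/;       # False: "two" is not at the start of the string
say so $text ~~ /^^ two/;      # True: "two" starts the second line
say so $text ~~ /one $$/;      # True: "one" ends the first line
```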
<< and >>, left and right word boundary
<< matches a left word boundary: it matches positions where there is a non-word character at the left (or the start of the string) and a word character to the right.
>> matches a right word boundary: it matches positions where there is a word character at the left and a non-word character at the right (or the end of the string).
my $str = 'The quick brown fox';
say so $str ~~ /br/; # True
say so $str ~~ /<< br/; # True
say so $str ~~ /br >>/; # False
say so $str ~~ /own/; # True
say so $str ~~ /<< own/; # False
say so $str ~~ /own >>/; # True
Grouping and Capturing
In regular (non-regex) Perl 6, you can use parentheses to group things together, usually to override operator precedence:
say 1 + 4 * 2; # 9, because it is parsed as 1 + (4 * 2)
say (1 + 4) * 2; # 10
The same grouping facility is available in regexes:
/ a || b c / # matches 'a' or 'bc'
/ ( a || b ) c / # matches 'ac' or 'bc'
The same grouping applies to quantifiers:
/ a b+ / # Matches an 'a' followed by one or more 'b's
/ (a b)+ / # Matches one or more sequences of 'ab'
/ (a || b)+ / # Matches a sequence of 'a's and 'b's, at least one long
An unquantified capture produces a Match object. When a capture is quantified (except with the ? quantifier) the capture becomes a list of Match objects instead.
Capturing
The round parentheses don't just group, they also capture; that is, they make the string matched within the group available as a variable, and also as an element of the resulting Match object:
my $str = 'number 42';
if $str ~~ /'number ' (\d+) / {
    say "The number is $0"; # The number is 42
    # or
    say "The number is $/[0]"; # The number is 42
}
Pairs of parentheses are numbered from left to right, starting from zero:
if 'abc' ~~ /(a) b (c)/ {
    say "0: $0; 1: $1"; # 0: a; 1: c
}
The $0 and $1 etc. syntax is actually just a shorthand; these captures are canonically available from the match object $/ by using it as a list, so $0 is actually syntax sugar for $/[0].
Coercing the match object to a list gives an easy way to programmatically access all elements:
if 'abc' ~~ /(a) b (c)/ {
    say $/.list.join: ', ' # a, c
}
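A quantified capture behaves differently from the single captures shown so far: it collects one Match object per repetition. A small sketch, using an invented input string:

```perl6
if 'abc' ~~ / (\w)+ / {
    say $0.elems;         # 3 (one Match per repetition)
    say $0.join(', ');    # a, b, c
}
```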
Non-capturing grouping
The parentheses in regexes perform a double role: they group the regex elements inside and they capture what is matched by the sub-regex inside.
To get only the grouping behavior, you can use square brackets [ ... ] instead.
if 'abc' ~~ / [a||b] (c) / {
    say ~$0; # c
}
Capture numbers
It is stated above that captures are numbered from left to right. While true in principle, this is also overly simplistic.
The following rules are listed for the sake of completeness; when you find yourself using them regularly, it is worth considering named captures (and possibly subrules) instead.
Alternations reset the capture count:
/ (x) (y) || (a) (.) (.) /
# $0  $1     $0  $1  $2
For example:
if 'abc' ~~ /(x)(y) || (a)(.)(.)/ {
    say ~$1; # b
}
If two (or more) alternations have a different number of captures, the one with the most captures determines the index of the next capture:
$_ = 'abcd';
if / a [ b (.) || (x) (y) ] (.) / {
    #      $0     $0  $1    $2
    say ~$2; # d
}
Captures can be nested, in which case they are numbered per level:
if 'abc' ~~ / ( a (.) (.) ) / {
    say "Outer: $0"; # Outer: abc
    say "Inner: $0[0] and $0[1]"; # Inner: b and c
}
Named captures
Instead of numbering captures, you can also give them names. The generic -- and slightly verbose -- way of naming captures is like this:
if 'abc' ~~ / $<myname> = [ \w+ ] / {
    say ~$<myname> # abc
}
The access to the named capture, $<myname>, is a shorthand for indexing the match object as a hash, in other words: $/{ 'myname' } or $/<myname>.
Coercing the match object to a hash gives you easy programmatic access to all named captures:
if 'count=23' ~~ / $<variable>=\w+ '=' $<value>=\w+ / {
    my %h = $/.hash;
    say %h.keys.sort.join: ', '; # value, variable
    say %h.values.sort.join: ', '; # 23, count
    for %h.kv -> $k, $v {
        say "Found value '$v' with key '$k'";
        # outputs two lines:
        # Found value 'count' with key 'variable'
        # Found value '23' with key 'value'
    }
}
Subrules
Just like you can put pieces of code into subroutines, you can also put pieces of regex into named rules.
my regex line { \N*\n }
if "abc\ndef" ~~ /<line> def/ {
    say "First line: ", $<line>.chomp; # First line: abc
}
A named regex can be declared with my regex thename { body here }, and called with <thename>. At the same time, calling a named regex installs a named capture with the same name.
If the capture should be of a different name, this can be achieved with the syntax <capturename=regexname>. If no capture at all is desired, a leading dot will suppress it: <.regexname>.
Here is more complete (yet still fairly limited) code for parsing ini files:
my regex header { \s* '[' (\w+) ']' \h* \n+ }
my regex identifier { \w+ }
my regex kvpair { \s* <key=identifier> '=' <value=identifier> \n+ }
my regex section {
    <header>
    <kvpair>*
}
my $contents = q:to/EOI/;
    [passwords]
    jack=password1
    joy=muchmoresecure123
    [quotas]
    jack=123
    joy=42
    EOI
my %config;
if $contents ~~ /<section>*/ {
    for $<section>.list -> $section {
        my %section;
        for $section<kvpair>.list -> $p {
            say $p<value>;
            %section{ $p<key> } = ~$p<value>;
        }
        %config{ $section<header>[0] } = %section;
    }
}
say %config.perl;
# ("passwords" => {"jack" => "password1", "joy" => "muchmoresecure123"},
# "quotas" => {"jack" => "123", "joy" => "42"}).hash
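The renaming syntax <capturename=regexname> is easiest to see when the same named regex is called twice. The sketch below uses a hypothetical regex called number and the capture names width and height, none of which come from the original text.

```perl6
my regex number { \d+ }   # hypothetical helper regex
if '3x4' ~~ / <width=number> x <height=number> / {
    say ~$<width>;                        # 3
    say ~$<height>;                       # 4
    say $/.hash.keys.sort.join(', ');     # height, width
}
```

Each call stores its result under the alias, so the two matches do not overwrite each other.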
Adverbs
Adverbs modify how regexes work and give very convenient shortcuts for certain kinds of recurring tasks.
There are two kinds of adverbs: regex adverbs apply at the point where a regex is defined and matching adverbs apply at the point that a regex matches against a string.
This distinction often blurs, because matching and declaration are often textually close, but using the method form of matching makes the distinction clear.
So 'abc' ~~ /../ is roughly equivalent to 'abc'.match(/../), or even more clearly written in separate lines:
my $regex = /../; # definition
if 'abc'.match($regex) { # matching
    say "'abc' has at least two characters";
}
Regex adverbs like :i go into the definition line and matching adverbs like :overlap are appended to the match call:
my $regex = /:i . a/;
for 'baA'.match($regex, :overlap) -> $m {
    say ~$m;
}
# output:
# ba
# aA
Regex Adverbs
Adverbs that appear at the time of a regex declaration are part of the actual regex and influence how the Perl 6 compiler translates the regex into binary code.
For example, the :ignorecase (:i) adverb tells the compiler to ignore the distinction between upper case, lower case and title case letters.
So 'a' ~~ /A/ is false, but 'a' ~~ /:i A/ is a successful match.
Regex adverbs can come before or inside a regex declaration and only affect the part of the regex that comes afterwards, lexically.
These two regexes are equivalent:
my $rx1 = rx:i/a/; # before
my $rx2 = rx/:i a/; # inside
Whereas these two are not:
my $rx3 = rx/a :i b/; # matches only the b case insensitively
my $rx4 = rx/:i a b/; # matches completely case insensitively
Brackets and parentheses limit the scope of an adverb:
/ (:i a b) c / # matches 'ABc' but not 'ABC'
/ [:i a b] c / # matches 'ABc' but not 'ABC'
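The scoping rule can be checked with a few matches. This is a sketch with illustrative inputs only:

```perl6
say so 'ABc' ~~ / [:i a b] c /;   # True: only a and b ignore case
say so 'ABC' ~~ / [:i a b] c /;   # False: the c outside the brackets is case sensitive
say so 'ABC' ~~ / :i a b c /;     # True: :i applies to everything after it
```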
Ratchet
The :ratchet or :r adverb causes the regex engine not to backtrack.
Without this adverb, parts of a regex will try different ways to match a string in order to make it possible for other parts of the regex to match. For example, in 'abc' ~~ /\w+ ./, the \w+ first eats up the whole string, abc, but then the . fails. Thus \w+ gives up a character, matching only ab, and the . can successfully match the string c. This process of giving up characters (or in the case of alternations, trying a different branch) is known as backtracking.
say so 'abc' ~~ / \w+ . /; # True
say so 'abc' ~~ / :r \w+ . /; # False
Ratcheting can be an optimization, because backtracking is costly. More importantly, it corresponds closely to how humans parse a text. If you have the regexes my regex identifier { \w+ } and my regex keyword { if | else | endif }, you intuitively expect the identifier to gobble up a whole word and not have it give up its end to the next rule, if the next rule otherwise fails. For instance, you don't expect the word motif to be parsed as the identifier mot followed by the keyword if; rather you expect motif to be parsed as one identifier and, if the parser expects an if afterwards, to have it fail rather than parse the input in a way you don't expect.
Since ratcheting behavior is so often desirable in parsers, there is a shortcut to declaring a ratcheting regex:
my token thing { ... }
# short for
my regex thing { :r ... }
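The same contrast can be shown with named regexes. In this sketch the names loose and strict are hypothetical; the two bodies are identical and only the declarator differs.

```perl6
my regex loose  { \w+ . }   # regex: may backtrack
my token strict { \w+ . }   # token: implies :ratchet
say so 'abc' ~~ / <loose> /;    # True: \w+ gives back the c for the dot
say so 'abc' ~~ / <strict> /;   # False: \w+ keeps everything, so the dot never matches
```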
Sigspace
The :sigspace or :s adverb makes whitespace significant in a regex.
say so "I used Photoshop®" ~~ m:i/ photo shop /; # True
say so "I used a photo shop" ~~ m:i:s/ photo shop /; # True
say so "I used Photoshop®" ~~ m:i:s/ photo shop /; # False
Here m:s/ photo shop / acts just the same as if one had written m/ photo <.ws> shop <.ws> /. By default, <.ws> makes sure that words are separated, so a b and ^& will match <.ws> in the middle, but ab won't.
Where whitespace in a regex turns into <.ws> depends on what comes before the whitespace. In the above example, whitespace at the beginning of a regex doesn't turn into <.ws>, but whitespace after characters does. In general, the rule is that if a term might match something, whitespace after it will turn into <.ws>.
In addition, if whitespace comes after a term, but before a quantifier (+, *, or ?), <.ws> will be matched after every match of the term, so foo + becomes [ foo <.ws> ]+. On the other hand, whitespace after a quantifier acts as normal significant whitespace, e.g., "foo+ " becomes foo+ <.ws>.
In all, this code:
rx :s {
    ^^
    {
        say "No sigspace after this";
    }
    <.assertion_and_then_ws>
    characters_with_ws_after+
    ws_separated_characters *
    [
    | some "stuff" .. .
    | $$
    ]
    :my $foo = "no ws after this";
    $foo
}
becomes:
rx {
    ^^ <.ws>
    {
        say "No space after this";
    }
    <.assertion_and_then_ws> <.ws>
    characters_with_ws_after+ <.ws>
    [ws_separated_characters <.ws>]* <.ws>
    [
    | some <.ws> "stuff" <.ws> .. <.ws> . <.ws>
    | $$ <.ws>
    ] <.ws>
    :my $foo = "no ws after this";
    $foo <.ws>
}
If a regex is declared with the rule keyword, both the :sigspace and :ratchet adverbs are implied.
Grammars provide an easy way to override what <.ws> matches:
grammar Demo {
    token ws {
        <!ww> # only match when not within a word
        \h* # only match horizontal whitespace
    }
    rule TOP { # called by Demo.parse;
        a b '.'
    }
}
# doesn't parse, whitespace required between a and b
say so Demo.parse("ab."); # False
say so Demo.parse("a b."); # True
say so Demo.parse("a\tb ."); # True
# \n is vertical whitespace, so no match
say so Demo.parse("a\tb\n."); # False
When parsing file formats in which some whitespace (for example, vertical whitespace) is significant, it is advisable to override ws.
Matching adverbs
In contrast to regex adverbs, which are tied to the declaration of a regex, matching adverbs only make sense while matching a string against a regex.
They can never appear inside a regex, only on the outside -- either as part of an m/.../ match or as arguments to a match method.
Continue
The :continue or short :c adverb takes an argument. The argument is the position where the regex should start to search. By default, it searches from the start of the string, but :c overrides that. If no position is specified for :c it will default to 0 unless $/ is set, in which case it defaults to $/.to.
given 'a1xa2' {
    say ~m/a./; # a1
    say ~m:c(2)/a./; # a2
}
Exhaustive
To find all possible matches of a regex -- including overlapping ones and several that start at the same position -- use the :exhaustive (short :ex) adverb.
given 'abracadabra' {
    for m:exhaustive/ a .* a / -> $match {
        say ' ' x $match.from, ~$match;
    }
}
The above code produces this output:
abracadabra
abracada
abraca
abra
   acadabra
   acada
   aca
     adabra
     ada
       abra
Global
Instead of searching for just one match and returning a Match object, the :global adverb searches for every non-overlapping match and returns them in a List:
given 'several words here' {
    my @matches = m:global/\w+/;
    say @matches.elems; # 3
    say ~@matches[2]; # here
}
:g is shorthand for :global.
Pos
Anchor the match at a specific position in the string:
given 'abcdef' {
    my $match = m:pos(2)/.*/;
    say $match.from; # 2
    say ~$match; # cdef
}
:p is shorthand for :pos.
Overlap
To get several matches, including overlapping matches, but only one (the longest) from each starting position, specify the :overlap (short :ov) adverb:
given 'abracadabra' {
    for m:overlap/ a .* a / -> $match {
        say ' ' x $match.from, ~$match;
    }
}
The above code produces this output:
abracadabra
   acadabra
     adabra
       abra
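To compare :global, :overlap and :exhaustive directly, the following sketch counts the matches each adverb finds for the same regex and string used above, via the method form of matching:

```perl6
my $s = 'abracadabra';
say $s.match(/ a .* a /, :global).elems;       # 1: the greedy first match consumes the string
say $s.match(/ a .* a /, :overlap).elems;      # 4: the longest match from each start position
say $s.match(/ a .* a /, :exhaustive).elems;   # 10: every possible match
```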
Look-around assertions
Lookahead assertions
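To check that a pattern appears before another pattern, one can use a lookahead assertion via the before assertion. This has the form:
<?before pattern>
Thus, to search for the string foo which is immediately followed by the string bar, one could use the following regexp:
rx{ foo <?before bar> }
For example:
say "foobar" ~~ rx{ foo <?before bar> }; #-> foo
To search for a pattern which is not immediately followed by some other pattern, use a negative lookahead assertion, which has the form:
<!before pattern>
Hence all occurrences of foo which are not before bar would be matched by
rx{ foo <!before bar> }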
Lookbehind assertions
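To check that a pattern appears after another pattern, one can use a lookbehind assertion via the after assertion. This has the form:
<?after pattern>
Thus, to search for the string bar which is immediately preceded by the string foo, one could use the following regexp:
rx{ <?after foo> bar }
For example:
say "foobar" ~~ rx{ <?after foo> bar }; #-> bar
To search for a pattern which is not immediately preceded by some other pattern, use a negative lookbehind assertion, which has the form:
<!after pattern>
Hence all occurrences of bar which do not have foo before them would be matched by
rx{ <!after foo> bar }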
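Look-around assertions check the surrounding text without including it in the match. A brief sketch on invented inputs:

```perl6
say ~('cost: $42' ~~ / <?after '$'> \d+ /);             # 42 (the $ is checked, not captured)
say ~('42 USD, 17 EUR' ~~ / \d+ <?before ' EUR'> /);    # 17
say ~('foobaz foobar' ~~ / foo <!before bar> \w+ /);    # foobaz
```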
Best practices and gotchas
Regexes and grammars are a whole programming paradigm that you have to learn (if you don't already know it very well).
To help you write robust regexes and grammars, here are some best practices that the authors have found useful. These range from small-scale code layout issues to what actually to match, and help to avoid common pitfalls and writing unreadable code.
Code layout
Without the :sigspace adverb, whitespace is not significant in Perl 6 regexes. Use that to your own advantage and insert whitespace where it increases readability. Also insert comments where necessary.
Compare the very compact
my regex float { <[+-]>?\d*'.'\d+[e<[+-]>?\d+]? }
to the more readable
my regex float {
    <[+-]>? # optional sign
    \d* # leading digits, optional
    '.'
    \d+
    [ # optional exponent
        e <[+-]>? \d+
    ]?
}
When you use a list of alternations inside parentheses or brackets, align the vertical bars:
my regex example {
    <preamble>
    [
    || <choice_1>
    || <choice_2>
    || <choice_3>
    ]+
    <postamble>
}
Keep it small
Regexes come with very little boilerplate, so they are often more compact than regular code. Thus it is important to keep regexes short.
When you can come up with a name for a part of a regex, it is usually best to put it into a separate, named regex.
For example, you could take the float regex from earlier:
my regex float {
    <[+-]>? # optional sign
    \d* # leading digits, optional
    '.'
    \d+
    [ # optional exponent
        e <[+-]>? \d+
    ]?
}
and decompose it into parts:
my token sign { <[+-]> }
my token decimal { \d+ }
my token exponent { 'e' <sign>? <decimal> }
my regex float {
    <sign>?
    <decimal>?
    '.'
    <decimal>
    <exponent>?
}
That helps, especially when the regex becomes more complicated. For example, you might want to make the decimal point optional in the presence of an exponent:
my regex float {
    <sign>?
    [
    || <decimal>? '.' <decimal> <exponent>?
    || <decimal> <exponent>
    ]
}
What to match
Often the input data format has no clear-cut specification, or the specification is not known to the programmer. Then it is good to be liberal in what you expect, but only as long as there are no ambiguities possible.
For example, in ini files:
[section]
key=value
what can be inside the section header? Allowing only a word might be too restrictive. Somebody might write [two words], or use dashes, or so. Instead of asking what's allowed on the inside, it might be worth asking instead: what's not allowed?
Clearly, closing brackets are not allowed, because [a]b] would be rather ambiguous. By the same argument, opening brackets should be forbidden. This leaves us with
token header { '[' <-[ \[\] ]>+ ']' }
which is fine if you are only processing one line. But if you are processing a whole file, the regex suddenly also parses
[with a
newline in between]
which might not be a good idea. A compromise would be
token header { '[' <-[ \[\] \n ]>+ ']' }
Matching Whitespace
The :sigspace adverb (or using the rule declarator instead of token or regex) is very handy for implicitly parsing whitespace that can appear in many places.
Going back to the example of parsing ini files, we have
my regex kvpair { \s* <key=identifier> '=' <value=identifier> \n+ }
which is probably not as liberal as we want it to be, since the user might put spaces around the equals sign. So we may try this:
my regex kvpair { \s* <key=identifier> \s* '=' \s* <value=identifier> \n+ }
But that looks unwieldy, so we try something else:
my rule kvpair { <key=identifier> '=' <value=identifier> \n+ }
But wait! The implicit whitespace matching after the value uses up all whitespace, including newline characters, so the \n+ doesn't have anything left to match (and rule also disables backtracking, so no luck here).
Therefore it is important to redefine your definition of implicit whitespace to whitespace that is not significant in the input format.
This works by redefining the token ws; however, it only works in grammars:
grammar IniFormat {
    token ws { <!ww> \h* }
    rule header { '[' (\w+) ']' \n+ }
    token identifier { \w+ }
    rule kvpair { \s* <key=identifier> '=' <value=identifier> \n+ }
    token section {
        <header>
        <kvpair>*
    }
    token TOP {
        <section>*
    }
}
my $contents = q:to/EOI/;
    [passwords]
    jack = password1
    joy = muchmoresecure123
    [quotas]
    jack = 123
    joy = 42
    EOI
say so IniFormat.parse($contents);
Besides putting all regexes into a grammar and turning them into tokens (because they don't need to backtrack anyway), the interesting new bit is
token ws { <!ww> \h* }
which gets called for implicit whitespace parsing. It matches when it is not between two word characters (<!ww>, negated "within word" assertion), and zero or more horizontal space characters. The limitation to horizontal whitespace is important, because newlines (which are vertical whitespace) delimit records and shouldn't be matched implicitly.
Still there is some whitespace-related trouble lurking. The regex \n+ won't match a string like "\n \n", because there is a blank between the two newlines. To allow such input strings, replace \n+ by \n\s*.
ref: http://doc.perl6.org/language/regexes