Here are three situations I've seen come up where it makes sense to avoid regexps.
I've got a program that parses MIME messages. This is an email format commonly used for HTML mail, attachments, etc. Such messages have a boundary string, defined in the headers, that separates each part. The boundary from the headers is supposed to have two hyphens added to the start for use in the body.
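For illustration, here is roughly how the boundary relates to the headers and the body. This is a hypothetical sketch, not code from the parser discussed below; the header line and boundary string are made up, and the header match is deliberately simplified:

# Hypothetical example: pull the boundary string out of a Content-Type
# header and build the delimiter that appears in the body.
my $header = 'Content-Type: multipart/mixed; boundary="next_part_42"';
my ($boundary) = $header =~ /boundary="?([^";]+)"?/;
print "--$boundary\n";   # what actually separates the parts in the body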
My original code decided to be flexible in what it accepted: it used one regexp to split the message into lines, then another regexp to search for boundaries, treating the initial hyphens as optional. I found that when dealing with very large attachments it became unacceptably slow. In trying to find a fix for this I discovered that by using index() with a fixed search string and substr() instead of the regexps, I got a substantial speed increase, at the cost of not being generous about broken boundaries.
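The essence of the trick, stripped of the MIME details, looks like this. It is a toy sketch of my own, not the mimedump code: index() locates a fixed string, and the four-argument form of substr() extracts a chunk while removing it from the original string.

my $text = "alpha\n--SEP\nbravo\n--SEP\ncharlie\n";
while ((my $pos = index($text, "--SEP\n")) >= 0) {
    my $chunk = substr($text, 0, $pos, '');      # cut off everything before the separator
    substr($text, 0, length("--SEP\n"), '');     # then discard the separator itself
    print "chunk: $chunk";
}
print "rest: $text";                             # "charlie\n" is all that remains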
On a mail message with a 10MB attachment, 14MB MIME encoded, my original code takes about 5 and 1/2 hours to extract the attachment. With the new method, the code takes about 5 seconds to process the same message, on the same hardware.
This message parser is available in the scripts section of CPAN under the name mimedump. The new parsing method is the default; the old one is available as a command line option, in case someone needs to deal with mail that has broken boundaries.
Here are some code excerpts to illustrate the changes.
# Original method
while($$messref =~ s/\A(.*\n)//) {
    $line = $1;
    if($line =~ /^-*\Q$boundary\E-*\s*$/) {
        if (length($part)) {
            process(\$part);
            $part = '';
        }
    } else {
        $part .= $line;
    }
} # while messref

# New method
while(($pos = index($$messref, "\n--$boundary")) >= 0) {
    $part = substr($$messref, 0, $pos+1, '');
    $$messref =~ s/^-*\Q$boundary\E-*\s*$//;
    if (length($part)) {
        process(\$part);
        $part = '';
    }
} # while pos = index
This is also a good example of the maxim, "Fewer perl instructions (run) is the faster way." The new method is only three lines shorter, but its loop runs four times on my example message, while the original method's loop runs thousands of times.
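To put a rough number on that without a 14MB message, here is a small self-contained comparison using the standard Benchmark module. The data is my own toy construction, not the mimedump code: one fake "part" followed by one fake boundary line, scanned line-by-line with a regexp versus jumped to directly with index().

use strict;
use warnings;
use Benchmark qw(cmpthese);

my $needle = "\n--fake_boundary";
my $msg    = ("yet another line of attachment data\n" x 2_000)
           . "--fake_boundary\n";

cmpthese(-2, {
    # Peel off one line at a time and test each one against the boundary.
    regexp => sub {
        my $copy  = $msg;
        my $found = 0;
        while ($copy =~ s/\A(.*\n)//) {
            $found++ if $1 =~ /\A--fake_boundary/;
        }
    },
    # Jump straight to the boundary with index() on a fixed string.
    index => sub {
        my $copy  = $msg;
        my $found = 0;
        my $pos   = 0;
        while (($pos = index($copy, $needle, $pos)) >= 0) {
            $found++;
            $pos += length $needle;
        }
    },
});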
There are several forms of delimited text; comma-separated values is one of the more common. It seems easy enough to use split with a fixed string to parse it, until the various edge cases come up: data with commas in it, empty values, escaped quotes, embedded newlines. (A quick naive-split attempt is sketched after the example below.)
Example: This is the first field,"This, now,
is the second field",,"Field ""4"""
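Splitting that example on commas shows the problem. This is a quick sketch of my own, not code from the article:

my $line = q{This is the first field,"This, now,
is the second field",,"Field ""4"""};

my @naive = split /,/, $line;
printf "%d pieces\n", scalar @naive;    # 6 pieces, not the 4 fields intended
printf "piece[%d]: %s\n", $_ + 1, $naive[$_] for 0 .. $#naive;

The quoted comma splits the second field apart, and the quotes themselves are left in place.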
People then often turn to complicated regular expressions, and too often they break one or more of the edge cases while reinventing the wheel. The Text::ParseWords module that comes with Perl has a quotewords() function that can do the right thing. Here are two examples from Mastering Regular Expressions (Friedl, first edition).
use Text::ParseWords;   # provides quotewords()

my $line = q{This is the first field,"This, now,
is the second field",,"Field ""4"""};

# _MRE_, first ed, page 205
@fields = ();   # initialize @fields to be empty
while( $line =~ m/"([^"\\]*(\\.[^"\\]*)*)",|([^,]+),?|,/g) {
    push(@fields, defined($1) ? $1 : $3);  # add the just-matched field
}
push(@fields, undef) if $line =~ m/,$/; # account for an empty last field

# _MRE_, first ed, page 207
@columns = quotewords(',', 0, $line);

# Let's test them
printf("Original: %s\n\n", $line);
my $i;
for ($i = 0; $i < @fields; $i ++) {
    printf("fields[%d]: %s\n", $i+1, $fields[$i]);
}
print "\n";
for ($i = 0; $i < @columns; $i ++) {
    printf("columns[%d]: %s\n", $i+1, $columns[$i]);
}
print "\n";
__END__
Original: This is the first field,"This, now,
is the second field",,"Field ""4"""

fields[1]: This is the first field
fields[2]: This, now,
is the second field
fields[3]: 
fields[4]: "Field ""4"""

columns[1]: This is the first field
columns[2]: This, now,
is the second field
columns[3]: 
columns[4]: Field "4"
Oops, Friedl's method doesn't deal with doubling quotes to escape them. It is also easy to see that the regexp way is going to be a lot harder to maintain.
words : sentences :: regexps : grammars
Sometimes you just need a lot more than a regular expression can really provide. The famous example is that regular expressions cannot match an arbitrary depth of balanced parentheses. (Strictly speaking, what perl calls "regular expressions" is more powerful, but the hoops one needs to jump through to do this make it a parlor trick, not something sane to use in real code; a sketch follows below.) A grammar can do that easily. Often the task is divided into two parts, a "lexer" that finds words, and a "grammar" that puts them together.
Lexers are sometimes written with regexps; traditionally this is very common in Unix. The perl parser itself is built out of a hand-coded lexer, toke.c, and a yacc-built parser, perly.y. Presumably this is because a regexp lexer isn't powerful enough to deal with the extreme requirements of parsing perl.
yacc is an acronym for "yet another compiler compiler". Normally it is used with the regexp tool lex to build the "lexer" part. The job of a lexer is to identify "tokens": typically words, but also numbers, operators, etc. The grammar explains how those tokens can be put together in a syntactically correct manner.
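As a concrete (and entirely made-up) picture of what "identifying tokens" means, a toy regexp lexer might look like this; the input string is invented, and the token regexps echo the terminal statements of the sample grammar further down:

my $input = 'max(1, foo) + 2.5';
my @tokens;
while ($input =~ /\G \s* ( [a-z][a-z0-9]* | [0-9]+(?:[.][0-9]*)? | [(),+*\/-] )/gcx) {
    push @tokens, $1;     # each token: identifier, number, or operator
}
print join(' | ', @tokens), "\n";   # max | ( | 1 | , | foo | ) | + | 2.5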
The most used perl grammar handler is Parse::RecDescent, which generates "top-down recursive-descent text parsers" from a grammar.
I'm not going to get into the full usage of any of the grammar tools, but I'll present a sample grammar to show the usefulness of those tools. This grammar defines the syntax for a function call. The call can take an argument list of nothing, or one or more integer or floating point numbers, quoted strings, variables, or other function calls, which themselves can have argument lists.
A comment on the syntax of the grammar. Identifiers appear on the left hand side of a colon when being defined, and appear unquoted on the right hand side. Regular expressions are /slash quoted/ and fixed strings are 'single quoted'. Concatenation is the default, alternation is indicated by a vertical bar, grouping is with parentheses, and zero or more repeats are indicated by a star. That bit is very much like regular expressions. Whitespace is for readability, except that new rules must start on new lines. "Terminal" statements don't use any of the grammar identifiers on the right-hand side; "non-terminal" ones do.
# Non-terminal statements
functioncall : identifier '(' arglist ')'
arglist      : empty | ( arg ',' )* arg
arg          : number | string | variable | functioncall
variable     : identifier

# Terminal statements
identifier   : /[a-z][a-z0-9]*/
number       : /[0-9]+([.][0-9]*)?/
string       : /"(\\[\\"]|[^"]+)*"/
empty        : ''
Note how the statements mix quoted strings, regexps, and regexp meta characters used in a new way. Note also that the balanced parentheses requirement here (for function calls as arguments) means regexps cannot do this.
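The grammar above is written in a generic notation rather than in Parse::RecDescent's own syntax. Purely as a rough sketch of how it might be adapted, something like the following could work; Parse::RecDescent spells repetition with (s /,/) and (?) rather than a star, the test strings are mine, and functioncall has been moved ahead of variable so that nested calls are tried first:

use strict;
use warnings;
use Parse::RecDescent;

my $grammar = <<'END_GRAMMAR';
    functioncall : identifier '(' arglist(?) ')'
    arglist      : arg(s /,/)
    arg          : number | string | functioncall | variable
    variable     : identifier
    identifier   : /[a-z][a-z0-9]*/
    number       : /[0-9]+([.][0-9]*)?/
    string       : /"(\\[\\"]|[^"]+)*"/
END_GRAMMAR

my $parser = Parse::RecDescent->new($grammar) or die "bad grammar";

for my $call ('f()', 'max(1, 2.5)', 'g(f(x), "hi")', 'broken(') {
    print "$call: ",
          (defined $parser->functioncall($call) ? "parses" : "fails"), "\n";
}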
by Benjamin Elijah Griffin / Eli the Bearded, Feb 2004