Perl/Addendum
Revision as of 13:04, 24 March 2008
healing slasheritis
In the standard Unix tools such as sed, a regular expression is enclosed in a pair of slashes, i.e. '/pattern/'.
A non-printing character is written with the backslash ("escape") character '\', e.g. '/\n/' represents
a newline character in a pattern. Certain printing characters also need to be escaped, and the delimiter '/' itself
is of course one of them. So, to match against '/', the pattern is written as '/\//'.
This is not uncommon, for example in file (path) names.
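As a minimal sketch of the escaping rule just described, the following lines match, count and replace slashes in a path name while keeping the slash as the delimiter. The sample path and the variable names are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample path, chosen only for illustration.
my $path = '/usr/local/bin/perl';

# The slash is the delimiter, so a literal slash in the pattern is written \/ .
my $has_slash = ($path =~ /\//) ? 1 : 0;

# Count the slashes: in list context a /g match returns every match,
# and assigning that list to a scalar yields the number of elements.
my $count = () = $path =~ /\//g;

# Replace each slash by a colon; the slash in the pattern needs its backslash.
(my $colons = $path) =~ s/\//:/g;

print "has slash: $has_slash\n";
print "slashes:   $count\n";
print "rewritten: $colons\n";
```

Even in this small case the pattern '/\//' is harder to read than the single character it stands for.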
It gets confusing quickly if e.g. '\/' is to be replaced by its duplicate '\/\/'.
Both the backslash and the slash need to be escaped: '/\\\//' represents the string '\/' inside a pattern definition.
The "substitute" construct '$g =~ s/a/b/' (substitute 'a' by 'b') explodes into the so-called slasheritis:
'$g =~ s/\\\//\\\/\\\//', i.e. such regular expression patterns quickly become unreadable.
Perl's solution is to allow the definition of pattern delimiters "on-the-fly". After all, Perl knows exactly that a pattern
definition begins directly after the operator name ('s' for substitute, 'm' for match), so why not take the next character,
well chosen by the programmer, to represent the delimiter?
Now you can resolve the above slasheritis by writing '$g =~ s#\\/#\\/\\/#' (you still need to escape
the backslash), and everything is (somewhat) clearer again. It is customary to use non-alphanumeric characters
such as '!', '#' or '|' as delimiters, but since Perl knows about paired characters such as '<>' or '{}',
some well-known Perl authors prefer this style: '$g =~ s{\\/} {\\/\\/}', because it is even clearer.
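The three spellings discussed above can be compared side by side. This is only a sketch: the input string and the variable names are hypothetical, and the 'g' modifier is added here so that every occurrence is replaced, not just the first.

```perl
use strict;
use warnings;

# Hypothetical input: a path whose slashes are already backslash-escaped.
# In a single-quoted string, \/ stays a backslash followed by a slash.
my $input = 'usr\/local\/bin';

# Slash delimiters: every backslash and every slash has to be escaped.
(my $with_slashes = $input) =~ s/\\\//\\\/\\\//g;

# Hash delimiters: only the backslash still needs escaping.
(my $with_hashes = $input) =~ s#\\/#\\/\\/#g;

# Paired braces: pattern and replacement each get their own pair.
(my $with_braces = $input) =~ s{\\/}{\\/\\/}g;

print "$with_slashes\n";
print "$with_hashes\n";
print "$with_braces\n";
```

All three statements produce the same string; they differ only in how much escaping the reader has to decode.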
special symbols
Perl introduced a whole new flock of shortcuts for classes of characters, usually combined with their (upper case) complement,
i.e., '/\s/' stands for all "white" characters (blank, tab, newline, and a few special ones),
and '/\S/' (capital 'S') stands for all non-white characters. Similarly, '/\d/' stands for numerical characters ("digit"),
'/\D/' for non-digits, and '/\w/' for "word" characters (letters, digits, and the underscore), with '/\W/' as its complement.
The whole list can be found in the "Camel" book [1].
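A short sketch of these shortcuts in use, assuming plain ASCII input. The sample line and the variable names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sample line, chosen only for illustration.
my $line = "item 42\tcosts 7 EUR\n";

my @digits  = $line =~ /\d/g;     # single digits
my @numbers = $line =~ /\d+/g;    # runs of digits
my @words   = $line =~ /\w+/g;    # runs of word characters

# Count the whitespace characters (blanks, the tab, the newline).
my $whitespace = () = $line =~ /\s/g;

# The upper-case form is the complement: deleting every \D leaves only digits.
(my $digits_only = $line) =~ s/\D//g;

print "digits:      @digits\n";
print "numbers:     @numbers\n";
print "words:       @words\n";
print "whitespace:  $whitespace\n";
print "digits only: $digits_only\n";
```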
inline comments
Since version 5.002 a regular expression can be written with inline comments if the closing delimiter is followed by the 'x' modifier.
With 'x' in effect, whitespace inside the pattern is ignored and '#' starts a comment that runs to the end of the line.
Here is a short program to eliminate comments from HTML code (by Perl author Tom Christiansen, with his original comments):
#!/usr/bin/perl -p0777
#
# htdecom -- remove html comments from a document
# tchrist@perl.com
#
# taken from the larger striphtml program
require 5.002;
s{ <!               # comments begin with a `<!'
                    # followed by 0 or more comments;
    (.*?)           # this is actually to eat up comments in non
                    # random places
    (               # not suppose to have any white space here
                    # just a quick start;
      --            # each comment starts with a `--'
        .*?         # and includes all text up to and including
      --            # the *next* occurrence of `--'
        \s*         # and may have trailing while space
                    #   (albeit not leading white space XXX)
    )+              # repetire ad libitum XXX should be * not +
    (.*?)           # trailing non comment text
 >                  # up to a `>'
}{
    if ($1 || $3) { # this silliness for embedded comments in tags
        "<!$1 $3>";
    }
}gesx;              # mutate into nada, nothing, and niente
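For readers new to the 'x' modifier, a much smaller sketch may help. It is unrelated to htdecom: the input string, the date format and the variable names are assumptions made only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical input; the year-month-day format is an assumption of this sketch.
my $text = 'released on 2008-03-24';

# In list context the match returns the three captured groups.
my ($year, $month, $day) = $text =~ m{
    (\d{4})     # year: four digits
    -           # literal hyphen
    (\d{2})     # month: two digits
    -           # literal hyphen
    (\d{2})     # day: two digits
}x;

print "$year/$month/$day\n";
```

Because whitespace and '#' are no longer taken literally under 'x', a literal blank has to be written as '\ ' or '[ ]' and a literal hash as '\#'.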
Literature
- [1] Larry Wall, Tom Christiansen, Jon Orwant: Programming Perl (the Camel Book). O'Reilly Media, Inc.; 3rd edition (July 14, 2000). ISBN 0596000278. The standard reference.
- [2] Jeffrey E. F. Friedl: Mastering Regular Expressions (the Owl Book). O'Reilly Media, Inc.; 3rd edition (August 8, 2006). ISBN 0596528124. All you ever need to know about regular expressions; not Perl-specific.