5.7. Match Variables

Don't use the regex match variables.

Whenever you use English, it's important to load the module with a special argument:

    use English qw( -no_match_vars );

This argument prevents the module from creating the three "match variables": $PREMATCH (or $`), $MATCH (or $&), and $POSTMATCH (or $'). Whenever these variables appear anywhere in a program, they force every regular expression in that program to save three extra pieces of information: the substring the match initially skipped (the "prematch"), the substring it actually matched (the "match"), and the substring that followed the match (the "postmatch").

Every regex has to do this every time any pattern match succeeds, because these punctuation variables are global in scope, and hence available everywhere. So the regex that sets them might not be in the same lexical scope, the same package, or even the same file as the code that next uses them. The compiler can't know which regex will have been the most recently successful at any point, so it has to play it safe and set the match variables every time any regex anywhere matches, in case that particular match is the one that precedes the use of one of the match variables.

This particular problem neatly illustrates why all non-lexical variables cause difficulties. The presence of $`, $&, or $' immediately couples a particular piece of code to (potentially) every single regex in your program. Leaving aside the extra workload that connection imposes on every pattern match, this also means that debugging pattern matches can be much more difficult. If one of the match variables doesn't contain what you expected, it's possible that's because it was actually set by some pattern match other than the one you thought was setting it. And that pattern match could be anywhere in your source code.
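
As a small illustration of that failure mode, here is a sketch in which a failed match leaves a match variable holding whatever the previous successful match put there. The strings and variable names are invented for the example, and $& appears only to demonstrate the hazard:

```perl
use strict;
use warnings;

# Hypothetical log lines, invented for this illustration
my $first_line  = 'error: disk full';
my $second_line = 'all good';

# This match succeeds and sets $& to 'error: disk full'
$first_line =~ m/error: .*/xms;

# This match fails, so it leaves $& exactly as it was
$second_line =~ m/error: .*/xms;

# Reads like "what the second match found", but is left over from the first
my $stale = $&;

print "Second line reported: $stale\n";
```

Nothing in the last two statements hints that the value came from the first match, which is what makes this kind of bug hard to track down.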

Don't ever use the match variables:

    use English;

    my ($name, $birth_year)
        = $manuscript =~ m/(\S+) \s+ was \s+ born \s+ in \s+ (\d{4})/xms;

    if ($name) {
        print $PREMATCH,
              qq{<born date="$birth_year" name="$name">},
              $MATCH,
              q{</born>},
              $POSTMATCH;
    }

It's better to use extra capturing parentheses to retain the required context information:

    my ($prematch, $match, $name, $birth_year, $postmatch)
        = $manuscript =~ m{ (\A.*?)              # capture prematch from start
                            (                    # then capture entire match...
                                (\S+) \s+ was \s+ born \s+ in \s+ (\d{4})
                            )
                            (.*\z)               # then capture postmatch to end
                          }xms;

    if ($name) {
        print $prematch,
              qq{<born date="$birth_year" name="$name">},
              $match,
              q{</born>},
              $postmatch;
    }

This solution avoids imposing a performance penalty on every regex match when you're only using the match variables from one. However, it does penalize this particular regex in another way: by making it much uglier, and burying the significant part of the regex under a mound of extra parentheses. It can also be tricky to remember that the entire match is now the second capture, and so the $match variable has to be declared ahead of $name and $birth_year. Indeed, having the entire match captured ahead of parts of the match may seem counterintuitive to subsequent readers of the code.
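
To make the ordering of those captures concrete, here is a self-contained run of the extra-parentheses approach. The sample sentence and the $tagged variable are invented for this sketch; only the regex structure comes from the example above:

```perl
use strict;
use warnings;

# Hypothetical sample text, invented for this illustration
my $manuscript = 'It is said that Mozart was born in 1756, in Salzburg.';

# Captures are numbered by opening parenthesis, so the whole match (2nd)
# comes before the name (3rd) and the year (4th)
my ($prematch, $match, $name, $birth_year, $postmatch)
    = $manuscript =~ m{ (\A.*?)              # prematch: from start of string
                        (                    # entire match...
                            (\S+) \s+ was \s+ born \s+ in \s+ (\d{4})
                        )
                        (.*\z)               # postmatch: to end of string
                      }xms;

# Build the result as a string instead of printing it piecewise
my $tagged = $prematch
           . qq{<born date="$birth_year" name="$name">}
           . $match
           . q{</born>}
           . $postmatch;

print "$tagged\n";
```

With this input, $prematch is 'It is said that ', $match is 'Mozart was born in 1756', and $postmatch is ', in Salzburg.'.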

A cleaner solution is to use the Regexp::MatchContext CPAN module. This module extends the Perl regex syntax with a new metasyntactic construct: (?p). The module also exports three subroutines named PREMATCH( ), MATCH( ), and POSTMATCH( ). These subroutines return those respective parts of the match context of the most recent regex with a (?p) marker anywhere inside it.

You could simplify the previous example by rewriting it like this:

    use Regexp::MatchContext;

    my ($name, $birth_year)
        = $manuscript =~ m/(?p) (\S+) \s+ was \s+ born \s+ in \s+ (\d{4})/xms;

    if ($name) {
        print PREMATCH( ),
              qq{<born date="$birth_year" name="$name">},
              MATCH( ),
              q{</born>},
              POSTMATCH( );
    }

Note how close this example is to the original version of the code. Apart from using three subroutines instead of three global variables, the only change from the original version is that you have to put a (?p) marker in the regex. That's a tiny bit more work, but it confers several significant advantages. For a start, it explicitly marks which regex is capturing the match variables, so it's easier to work out which code to debug when a match variable goes wrong.

Better still, unlike English, the Regexp::MatchContext module does the extra match-variable-preservation work only for those particular regexes that have a (?p) marker, so there's no longer an overhead imposed on all the other regexes in your program. And even in those regexes that do set the match variables, Regexp::MatchContext does most of the extra work lazily. That is, the information is extracted only when you actually use one of the match variables, not when the regex is originally matched.

Yet another advantage to using Regexp::MatchContext is that the subroutines it exports return a genuine substr-like substring, rather than a read-only copy. You can assign a value to MATCH( ) and that assignment will change the corresponding sections of the original string. For example, you could rework the following slightly obscure substitution:

    $html =~ s{.*? (<body> .* </body>) .*}      # Locate components of page
              {   $STD_HEADER                   # Ensure standard header is used
                . verify_body($1)               # Check contents
                . '</html>'                     # Remove any trailing extras
              }xmse;

replacing it with a more readable match-and-reassign version:

    use Regexp::MatchContext;

    if ($html =~ m{(?p) <body> .* </body>}xms) {   # Locate body of page (with context)
        PREMATCH( )  = $STD_HEADER;                # Ensure standard header is used
        MATCH( )     = verify_body( MATCH( ) );    # Check contents
        POSTMATCH( ) = '</html>';                  # Remove any trailing extras
    }
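
If installing a CPAN module isn't an option, much the same match-and-reassign effect can be sketched with core Perl alone, using the built-in offset arrays @- and @+ together with lvalue substr. This is only an approximation of the idea, not how Regexp::MatchContext is implemented. The normalise_page() wrapper, the header text, and the do-nothing verify_body() are placeholders invented for the sketch:

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the names used in the text
my $STD_HEADER = '<html><head><title>Standard</title></head>';
sub verify_body { my ($body) = @_; return $body; }   # placeholder: accepts anything

sub normalise_page {
    my ($html) = @_;

    if ($html =~ m{<body> .* </body>}xms) {
        # $-[0] and $+[0] are the start and end offsets of the last successful match
        my ($start, $end) = ($-[0], $+[0]);

        # Assign right-to-left, so the earlier offsets remain valid
        substr($html, $end)                  = '</html>';
        substr($html, $start, $end - $start)
            = verify_body( substr($html, $start, $end - $start) );
        substr($html, 0, $start)             = $STD_HEADER;
    }

    return $html;
}

print normalise_page('<html><head>old</head><body>Hello</body></html><!-- junk -->'), "\n";
```

Unlike the match variables, the offsets are copied into lexicals immediately after the match, so later pattern matches can't silently change them.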
