Section 12.13. Unconstrained Repetitions

Be specific when matching "as much as possible".

The .* construct is a particularly blunt and ponderous weapon, especially under /s. For example, consider the following parser for some very simple language, in which source code, data, and configuration information are separated by % and & characters (which are otherwise illegal):

    
    # Format is: <statements> % <data> & <config>...

    if ($source =~ m/\A  (.*)  %  (.*)  &  (.*) /xms) {
        my ($statements, $data, $config) = ($1, $2, $3);

        my $prog = compile($statements, {config=>$config});
        my $res  = execute($prog, {data=>$data, config=>$config});
    }
    else {
        croak 'Invalid program';
    }

Under /s, the first .* will successfully match the entire string in $source. Then it will attempt to match a %, and immediately fail (because there's none of the string left to match). At that point the regex engine will backtrack one character from the end of the string and try to match a % again, which will probably also fail. So it will backtrack one more character, try again, backtrack once more, try again, et cetera, et cetera, et cetera.

Eventually it will backtrack far enough to successfully match %, whereupon the second .* will match the remainder of the string, then fail to match &, backtrack one character, try again, fail again, and the entire "one-step-forward-two-steps-back" sequence will be played out again. Sequences of unconstrained matches like this can easily cause regular expression matches to become unacceptably slow.

Using a .*? can help in such cases:

    if ($source =~ m/\A  (.*?)  %  (.*?)  &  (.*) /xms) {
        my ($statements, $data, $config) = ($1, $2, $3);

        my $prog = compile($statements, {config=>$config});
        my $res  = execute($prog, {data=>$data, config=>$config});
    }
    else {
        croak 'Invalid program';
    }

since the "parsimonious repetitions" will then consume as little of the string as possible. But, to do this, they effectively have to do a look-ahead at every character they match, which can also become expensive if the terminator is more complicated than just a single character.

More importantly, both .* and .*? can also mask logical errors in the parsing process. For example, if the program incorrectly had an extra % or & in it, that would simply be consumed by one of the .* or .*? constructs, and therefore treated as part of the code or data, rather than as an error.
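A minimal sketch of that masking effect, assuming a made-up input with one stray % in it. The $source value and the parse_greedy and parse_lazy helper names are hypothetical and exist only for this illustration. Both patterns accept the malformed input; they merely disagree about which section swallows the extra %:

```perl
use strict;
use warnings;

# Hypothetical input: there is one % too many.
my $source = 'print x % y % 1 2 3 & verbose';

# Greedy version: each .* grabs as much as it can, then backs off.
sub parse_greedy {
    my ($text) = @_;
    return $text =~ m/\A  (.*)  %  (.*)  &  (.*) /xms;
}

# Parsimonious version: each .*? stops at the first workable delimiter.
sub parse_lazy {
    my ($text) = @_;
    return $text =~ m/\A  (.*?)  %  (.*?)  &  (.*) /xms;
}

# In list context a successful match returns its captures.
my @greedy = parse_greedy($source);
my @lazy   = parse_lazy($source);

print "greedy: [$greedy[0]] [$greedy[1]] [$greedy[2]]\n";
# prints: greedy: [print x % y ] [ 1 2 3 ] [ verbose]

print "lazy:   [$lazy[0]] [$lazy[1]] [$lazy[2]]\n";
# prints: lazy:   [print x ] [ y % 1 2 3 ] [ verbose]
```

Neither call reaches an error branch, so the surplus % ends up silently inside the statements (greedy) or the data (parsimonious).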

If you know precisely what character (or characters) the terminator of a "match anything" sequence will be, then it's very much more efficient, and clearer, to use a complemented character class instead:

    # Format is: <statements> % <data> & <config>...

    if ($source =~ m/\A  ([^%]*)  %  ([^&]*)  &  (.*) /xms) {
        my ($statements, $data, $config) = ($1, $2, $3);

        my $prog = compile($statements, {config=>$config});
        my $res  = execute($prog, {data=>$data, config=>$config});
    }
    else {
        croak 'Invalid program';
    }

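This version matches every non-% (using [^%]*), followed by a %, followed by every non-& (via [^&]*), followed by a &, followed by the rest of the string (.*). The principal advantage is that the complemented character classes don't have to do per-character look-ahead like .*?, nor per-character backtracking like .*. This version also guarantees that the captured statements never contain a % and the captured data never contains a &. It does not catch every stray delimiter, though: a & among the statements, a % in the data, or either character in the trailing .* is still accepted, so if those must be errors, exclude both characters from every section (for example, with [^%&]*). Once again, you're encoding your exact intentions.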
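As a rough check of which inputs the complemented classes accept, the sketch below runs the pattern above against a few made-up strings, alongside a stricter variant. Each class guards only against its own terminator, so a surplus delimiter elsewhere still gets through. The stricter pattern (every section written as [^%&]*, plus a \z anchor) is an assumption of this sketch rather than part of the example above, and applies only if % and & really are illegal everywhere except as the two separators. The parse_loose and parse_strict names are likewise hypothetical.

```perl
use strict;
use warnings;

# The pattern from the text: statements stop at the first %,
# data stops at the next &, and the config is whatever remains.
sub parse_loose {
    my ($text) = @_;
    return $text =~ m/\A  ([^%]*)  %  ([^&]*)  &  (.*) /xms;
}

# Assumed stricter variant: no section may contain either delimiter.
sub parse_strict {
    my ($text) = @_;
    return $text =~ m/\A  ([^%&]*)  %  ([^%&]*)  &  ([^%&]*)  \z/xms;
}

my @samples = (
    'a=1 % 1 2 3 & verbose',     # well-formed
    'a=1 % 1 % 3 & verbose',     # stray % in the data
    'a=1 % 1 2 3 & ver&bose',    # stray & in the config
);

for my $source (@samples) {
    my $loose  = parse_loose($source)  ? 'accepted' : 'rejected';
    my $strict = parse_strict($source) ? 'accepted' : 'rejected';
    print "$source => loose: $loose, strict: $strict\n";
}
```

The loose pattern accepts all three samples, while the strict one accepts only the first.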

Note that the .* at the end of the regex is still perfectly okay. When it finally gets its chance and gobbles up the rest of the source, the match will then be finished, so no backtracking will ever occur. On the other hand, putting a .*? at the end of a regular expression is always a mistake, as it will always successfully match nothing, at which point the pattern match will succeed and then terminate. A final .*? is either redundant, or it's not doing what you intended, or you forgot a \z anchor.
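A small sketch of the trailing .*? problem, using a made-up string (the variable names are hypothetical). Without an anchor, the final parsimonious group is satisfied by the empty string; adding \z forces it to stretch to the end:

```perl
use strict;
use warnings;

my $source = 'a=1 % 1 2 3 & verbose';

# Trailing .*? with nothing after it: matching zero characters suffices.
my ($unanchored) = $source =~ m/&  (.*?) /xms;

# The same group followed by \z has to extend to the end of the string.
my ($anchored)   = $source =~ m/&  (.*?)  \z/xms;

print "unanchored: [$unanchored]\n";   # prints: unanchored: []
print "anchored:   [$anchored]\n";     # prints: anchored:   [ verbose]
```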