Section 12.22. Factoring Alternations

12.22. Factoring Alternations

Factor out common affixes from alternations.

It's not just single character alternatives that are slow. Any alternation of subpatterns can be expensive. Especially if the resulting set of alternatives involves a repetition.

Every alternative that has to be tried requires the regex engine to backtrack up the string and re-examine the same sequence of characters it just rejected. And, if the alternatives are inside a repeated subpattern, the repetition itself may have to backtrack and retry every alternative from a different starting point. That kind of nested backtracking can easily produce an exponential increase in the time the complete match requires.

As if those problems weren't bad enough, alternations aren't very smart either. If one alternative fails, the matching engine just backs up and tries the next possibility, with absolutely no forethought as to whether that next alternative can possibly match.

For example, when a regular expression like:

    m{
       with \s+ each \s+ $EXPR \s* $BLOCK
     | with \s+ each \s+ $VAR  \s* in \s* [(] $LIST [)] \s* $BLOCK
     | with \s+ [(] $LIST [)] \s* $BLOCK
    }xms

is matching a string, it obviously tries the first alternative first. Suppose the string begins 'with er go est...'. In that case, the first alternative will successfully match with, then successfully match \s+, then successfully match e, but will then fail to match r (since it expected ach at that point). So the regex engine will backtrack to the start of the string and try the second alternative instead. Once again, it will successfully match with and \s+ and e, but then once again fail to match r. So it will backtrack to the start of the string once more and try the third alternative. Yet again it will successfully match with, then \s+, before failing to match the [(].

That's much less efficient than it could be. The engine had to backtrack twice and, in doing so, it had to retest and rematch the same with \s+ subpattern three times, and the longer with \s+ e subpattern twice.

A human in the same situation would notice that all three alternatives start the same way, remember that the first four characters of the string matched the first time, and simply skip that part of the rematch on the rest of the alternatives.

But Perl doesn't optimize regexes in that way. There's no theoretical reason why it couldn't do so, but there is an important practical reason: it's prohibitively expensive to analyze every regex every time it's compiled, just to identify these kinds of occasional opportunities for optimization. The extra time required to be that clever almost always far outweighs any performance gain that might be derived. So Perl sticks with a "dumb but fast" approach instead.

So if you want Perl to be smarter about matching regexes of this kind, you have to do the thinking for it; analyze and optimize the regex yourself. That's not particularly difficult. You just put the set of alternatives in non-capturing parentheses:


    m{
       (?: with \s+ each \s+ $EXPR \s* $BLOCK
         | with \s+ each \s+ $VAR  \s* in \s* [(] $LIST [)] \s* $BLOCK
         | with \s+ [(] $LIST [)] \s* $BLOCK
       )
    }xms

then grab the common prefix shared by every alternative and factor it out, placing it in front of the parentheses:


    m{
       with \s+
       (?: each \s+ $EXPR \s* $BLOCK
         | each \s+ $VAR  \s* in \s* [(] $LIST [)] \s* $BLOCK
         | [(] $LIST [)] \s* $BLOCK
       )
    }xms

This version of the regex does exactly what a human (or programmer) would do: it matches the with \s+ once only, then tries the three alternative "completions", without stupidly backtracking to the very start of the string to recheck that the initial 'with ' is still there.

Of course, having made that optimization, you might well find it opens up other opportunities to avoid backtracking and rechecking. For example, the first two alternations in the non-capturing parentheses now both start with each \s+, so you could repeat the factoring-out on just those two alternatives, by first wrapping them in another set of parentheses:


    m{
       with \s+
       (?:
           (?: each \s+ $EXPR \s* $BLOCK
             | each \s+ $VAR  \s* in \s* [(] $LIST [)] \s* $BLOCK
           )
         | [(] $LIST [)] \s* $BLOCK
       )
    }xms

and then extracting the common prefix:


    m{
       with \s+
       (?: each \s+
           (?: $EXPR \s* $BLOCK
             | $VAR  \s* in \s* [(] $LIST [)] \s* $BLOCK
           )
         | [(] $LIST [)] \s* $BLOCK
       )
    }xms

Likewise, if every alternative ends in the same sequencein this case, \s* $BLOCKthen that common sequence can be factored out and placed after the alternatives:


    m{
       with \s+
       (?: each \s+
           (?:$EXPR
             | $VAR  \s* in \s* [(] $LIST [)]
           )
         | [(] $LIST [)]
       )
       \s* $BLOCK
    }xms

Note, however, that there is a significant price to be paid for these optimizations. Compared to the original:

    m{
       with \s+ each \s+ $EXPR \s* $BLOCK
     | with \s+ each \s+ $VAR  \s* in \s* [(] $LIST [)] \s* $BLOCK
     | with \s+ [(] $LIST [)] \s* $BLOCK
    }xms

the final version of the regex is considerably more efficient, but also considerably less readable. Of course, the original is no great beauty either, so that may not be a critical issue, especially if the refactored regex is appropriately commented and perhaps turned into a constant:


    Readonly my $WITH_BLOCK => qr{
       with \s+                                # Always a 'with' keyword
       (?: each \s+                            #  If followed by 'each'
           (?:$EXPR                            #    Then expect an expression
             | $VAR  \s* in \s* [(] $LIST [)]  #    or a variable and list
           )
         | [(] $LIST [)]                       #  Otherwise, no 'each' and just a list
       )
       \s* $BLOCK                              # And the loop block always at the end
    }xms;