12.19. Constructing Regexes

Build complex regular expressions from simpler pieces.

Building a regular expression from the keys of a hash is a special case of a much more general best practice. Most worthwhile regexes, even those for simple tasks, are still too tedious or too complicated to code directly. For example, to extract the components of a number, you could write:

    my ($number, $sign, $digits, $exponent)
        = $input =~ m{ (                          # Capture entire number
                         ( [+-]? )                # Capture leading sign (if any)
                         ( \d+ (?: [.] \d*)?      # Capture mantissa: NNN.NNN
                         | [.] \d+                #               or:    .NNN
                         )
                         ( (?:[Ee] [+-]? \d+)? )  # Capture exponent (if any)
                       )
                     }xms;

Even with the comments, that pattern is bordering on unreadable. And checking that it works as advertised is highly non-trivial.
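One way to gain some confidence in a pattern like this is to run it against a small table of inputs whose components are known in advance. The following is only a minimal sketch: the subroutine name parse_number and the table of sample inputs are hypothetical and not part of the original example, and the pattern is left unanchored, exactly as written above.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the pattern shown above.
# Returns (number, sign, digits, exponent), or an empty list on failure.
sub parse_number {
    my ($input) = @_;
    return $input =~ m{ (                          # Capture entire number
                          ( [+-]? )                # Capture leading sign (if any)
                          ( \d+ (?: [.] \d*)?      # Capture mantissa: NNN.NNN
                          | [.] \d+                #               or:    .NNN
                          )
                          ( (?:[Ee] [+-]? \d+)? )  # Capture exponent (if any)
                        )
                      }xms;
}

# Illustrative inputs, each mapped to [ number, sign, digits, exponent ]
my %expected_for = (
    '42'      => [ '42',      '',  '42',   ''    ],
    '-1.5'    => [ '-1.5',    '-', '1.5',  ''    ],
    '+.75e-3' => [ '+.75e-3', '+', '.75',  'e-3' ],
    '6.02E23' => [ '6.02E23', '',  '6.02', 'E23' ],
);

for my $input (sort keys %expected_for) {
    my $got  = join '|', parse_number($input);
    my $want = join '|', @{ $expected_for{$input} };
    printf "%s: %s\n", ($got eq $want ? 'ok' : 'NOT ok'), $input;
}
```

Even this small table has to restate the whole pattern to exercise it, which hints at why checking a monolithic regex is awkward.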

But a regular expression is really just a program, so all the arguments in favour of program decomposition (see Chapter 9) apply to regexes too. In particular, it's often better to decompose a complex regular expression into manageable (named) fragments, like so:


    
    # Build a regex that matches floating-point representations...
    Readonly my $DIGITS    => qr{ \d+ (?: [.] \d*)? | [.] \d+         }xms;
    Readonly my $SIGN      => qr{ [+-]                                }xms;
    Readonly my $EXPONENT  => qr{ [Ee] $SIGN? \d+                     }xms;
    Readonly my $NUMBER    => qr{ ( ($SIGN?) ($DIGITS) ($EXPONENT?) ) }xms;

    # and later...

    my ($number, $sign, $digits, $exponent)
        = $input =~ $NUMBER;

Here, the full $NUMBER regex is built up from simpler components ($DIGITS, $SIGN, and $EXPONENT), in much the same way that a full Perl program is built from simpler subroutines. Notice that, once again, refactoring cleans up both the refactored code itself and the place that code is later used.
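A further benefit of named fragments is that each one can be checked in isolation before it is composed into something larger. The sketch below illustrates the idea under a few assumptions: the helper matches_fully and the table of checks are hypothetical, and plain "my" variables stand in for Readonly so that the code needs only core Perl.

```perl
use strict;
use warnings;

# The same fragments as above; plain "my" replaces Readonly here only so
# that the sketch runs with nothing but core Perl.
my $DIGITS   = qr{ \d+ (?: [.] \d*)? | [.] \d+         }xms;
my $SIGN     = qr{ [+-]                                }xms;
my $EXPONENT = qr{ [Ee] $SIGN? \d+                     }xms;
my $NUMBER   = qr{ ( ($SIGN?) ($DIGITS) ($EXPONENT?) ) }xms;

# Hypothetical helper: returns 1 if $text matches $fragment from start
# to end, and 0 otherwise.
sub matches_fully {
    my ($fragment, $text) = @_;
    return $text =~ m{ \A $fragment \z }xms ? 1 : 0;
}

# Each row: [ description, fragment, input, expected result ]
my @checks = (
    [ 'DIGITS accepts 3.14',   $DIGITS,   '3.14', 1 ],
    [ 'DIGITS accepts .5',     $DIGITS,   '.5',   1 ],
    [ 'DIGITS rejects .',      $DIGITS,   '.',    0 ],
    [ 'SIGN accepts -',        $SIGN,     '-',    1 ],
    [ 'EXPONENT accepts E+10', $EXPONENT, 'E+10', 1 ],
    [ 'EXPONENT rejects e',    $EXPONENT, 'e',    0 ],
);

for my $check (@checks) {
    my ($description, $fragment, $input, $expected) = @{$check};
    my $got = matches_fully($fragment, $input);
    printf "%s: %s\n", ($got == $expected ? 'ok' : 'NOT ok'), $description;
}

# The composed regex still yields the four captures described in the text
my ($number, $sign, $digits, $exponent) = '-1.5e3' =~ $NUMBER;
print "number=$number sign=$sign digits=$digits exponent=$exponent\n";
```

Because every fragment has a name, a failing check points directly at the component that needs attention.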

Note, however, that interpolating qr'd regexes inside other qr'd regexes (as in the previous example) may impose a performance penalty in some cases. That's because when the component regexes are interpolated, they are first decompiled back to strings, then interpolated, and finally recompiled. Unfortunately, the conversion of the individual components back to strings is not optimized, and will sometimes produce less efficient patterns, which are then recompiled into less efficient regexes.
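To see what the interpolation works with, it can help to print a qr'd fragment as a string. This is just an illustrative sketch using plain "my" variables in place of Readonly; the exact flag notation in the output differs between Perl versions, so no particular output is assumed here.

```perl
use strict;
use warnings;

my $SIGN     = qr{ [+-] }xms;
my $EXPONENT = qr{ [Ee] $SIGN? \d+ }xms;

# A qr object stringifies to a parenthesized group that carries its own
# flags; that string is what gets interpolated into the enclosing pattern.
my $sign_as_string     = "$SIGN";
my $exponent_as_string = "$EXPONENT";

print "SIGN     as a string: $sign_as_string\n";
print "EXPONENT as a string: $exponent_as_string\n";

# The stringified $SIGN appears verbatim inside the stringified $EXPONENT
my $is_embedded = index($exponent_as_string, $sign_as_string) >= 0;
print 'SIGN embedded in EXPONENT: ', ($is_embedded ? 'yes' : 'no'), "\n";
```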

The alternative is to use q{} or qq{} strings to specify the components. Using strings ensures that what you write in a component is exactly what's later interpolated from it:

    
    # Build a regex that matches floating-point representations...
    Readonly my $DIGITS    =>  q{ (?: \d+ (?: [.] \d*)? | [.] \d+   ) };
    Readonly my $SIGN      =>  q{ (?: [+-]                          ) };
    Readonly my $EXPONENT  => qq{ (?: [Ee] $SIGN? \\d+              ) };
    Readonly my $NUMBER    => qr{ ( ($SIGN?) ($DIGITS) ($EXPONENT?) ) }xms;

However, using qr{} instead of strings is still the recommended practice here. Specifying subpatterns in a q{} or qq{} string requires very careful attention to the use of escape characters (such as writing \\d in some, but not all, of the components). You must also remember to add an extra (?:...) around each subpattern, to ensure that the final interpolated string is treated as a single item (for example, so the ? in $EXPONENT? applies to the entire exponent subpattern). In contrast, the inside of a qr{} always behaves exactly like the inside of an m{} match, so no arcane metaquoting is required.
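Both pitfalls are easy to reproduce. The following sketch is illustrative only: the variable names are hypothetical, a simplified integer mantissa (\d+) stands in for $DIGITS, and plain "my" variables replace Readonly so that only core Perl is needed.

```perl
use strict;
use warnings;

my $SIGN = q{ (?: [+-] ) };

# Pitfall 1: without the (?:...) wrapper, a trailing ? binds only to the
# last item of the interpolated string, so the [Ee] stays mandatory.
my $bare_exponent    = qq{[Ee] $SIGN? \\d+};
my $wrapped_exponent = qq{(?: [Ee] $SIGN? \\d+ )};

my $bare_re    = qr{ \A ( \d+ ) ( $bare_exponent? )    \z }xms;
my $wrapped_re = qr{ \A ( \d+ ) ( $wrapped_exponent? ) \z }xms;

my $bare_matches    = '42' =~ $bare_re    ? 1 : 0;
my $wrapped_matches = '42' =~ $wrapped_re ? 1 : 0;

print "Unwrapped exponent matches '42': $bare_matches\n";
print "Wrapped exponent matches '42':   $wrapped_matches\n";

# Pitfall 2: inside qq{}, a single backslash before d is not preserved
my $wrong_digits = do {
    no warnings 'misc';    # silence the "Unrecognized escape" warning
    qq{ \d+ };
};
my $right_digits = qq{ \\d+ };

print "qq{ \\d+ }  yields: [$wrong_digits]\n";
print "qq{ \\\\d+ } yields: [$right_digits]\n";
```

With the wrapper missing, the optional exponent silently becomes a required one, which is the kind of error that is hard to spot by reading the final pattern.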

If you need to build very complicated regular expressions, you should also look at the Regexp::Assemble CPAN module, which allows you to build regexes in an OO style, and then optimizes the resulting patterns to minimize backtracking. The module can also optionally insert debugging information into the regular expressions it builds, which can be invaluable for highly complex regexes.
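As a rough illustration of that OO style, the sketch below assembles a handful of literal words into a single pattern. It assumes Regexp::Assemble is installed from CPAN (the code skips the demonstration if it is not), the word list is purely hypothetical, and only the module's basic new, add, and re methods are used.

```perl
use strict;
use warnings;

# Hypothetical word list, chosen so the alternatives share prefixes
my @words = qw( carpet carport cart carton );
my $re;

if ( eval { require Regexp::Assemble; 1 } ) {
    my $assembler = Regexp::Assemble->new;
    $assembler->add($_) for @words;    # feed in one pattern at a time
    $re = $assembler->re;              # retrieve the combined regex
    print "Assembled pattern: $re\n";

    for my $word (@words) {
        my $matched = $word =~ m{ \A $re \z }xms ? 'matches' : 'does NOT match';
        print "$word $matched\n";
    }
}
else {
    print "Regexp::Assemble is not installed; skipping the demonstration\n";
}
```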