Section 12.18. Tabular Regexes

Build regular expressions from tables.

Tables like the one shown at the end of the previous guideline are a cleaner way of structuring regex matches, but they can also be a cleaner way of building a regex in the first place, especially when the resulting regex will be used to extract keys for the table.

Don't duplicate existing table information as part of a regular expression:

    
    # Table of irregular plurals...
    my %irregular_plural_of = (
        'child'       => 'children',
        'brother'     => 'brethren',
        'money'       => 'monies',
        'mongoose'    => 'mongooses',
        'ox'          => 'oxen',
        'cow'         => 'kine',
        'soliloquy'   => 'soliloquies',
        'prima donna' => 'prime donne',
        'octopus'     => 'octopodes',
        'tooth'       => 'teeth',
        'toothfish'   => 'toothfish',
    );

    # Pattern matching any of those irregular plurals...
    my $has_irregular_plural = qr{
        child     | brother     | mongoose
      | ox        | cow         | monkey
      | soliloquy | prima donna | octopus
      | tooth(?:fish)?
    }xms;

    # Form plurals...
    while (my $word = <>) {
        chomp $word;

        if ($word =~ m/ ($has_irregular_plural) /xms) {
            print $irregular_plural_of{$word}, "\n";
        }
        else {
            print form_regular_plural_of($word), "\n";
        }
    }

Apart from the annoying redundancy of specifying each key twice, this kind of duplication is a prime opportunity for mistakes to creep in, as they did (twice) in the previous example^[*].

^[*] The regular expression shown matches 'monkey', but the particular irregular noun it's supposed to match in that case is 'money'. The regex also matches 'primadonna' instead of 'prima donna', because the /x flag makes the intervening space non-significant within the regex.
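Both mistakes are easy to observe by running the hand-written pattern against a few words. The following is only a sketch: the helper name is_matched() is hypothetical, and the \A and \z anchors are an assumption added here so that each word is tested as a whole.

```perl
use strict;
use warnings;

# The hand-written pattern from the example above, copied unchanged...
my $has_irregular_plural = qr{
    child     | brother     | mongoose
  | ox        | cow         | monkey
  | soliloquy | prima donna | octopus
  | tooth(?:fish)?
}xms;

# Hypothetical helper: does the whole word match the pattern?
sub is_matched {
    my ($word) = @_;
    return $word =~ m/\A $has_irregular_plural \z/xms ? 1 : 0;
}

for my $word ('money', 'monkey', 'prima donna', 'primadonna') {
    print "'$word': ", ( is_matched($word) ? 'matches' : 'no match' ), "\n";
}
```

The two table keys ('money' and 'prima donna') are rejected, whereas the two strings that are not in the table are accepted.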

It's much easier to ensure consistency between a look-up table and the regex that feeds it if the regex is automatically constructed from the table itself. That's relatively easy to achieve, by replacing the regex definition with:


    
    # Build a pattern matching any of those irregular plurals...
    my $has_irregular_plural
        = join '|', map {quotemeta $_} reverse sort keys %irregular_plural_of;

The assignment statement starts by extracting the keys from the table (keys %irregular_plural_of), then sorts them in reverse order (reverse sort keys %irregular_plural_of). Sorting them is critical because the order in which hash keys are returned is unpredictable, so there's a 50/50 chance that the key 'tooth' will appear in the key list before the key 'toothfish'. That would be unfortunate, because the list of keys is about to be converted to a list of alternatives, and regexes always try alternatives left to right, accepting the first one that matches. In that case, the word 'toothfish' would always be matched by the alternative 'tooth', rather than by the later alternative 'toothfish'. Reverse sorting prevents this: a plain sort places a string before any longer string that it prefixes, so reversing the result guarantees that 'toothfish' precedes 'tooth'.
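The effect of alternative order can be checked in isolation. This is a minimal sketch using only the two keys in question; the variable names are chosen here for illustration and the match is deliberately left unanchored.

```perl
use strict;
use warnings;

my @keys = ('tooth', 'toothfish');

# A plain sort puts the shorter key first; a reverse sort puts the longer first.
my $shortest_first = join '|', sort @keys;            # tooth|toothfish
my $longest_first  = join '|', reverse sort @keys;    # toothfish|tooth

# In list context, a successful match returns its captures...
my ($short_match) = 'toothfish' =~ m/ ($shortest_first) /xms;
my ($long_match)  = 'toothfish' =~ m/ ($longest_first)  /xms;

print "shortest first captures: $short_match\n";    # tooth
print "longest first captures:  $long_match\n";     # toothfish
```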

Once the keys are in a reliable order, the map operation escapes any metacharacters within the keys (map {quotemeta $_} reverse sort keys %irregular_plural_of). This step ensures, for example, that 'prima donna' becomes 'prima\ donna', and so behaves correctly under the /x flag. The various alternatives are then joined together with the standard "or" marker ('|') to produce the full pattern.
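The escaping step can be tried on its own. A small sketch, assuming only the built-in quotemeta function; the variable names are illustrative.

```perl
use strict;
use warnings;

my $key     = 'prima donna';
my $escaped = quotemeta $key;

print "$escaped\n";    # prima\ donna

# Under /x the raw key loses its space, but the escaped key keeps it...
my $raw_matches     = 'prima donna' =~ m/\A $key     \z/xms ? 'yes' : 'no';
my $escaped_matches = 'prima donna' =~ m/\A $escaped \z/xms ? 'yes' : 'no';

print "raw key matches:     $raw_matches\n";        # no
print "escaped key matches: $escaped_matches\n";    # yes
```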

Setting up this automated process takes a little extra effort, but it significantly improves the robustness of the resulting code. Not only does it eliminate the possibility of mismatches between the table keys and the regex alternatives, it also makes extending the table a one-step operation: just add the new singular/plural pair to the initialization of %irregular_plural_of; the pattern in $has_irregular_plural will automatically reconfigure itself accordingly.

About the only way the code could be further improved would be to factor out the hairy regex-building statements into a subroutine:


    
    # Build a pattern matching any of the arguments given...
    sub regex_that_matches {
        return join '|', map {quotemeta $_} reverse sort @_;
    }

    # and later...

    my $has_irregular_plural
        = regex_that_matches(keys %irregular_plural_of);

Note that, as is so often the case, refactoring shaggy code in this way not only cleans up the source in which the statements were formerly used, but also makes the refactored statements themselves a little less hirsute.
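Putting the pieces together, a complete script might look something like the following sketch. Several parts are assumptions made for illustration: the table is abbreviated, form_regular_plural_of() is a hypothetical stand-in that merely appends 's', plural_of() is a wrapper invented here, and the match is anchored with \A and \z so that only whole words reach the table.

```perl
use strict;
use warnings;

# Abbreviated table of irregular plurals...
my %irregular_plural_of = (
    'child'       => 'children',
    'ox'          => 'oxen',
    'prima donna' => 'prime donne',
    'tooth'       => 'teeth',
    'toothfish'   => 'toothfish',
);

# Build a pattern matching any of the arguments given...
sub regex_that_matches {
    return join '|', map {quotemeta $_} reverse sort @_;
}

# Hypothetical stand-in for the real regular-plural routine...
sub form_regular_plural_of {
    my ($word) = @_;
    return $word . 's';
}

my $has_irregular_plural
    = regex_that_matches(keys %irregular_plural_of);

# Hypothetical wrapper: look up irregular plurals, else fall back...
sub plural_of {
    my ($word) = @_;
    return $word =~ m/\A ($has_irregular_plural) \z/xms
         ? $irregular_plural_of{$1}
         : form_regular_plural_of($word);
}

for my $word ('child', 'prima donna', 'toothfish', 'cat') {
    print plural_of($word), "\n";
}
```

Adding a new pair to the table is then the only change needed to make plural_of() recognize another irregular noun.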

Note that if you're in some strange locale where strings with common prefixes don't sort shortest-to-longest, then you may need to be more specific (but less efficient) about your sorting order, by including an explicit length comparison in your sort block:


    
    # Build a pattern matching any of the arguments given...
    sub regex_that_matches {
        return join '|',
                    map {quotemeta $_}
                        # longest strings first, otherwise alphabetically...
                        sort { length($b) <=> length($a) or $a cmp $b }
                             @_;
    }

    # and later...

    my $has_irregular_plural
        = regex_that_matches(keys %irregular_plural_of);
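To see the ordering that the explicit length comparison produces, the subroutine can be called on a short list of words. This is only an illustrative sketch; the word list and the variable $pattern are chosen here and are not part of the original table-driven code.

```perl
use strict;
use warnings;

# Build a pattern matching any of the arguments given...
sub regex_that_matches {
    return join '|',
                map {quotemeta $_}
                    # longest strings first, otherwise alphabetically...
                    sort { length($b) <=> length($a) or $a cmp $b }
                         @_;
}

my $pattern = regex_that_matches(qw( ox tooth cow toothfish child ));

print "$pattern\n";    # toothfish|child|tooth|cow|ox
```

Because every alternative is at least as long as those after it, no key can be shadowed by one of its own prefixes, whatever collation order is in effect.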