12.20. Canned Regexes

Consider using Regexp::Common instead of writing your own regexes.

Regular expressions are wonderfully easy to code wrongly: to miss edge-cases, to include unexpected (and incorrect) matches, or to create a pattern that's correct but hopelessly inefficient. And even when you get your regex right, you still have to maintain the code that you used to build it.

It's a drag. Worse, it's everybody's drag. All around the world there are thousands of Perl programmers continually reinventing the same regexes: to match numbers, and URLs, and quoted strings, and programming language comments, and IP addresses, and Roman numerals, and zip codes, and Social Security numbers, and balanced brackets, and credit card numbers, and email addresses.

Fortunately, there's a CPAN module named Regexp::Common, whose entire purpose is to generate these kinds of everyday regular expressions for you. The module installs a single hash (%RE), through which you can create thousands of commonly needed regexes.

For example, instead of building yourself a number-matcher:

    # Build a regex that matches floating point representations...
    Readonly my $DIGITS    => qr{ \d+ (?: [.] \d*)? | [.] \d+         }xms;
    Readonly my $SIGN      => qr{ [+-]                                }xms;
    Readonly my $EXPONENT  => qr{ [Ee] $SIGN? \d+                     }xms;
    Readonly my $NUMBER    => qr{ ( ($SIGN?) ($DIGITS) ($EXPONENT?) ) }xms;

    # and later...

    my ($number)
        = $input =~ $NUMBER;

you can ask Regexp::Common to do it for you:

    use Regexp::Common;

    # Build a regex that matches floating point representations...
    Readonly my $NUMBER => $RE{num}{real}{-keep};

    # and later...

    my ($number)
        = $input =~ $NUMBER;
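
Note that the generated pattern is unanchored, so on its own it finds the first number anywhere in the input. If the aim is to check that an entire string is a number, one option is to anchor the pattern yourself. The following is only a minimal sketch: the parse_number() helper and the sample inputs are hypothetical, and it relies on the documented behaviour that -keep places the entire match in $1:

```perl
use strict;
use warnings;
use Regexp::Common;

# With -keep, $1 holds the entire matched number
my $NUMBER = $RE{num}{real}{-keep};

# Hypothetical helper: return the number if the whole string is one,
# or undef otherwise
sub parse_number {
    my ($input) = @_;
    return $input =~ m/\A $NUMBER \z/xms ? $1 : undef;
}

# Sample inputs, chosen only for illustration
for my $input ('42', '-3.14', '6.02e23', 'forty-two') {
    my $number = parse_number($input);
    print "'$input': ", (defined $number ? "number $number" : 'not a number'), "\n";
}
```

Keeping the anchors outside the generated regex means the same $NUMBER can still be used unanchored elsewhere, for example to pull a number out of a longer line.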

And instead of beating your head against the appalling regex needed to match formal HTTP-style URIs:

    # Build a regex that matches HTTP addresses...
    Readonly my $HTTP => qr{

        \-_.!~*'(  ):@&=+\$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*)(?:;(?:(?:[a-zA-Z0-9
        \-_.!~*'(  ):@&=+\$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*))*)(?:/(?:(?:(?:[a-zA-Z0-9
        \-_.!~*'(  ):@&=+\$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*)(?:;(?:(?:[a-zA-Z0-9
        \-_.!~*'(  ):@&=+\$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*))*))*))(?:[?]
        (?:(?:(?:[;/?:@&=+\$,a-zA-Z0-9\-_.!~*'(  )]+|(?:%[a-fA-F0-9][a-fA-F0-9

    # Find web pages...
    URI:
    while (my $uri = <>) {
        next URI if $uri !~ m/ $HTTP /xms;
        print $uri;
    }

You can just use:

    use Regexp::Common;

    # Find web pages...
    URI:
    while (my $uri = <>) {
        next URI if $uri !~ m/ $RE{URI}{HTTP} /xms;
        print $uri;
    }
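
The canned URI pattern can also be adjusted and taken apart. As a hedged sketch, assuming the module's documented -scheme flag (the default scheme is plain http) and its documented -keep layout for HTTP URIs ($1 the whole URI, $2 the scheme, $3 the host), a hypothetical first_host() helper might look like this:

```perl
use strict;
use warnings;
use Regexp::Common;

# Accept both http and https, and capture the parts of each match
my $HTTP_URI = $RE{URI}{HTTP}{-scheme => 'https?'}{-keep};

# Hypothetical helper: return the host of the first HTTP(S) URI in $text,
# or undef if there is none
sub first_host {
    my ($text) = @_;
    return $text =~ m/$HTTP_URI/ ? $3 : undef;
}

# Sample text, chosen only for illustration
my $host = first_host('See https://www.example.com/docs?page=2 for details');
print defined $host ? "$host\n" : "no URI found\n";
```

The capture-group numbering is taken from the module's documentation rather than from the text above, so it is worth confirming against the version you have installed.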

The benefits are perhaps most noticeable when you need a slight variation on a common regex, such as one that matches numbers in base 12, with between six and nine duodecimal places:

    use Regexp::Common;

    # The alien hardware device requires duodecimal floating-point numbers...
    Readonly my $NUMBER => $RE{num}{real}{-base=>12}{-places=>'6,9'}{-keep};

    # and later...

    my ($number)
        = $input =~ m/$NUMBER/xms;

or a regular expression to help expurgate potentially rude words:

    use Regexp::Common;

    # Clean up their [DELETED] language...
    $text =~ s{ $RE{profanity}{contextual} }{[DELETED]}gxms;

or a pattern that checks Australian postcodes:

    use Regexp::Common;
    use IO::Prompt;

    # Strewth, better find out where this bloke lives...
    my $postcode
        = prompt 'Giz ya postcode, mate: ',
                 -require=>{'Try again, cobber: ' => qr/\A $RE{zip}{Australia} \Z/xms};

The regexes produced by Regexp::Common are reliable, robust, and efficient, because they're in wide and continual use (i.e., endlessly crash-tested), and they're regularly maintained and enhanced by some of the most competent developers in the Perl community. The module also has the most extensive test suite on the entire CPAN, with more than 175,000 tests.
