Chapter 12. Regular Expressions

Some people, when confronted with a problem, think:
"I know, I'll use regular expressions".
Now they have two problems.
Jamie Zawinski

Regular expressions are one of the signature features of Perl, providing it with most of the practical extraction facilities for which it is famous. Many of those who are new to Perl (and many who aren't so new) approach regexes with mistrust, trepidation, or outright fear.

And with some justification. Regexes are specified in a compact and sometimes baroque syntax that is, all by itself, responsible for much of Perl's "executable line noise" reputation. Moreover, in the right hands, patterns are capable of performing mystifying feats of text recognition, analysis, transformation, and computation^[*].

^[*] As anyone who has seen Abigail's virtuoso "prime number identifier" must surely agree:

    sub is_prime {
        my ($number) = @_;
        return (1 x $number) !~ m/\A (?: 1? | (11+?) (?> \1+ ) ) \Z/xms;
    }

(Working out precisely how this regex works its wonders is left as a punishment for the reader.)

It's no wonder they scare so many otherwise stalwart Perl hackers.

And no surprise that they also figure heavily in many suboptimal programming practices, especially of the "cut-and-paste" variety. Or, more often, of the "cut-and-paste-and-modify-slightly-and-oh-now-it-doesn't-work-at-all-so-let's-modify-it-some-more-and-see-if-that-helps-no-it-didn't-but-we're-committed-now-so-maybe-if-we-change-that-bit-instead-hmmmm-that's-closer-but-still-not-quite-right-maybe-if-I-made-that-third-repetition-non-greedy-instead-oops-now-it's-back-to-not-matching-at-all-perhaps-I-should-just-post-it-to-PerlMonks.org-and-see-if-they-know-what's-wrong" variety.

Yet the secret to taming regular expressions is remarkably simple. You merely have to recognize them for what they really are, and treat them accordingly.

And what are regular expressions really? They're subroutines. Text-matching subroutines. Text-matching subroutines that are coded in an embedded programming language that's almost entirely unrelated to Perl.

Once you realize that regexes are just code, it becomes obvious that regex best practices will, for the most part, simply be adaptations of the universal coding best practices described in other chapters: consistent and readable layout, sensible naming conventions, decomposition of complex code, refactoring of commonly used constructs, choosing robust defaults, table-based techniques, code reuse, and test-driven development.

This chapter illustrates how those approaches can be applied to improving the readability, robustness, and efficiency of your regular expressions.
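
As a rough taste of what "regexes are just code" can mean in practice, here is a minimal sketch, not a recommended number-matcher. The names $DIGITS, $SIGN, $EXPONENT, $NUMBER, and is_number() are hypothetical, invented purely for illustration. The pattern is broken into small named pieces, laid out and commented under /xms, and then wrapped in an ordinary subroutine so that it can be tested like one:

```perl
use strict;
use warnings;

# Hypothetical building blocks, each named like any other piece of code
my $DIGITS   = qr/ \d+ /xms;
my $SIGN     = qr/ [+-] /xms;
my $EXPONENT = qr/ [Ee] $SIGN? $DIGITS /xms;

# The composed pattern, laid out and commented like a subroutine body
my $NUMBER = qr{
    $SIGN?                        # optional leading sign
    (?: $DIGITS (?: [.] \d* )?    # digits, optionally followed by a fraction
      | [.] $DIGITS               # or a bare fraction such as .5
    )
    $EXPONENT?                    # optional exponent
}xms;

# Wrap the pattern in a real subroutine, so it has a name and can be tested
sub is_number {
    my ($text) = @_;
    return $text =~ m/\A $NUMBER \z/xms ? 1 : 0;
}
```

Under these assumptions, is_number('-3.14e+10') and is_number('.5') both return 1, whereas is_number('abc') and is_number('1e') return 0. Each component can also be reused or checked on its own, which is much harder to do with a single monolithic pattern.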