
12.2. Line Boundaries

Always use the /m flag.

In addition to always using the /x flag, always use the /m flag. In every regular expression you ever write.

The normal behaviour of the ^ and $ metacharacters is unintuitive to most programmers, especially if they're coming from a Unix background. Almost all of the Unix utilities that feature regular expressions (e.g., sed, grep, awk) are intrinsically line-oriented. So in those utilities, ^ and $ naturally mean "match at the start of any line" and "match at the end of any line", respectively.

But they don't mean that in Perl.

In Perl, ^ and $ mean "match at the start of the entire string" and "match at the end of the entire string". That's a crucial difference, and one that leads to a very common type of mistake:

    
    # Find the end of a Perl program...

    $text =~ m{ [^\0]*?       # match the minimal number of non-null chars
                ^__END__$     # until a line containing only an end-marker
              }x;

In fact, what that code really does is:

    $text =~ m{ [^\0]*?       # match the minimal number of non-null chars
                ^             # until the start of the string
                __END__       # then match the end-marker
                $             # then match the end of the string
              }x;

The minimal number of characters until the start of the string is, of course, zero[*]. Then the regex has to match '__END__'. And then it has to be at the end of the string. So the only strings that this pattern matches are those that consist of nothing but '__END__' (optionally followed by a single trailing newline, which $ tolerates). That is clearly not what was intended.

[*] "What part of 'the start' don't you understand???"
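A quick way to convince yourself of this is to run the pattern against both kinds of string. The following is only a minimal sketch: the sample text in $program and the variable names are invented for illustration, and the pattern has been compressed onto one line.

```perl
use strict;
use warnings;

# Hypothetical sample input: a tiny program followed by an end-marker line
my $program = "print 'hello';\n__END__\nsome trailing data\n";

# The same pattern as above, still without /m
my $end_marker = qr{ [^\0]*? ^__END__$ }x;

my $matches_program = $program  =~ $end_marker ? 1 : 0;
my $matches_bare    = '__END__' =~ $end_marker ? 1 : 0;

print "multi-line program: $matches_program\n";   # prints 0
print "bare end-marker:    $matches_bare\n";      # prints 1
```

The multi-line text fails because ^ can only match at offset zero, and the text doesn't start with the end-marker.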

The /m mode makes ^ and $ work "naturally"[†]. Under /m, ^ no longer means "match at the start of the string"; it means "match at the start of any line". Likewise, $ no longer means "at end of string"; it means "at end of any line".

[†] That is, it makes them work in the unnatural way in which most programmers think they work.
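The difference is easy to see by applying one pattern to one multi-line string, with and without the flag. This is a small illustrative sketch; the three-line string in $text and the array names are made up for the purpose.

```perl
use strict;
use warnings;

# Hypothetical three-line string, used only for illustration
my $text = "alpha\nbeta\ngamma\n";

# Without /m, ^ matches only at the very start of the string,
# and $ only at the very end (or just before a final newline)
my @without_m = $text =~ m{ ^ (\w+) $ }gx;

# With /m, ^ and $ match at the start and end of every line
my @with_m = $text =~ m{ ^ (\w+) $ }gxm;

print scalar(@without_m), " match(es) without /m\n";        # 0 match(es) without /m
print scalar(@with_m),    " match(es) with /m: @with_m\n";  # 3 match(es) with /m: alpha beta gamma
```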

The previous example could be fixed by making those two metacharacters actually mean what the original developer thought they meant, simply by adding a /m:


    

    # Find the end of a Perl program...

    $text =~ m{ [^\0]*?       # any non-nulls
                ^__END__$     # until an end-marker line
              }xm;

Which now really means:


    $text =~ m{ [^\0]*?       # match the minimal number of chars
                ^             # until the start of any line (/m mode)
                __END__       # then match the end-marker
                $             # then match the end of a line (/m mode)
              }xm;
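As a rough sketch of how the corrected pattern might be put to work, the version below adds a capturing group around the non-null characters so that the code preceding the end-marker can be extracted. The capture, the sample text, and the $code variable are assumptions added for illustration; they are not part of the original example.

```perl
use strict;
use warnings;

# Hypothetical sample input: program text, an end-marker line, then data
my $text = "use strict;\nprint 'hello';\n__END__\nnot code\n";

# Capture everything before the first line consisting solely of the end-marker
my ($code) = $text =~ m{ ( [^\0]*? )   # minimal run of non-null chars (captured)
                         ^__END__$     # up to an end-marker line
                       }xm;

print defined $code ? $code : "no end-marker found\n";
```

With the sample text above, $code ends up holding the two lines of program text, and nothing after the marker.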

Consistently using /m on every regex makes Perl's behaviour consistently conform to your unreasonable expectations. So you don't have to unreasonably change your expectations to conform to Perl's behaviour[*].

[*] In Maxims for Revolutionists (1903), George Bernard Shaw observed: "The reasonable man adapts himself to the world; the unreasonable one persists in trying to adapt the world to himself. Therefore all progress depends on the unreasonable man." That is an equally deep and powerful approach to programming.
