
12.11. Properties

Prefer properties to enumerated character classes.

Explicit character classes are frequently used to match character ranges, especially alphabetics. For example:

    # Alphabetics-only identifier...
    Readonly my $ALPHA_IDENT => qr/ [A-Z] [A-Za-z]* /xms;

However, a character class like that doesn't actually match all possible alphabetics. It matches only ASCII alphabetics. It won't recognize the common Latin-1 variants, let alone the full gamut of Unicode alphabetics.
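The limitation is easy to check for yourself. The following is only a minimal sketch: the helper name `is_ascii_ident` is invented for the illustration, the pattern is held in a plain lexical rather than a `Readonly` so that the script needs no CPAN modules, and the source file is assumed to be saved as UTF-8.

```perl
use strict;
use warnings;
use utf8;    # the string literals below contain non-ASCII characters

# Same pattern as above, in a plain lexical to avoid depending on Readonly
my $ALPHA_IDENT = qr/ [A-Z] [A-Za-z]* /xms;

# Hypothetical helper: true if the entire string is an ASCII-only identifier
sub is_ascii_ident {
    my ($word) = @_;
    return $word =~ m/\A $ALPHA_IDENT \z/xms ? 1 : 0;
}

binmode STDOUT, ':encoding(UTF-8)';
for my $word ('Declasse', 'Déclassé', 'Überhacking') {
    printf "%-12s %s\n", $word, is_ascii_ident($word) ? 'matches' : 'rejected';
}
```

Only the unaccented 'Declasse' gets through. The other two words are rejected because 'é' and 'Ü' fall outside the A-Z and a-z ranges.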

That result might be okay, if you're sure your data will never be other than parochial, but in today's post-modern, multicultural, outsourced world it's rather déclassé for an überhacking rōnin to create identifier regexes that won't even match 'déclassé' or 'überhacking' or 'rōnin'.

Regular expressions in Perl 5.6 and later[*] support the use of the \p{...} escape, which allows you to use full Unicode properties. Properties are Unicode-compliant named character classes and are both more general and more self-documenting than explicit ASCII character classes. The perlunicode manpage explains the mechanism in detail and lists the available properties.

[*] Perl's Unicode support was still highly experimental in the 5.6 releases, and has improved considerably since then. If you're intending to make serious use of Unicode in production code, you really need to be running the latest 5.8.X release you can, and at the very least Perl 5.8.1.

So, if you're ready to concede that ASCII-centrism is a naïve façade that's gradually fading into Götterdämmerung, you might choose to bid it adiós and open your regexes to the full Unicode smörgåsbord, by changing the previous identifier regex to:

    Readonly my $ALPHA_IDENT => qr/ \p{Uppercase}  \p{Alphabetic}* /xms;
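As a rough check of what the property-based pattern accepts, here is a small sketch under the same assumptions as before: a plain lexical stands in for the `Readonly` constant, the helper name `is_alpha_ident` is hypothetical, and the source file is saved as UTF-8.

```perl
use strict;
use warnings;
use utf8;    # the string literals below contain non-ASCII characters

# Property-based pattern from the text, in a plain lexical (no Readonly needed)
my $ALPHA_IDENT = qr/ \p{Uppercase} \p{Alphabetic}* /xms;

# Hypothetical helper: true if the entire string is an alphabetics-only identifier
sub is_alpha_ident {
    my ($word) = @_;
    return $word =~ m/\A $ALPHA_IDENT \z/xms ? 1 : 0;
}

binmode STDOUT, ':encoding(UTF-8)';
for my $word ('Déclassé', 'Überhacking', 'Rōnin', 'rōnin', 'Item42') {
    printf "%-12s %s\n", $word, is_alpha_ident($word) ? 'matches' : 'rejected';
}
```

The three capitalized words match. The lowercase 'rōnin' is still rejected, because the pattern insists on an uppercase first character, and 'Item42' is rejected because digits are not alphabetics.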

There are even properties to help create identifiers that follow the normal Perl conventions but are still language-independent. Instead of:

    Readonly my $PERL_IDENT => qr/ [A-Za-z_] \w*/xms;

you can use:

    Readonly my $PERL_IDENT => qr/ \p{ID_Start} \p{ID_Continue}* /xms;
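The two patterns are not exact equivalents, and a short side-by-side comparison makes the difference visible. This is a sketch only: the variable names `$ASCII_IDENT` and `$UNICODE_IDENT` and both helper functions are invented for the illustration, and the source file is assumed to be saved as UTF-8.

```perl
use strict;
use warnings;
use utf8;    # the string literals below contain non-ASCII characters

# The two patterns from the text, in plain lexicals (no Readonly needed)
my $ASCII_IDENT   = qr/ [A-Za-z_] \w* /xms;
my $UNICODE_IDENT = qr/ \p{ID_Start} \p{ID_Continue}* /xms;

# Hypothetical helpers: true if the entire string matches the given style
sub is_ascii_style_ident {
    my ($name) = @_;
    return $name =~ m/\A $ASCII_IDENT \z/xms ? 1 : 0;
}

sub is_unicode_style_ident {
    my ($name) = @_;
    return $name =~ m/\A $UNICODE_IDENT \z/xms ? 1 : 0;
}

binmode STDOUT, ':encoding(UTF-8)';
for my $name ('count', 'élan_vital', '_private', '2fast') {
    printf "%-12s ascii-style: %-3s  unicode-style: %s\n",
        $name,
        is_ascii_style_ident($name)   ? 'yes' : 'no',
        is_unicode_style_ident($name) ? 'yes' : 'no';
}
```

Note one difference in behaviour: the underscore belongs to ID_Continue but not to ID_Start, so '_private' satisfies the ASCII-style pattern but not the property-based one. If leading underscores matter in your code, one option is to start the pattern with `(?: _ | \p{ID_Start} )` instead.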

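One other particularly useful property is \p{Any}, which provides a more readable alternative to the normal dot (.) metacharacter when the /s flag is in effect (as it is under /xms), since both then match any character at all, including a newline. For example, instead of: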

    m/ [{] . [.] \d{2} [}] /xms;

you could write:

    m/ [{] \p{Any} [.] \d{2} [}] /xms;

and leave the reader in no doubt that the second character to be matched really can be anything at all: an ASCII alphabetic, a Latin-1 superscript, an Extended Latin diacritical, a Devanagari number, an Ogham rune, or even a Bopomofo symbol.
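To see that \p{Any} really does mean "anything", including a newline, consider the following sketch. Both helper names are hypothetical, the second one deliberately omits the /s flag for contrast, and the source file is assumed to be saved as UTF-8.

```perl
use strict;
use warnings;
use utf8;    # the string literals below contain non-ASCII characters

# Hypothetical helper: the \p{Any} version of the pattern from the text
sub matches_with_any {
    my ($text) = @_;
    return $text =~ m/ [{] \p{Any} [.] \d{2} [}] /xms ? 1 : 0;
}

# Hypothetical helper: the dot version, deliberately WITHOUT the /s flag
sub matches_with_dot_no_s {
    my ($text) = @_;
    return $text =~ m/ [{] . [.] \d{2} [}] /xm ? 1 : 0;
}

binmode STDOUT, ':encoding(UTF-8)';
for my $text ('{x.42}', '{².07}', "{\n.99}", '{xy.42}') {
    (my $shown = $text) =~ s/\n/\\n/gxms;    # make the newline visible
    printf "%-10s any: %-3s  dot without /s: %s\n",
        $shown,
        matches_with_any($text)      ? 'yes' : 'no',
        matches_with_dot_no_s($text) ? 'yes' : 'no';
}
```

The string containing a newline is accepted by the \p{Any} version but not by a dot that lacks /s, which is one more reason the property states the intent more plainly. The last string fails both ways, because exactly one character is allowed between the brace and the literal dot.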
