12.12. Whitespace

Consider matching arbitrary whitespace, rather than specific whitespace characters.

Unless you're matching regular expressions against fixed-format machine-generated data, avoid matching specific whitespace characters exactly. If humans were directly involved anywhere in the data acquisition, the notion of "fixed" will probably have been more honoured in the breach than in the observance.

If, for example, the input is supposed to consist of a label, followed by a single space, followed by an equals sign, followed by a single space, followed by a value...don't bet on it. Most users nowadays will, quite reasonably, assume that whitespace is negotiable; nothing more than an elastic formatting medium. So, in a configuration file, you're just as likely to get something like:

    name       = Yossarian, J
    rank       = Captain
    serial_num = 3192304

The whitespaces in that data might be single tabs, multiple tabs, multiple spaces, single spaces, or any combination thereof. So matching that data with a pattern that insists on exactly one space character at the relevant points is unlikely to be uniformly successful:

    $config_line =~ m{ ($IDENT)  [\N{SPACE}]  =  [\N{SPACE}]  (.*) }xms

Worse still, it's also unlikely to be uniformly unsuccessful. For instance, in the example data, it might only match the serial number. And that kind of intermittent success will make your program much harder to debug. It might also make it difficult to realize that any debugging is required.
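
To make that intermittent behaviour concrete, here is a minimal sketch that runs the strict pattern over the three example lines. The definition of $IDENT is an assumption (this section doesn't show one), as is the choice of single spaces for the padding; only the pattern itself comes from the text above.

```perl
use strict;
use warnings;
use charnames ':full';   # Makes \N{SPACE} available on Perls older than 5.16

# Assumption: a stand-in for $IDENT, which this section does not define
my $IDENT = qr{ [^\W\d] \w* }xms;

# The example data, padded with ordinary space characters
my @config_lines = (
    'name       = Yossarian, J',
    'rank       = Captain',
    'serial_num = 3192304',
);

# Collect the labels of the lines that the strict pattern accepts
my @matched_labels;
for my $config_line (@config_lines) {
    if ($config_line =~ m{ ($IDENT)  [\N{SPACE}]  =  [\N{SPACE}]  (.*) }xms) {
        push @matched_labels, $1;
    }
}

print "Matched: @matched_labels\n";
```

Under those assumptions, only the serial_num line is accepted, because it is the only line with exactly one space on each side of the equals sign. The other two lines are skipped without any error or warning.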

Unless you're specifically vetting data to verify that it conforms to a required fixed format, it's much better to be very liberal in what you accept when it comes to whitespace. Use \s+ for any required whitespace and \s* for any optional whitespace. For example, it would be far more robust to match the example data against:

    $config_line =~ m{ ($IDENT)  \s*  =  \s*  (.*) }xms
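
As a rough illustration, the following sketch applies the liberal pattern to lines whose whitespace varies: multiple spaces, tabs, and none at all. The $IDENT definition, the %config hash, and the particular whitespace variations are assumptions made for the example, not part of the original data.

```perl
use strict;
use warnings;

# Assumption: a stand-in for $IDENT, which this section does not define
my $IDENT = qr{ [^\W\d] \w* }xms;

# The same three settings, each with different whitespace around the '='
my @config_lines = (
    "name       = Yossarian, J",    # Multiple spaces, then a single space
    "rank\t=\tCaptain",             # Single tabs
    "serial_num=3192304",           # No whitespace at all
);

# Hypothetical container for the parsed label/value pairs
my %config;
for my $config_line (@config_lines) {
    if ($config_line =~ m{ ($IDENT)  \s*  =  \s*  (.*) }xms) {
        $config{$1} = $2;
    }
}

for my $label (sort keys %config) {
    print "$label => '$config{$label}'\n";
}
```

All three lines are accepted here, and the whitespace around the equals sign never ends up in the captured label or value.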
