Section 8.5. Separated Data

Use split to extract simple variable-width fields.

For data laid out in fields of varying width, with defined separators (such as tabs or commas) between the fields, the most efficient way to extract those fields is to use split. For example, if a single comma is the field separator:


    
    # Specify field separator...
    Readonly my $RECORD_SEPARATOR => q{,};
    Readonly my $FIELD_COUNT      => 3;

    # Grab each line/record...
    while (my $record = <$sales_data>) {
        chomp $record;

        # Extract all fields...
        my ($ident, $sales, $price)
            = split $RECORD_SEPARATOR, $record, $FIELD_COUNT+1;

        # Append each record, translating ID codes and
        # normalizing sales (which are stored in 1000s)...
        push @sales, {
            ident => translate_ID($ident),
            sales => $sales * 1000,
            price => $price,
        };
    }

Note the use of the third argument to split. Typically, split is called with only two arguments: the separator itself ($RECORD_SEPARATOR), and then the string from which the fields are to be split out ($record). If a third argument is provided, however, it specifies the maximum number of pieces that split should return. Once that limit is reached, split stops looking for separators, and the final piece contains whatever remains of the string, unsplit.
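The effect of the limit can be seen in a small sketch. The sample record and the variable names here are illustrative assumptions, not part of the sales example above:

```perl
use strict;
use warnings;

# Hypothetical record with five comma-separated fields
my $record = 'X123,12,9.99,north,2004';

# No limit: split breaks the string at every comma
my @all_fields = split /,/, $record;

# Limit of 4: at most four pieces; the last one holds the unsplit remainder
my @limited = split /,/, $record, 4;

print scalar(@all_fields), " pieces without a limit\n";
print scalar(@limited),    " pieces with a limit of 4\n";
print "Last limited piece: $limited[-1]\n";
```

With the limit in place, the text after the third comma comes back as a single piece rather than being broken up further.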

It's good practice to always provide this extra information if it's known, because otherwise split splits its input as many times as possible, builds a (potentially very long) list of the results, and returns it. The assignment would then throw away all but the first three elements of the returned list, so it's a (potentially very expensive) waste of time to create them in the first place.

In some circumstances (notably when the result of the split is assigned directly to a list of scalar variables), the optimizer can work out how many return values you were expecting, and will automatically supply the third argument itself. However, being explicit is still the better practice, because your code will stay efficient when someone later modifies your statement to something that isn't automatically optimized.

It can also be useful to capture the "residue" that's left after you've split out the fields you expected. For example, to warn about suspect records:


    my ($ident, $sales, $price, $unexpected_data)
            = split $RECORD_SEPARATOR, $record, $FIELD_COUNT+1;

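    # Test definedness and length, so that a residue of "0" is still reported...
    carp "Unexpected trailing garbage at end of record id '$ident':\n",
         "\t$unexpected_data\n"
             if defined $unexpected_data && length $unexpected_data;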
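As a rough, self-contained sketch of the same idea, the residue check might be wrapped in a small helper. The name residue_of, the hard-coded comma separator, and the sample records are assumptions made for illustration only:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the leftover text after the first
# $field_count fields, or undef if the record has nothing extra
sub residue_of {
    my ($record, $field_count) = @_;

    # Ask for one more piece than the number of expected fields
    my @parts = split /,/, $record, $field_count + 1;

    return $parts[$field_count];
}

my $clean   = residue_of('X123,12,9.99',          3);
my $suspect = residue_of('X123,12,9.99,oops,???', 3);

print defined $clean   ? "residue: $clean\n"   : "no residue\n";
print defined $suspect ? "residue: $suspect\n" : "no residue\n";
```

A record with exactly three fields yields no fourth piece at all, so the helper returns undef for it, whereas the suspect record hands back everything after the third comma in one string.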

Using the third argument is highly recommended, but caution is also required. A common error here is to use the actual number of fields you want as the third argument:

    my ($ident, $sales, $price)
            = split $RECORD_SEPARATOR, $record, $FIELD_COUNT;

instead of that number plus one. If you're trying to extract the first three fields of each record, the limit passed to split has to be four, because you need to break the record into four parts: the first three fields (which will be captured in the variables) plus the remainder of the string (which will be ignored). Using $FIELD_COUNT instead of $FIELD_COUNT+1 tells split to return at most three pieces, so it would break $record only twice and return the resulting three substrings: ID, sales, and price-plus-whatever-else-followed-it-in-the-original-string.
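The difference between the two limits is easy to check side by side. This is a minimal sketch: the sample record is hypothetical, and a plain lexical stands in for the Readonly constant so that the snippet needs only core Perl:

```perl
use strict;
use warnings;

my $FIELD_COUNT = 3;                           # plain lexical, not Readonly (assumption)
my $record      = 'X123,12,9.99,north,2004';   # hypothetical record with extra fields

# Wrong: a limit of 3 leaves the remainder glued to the third field
my ($ident_bad, $sales_bad, $price_bad)
    = split /,/, $record, $FIELD_COUNT;

# Right: a limit of 4 isolates the remainder in a fourth, discarded piece
my ($ident_ok, $sales_ok, $price_ok)
    = split /,/, $record, $FIELD_COUNT + 1;

print "limit 3: price = '$price_bad'\n";
print "limit 4: price = '$price_ok'\n";
```

With the smaller limit, the price variable ends up holding the price and everything that followed it; with the larger limit, it holds the price alone.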