B.6 PHP's PCRE Functions

Use the functions in PHP's PCRE extension to work with regular expressions in your programs. These functions let you match a string against a pattern and alter a string based on how it matches a pattern. When you pass a pattern to one of the PCRE functions, it must be enclosed in delimiters. Traditionally, the delimiters are slashes, but you can use any character that's not a letter, number, backslash, or whitespace as a delimiter. If the character you choose as a delimiter appears in the pattern, it must be backslash-escaped in the pattern, so you should only use a nonslash delimiter when a slash is in your pattern.
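
As a minimal sketch of the delimiter rule, the following code writes the same pattern twice: once with slash delimiters, where each slash inside the pattern needs a backslash, and once with @ delimiters, where it doesn't. The $path value is a made-up sample, not something taken from elsewhere in this appendix.

```php
<?php
// Hypothetical sample value, for illustration only
$path = '/usr/local/bin/php';

// With slash delimiters, every slash inside the pattern is backslash-escaped
$with_slashes = preg_match('/^\/usr\/local\//', $path);

// With @ as the delimiter, the slashes in the pattern need no escaping
$with_at = preg_match('@^/usr/local/@', $path);

print "Slash delimiters: $with_slashes\n";
print "At-sign delimiters: $with_at\n";
```

Both calls test for the same thing, so both return 1 for this sample value. The second pattern is just easier to read.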

After the closing delimiter, you can add one or more pattern modifiers to change how the pattern is interpreted. These modifiers are listed at http://www.php.net/pcre.pattern.modifiers. One handy modifier is i, which makes the pattern matching case-insensitive. For example, the patterns (with delimiters) /[a-zA-Z]+/ and /[a-z]+/i produce the same results.

Another useful modifier is s, which makes the dot metacharacter match newlines. The pattern (with delimiters) @<b>.*?</b>@ matches a set of <b></b> tags and the text between them, but only if that text is all on one line. To match text that may include newlines, use the s modifier:

@<b>.*?</b>@s
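
To see the two modifiers side by side, here is a small sketch that runs the same pattern with no modifiers, with i alone, and with i and s together. The $html string is an invented sample with uppercase tags and a newline between them.

```php
<?php
// Hypothetical sample text: uppercase tags with a newline between them
$html = "<B>Two\nlines</B>";

// No modifiers: <b> does not match <B>, so there is no match
$plain = preg_match('@<b>.*?</b>@', $html);

// i alone: the tags match, but the dot cannot get past the newline
$insensitive = preg_match('@<b>.*?</b>@i', $html);

// i and s together: the tags match and the dot matches the newline
$both = preg_match('@<b>.*?</b>@is', $html);

print "No modifiers: $plain\n";
print "i only: $insensitive\n";
print "i and s: $both\n";
```

Only the last call returns 1. Modifiers can be combined by listing them one after another following the closing delimiter.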

B.6.1 Matching

The preg_match( ) function tests whether a string matches a pattern. Pass it the pattern and the string to test as arguments. It returns 1 if the string matches the pattern and 0 if it doesn't. Example B-2 demonstrates preg_match( ).

Example B-2. Matching with preg_match( )
// Test the value of $_POST['zip'] against the
// pattern ^\d{5}(-\d{4})?$
if (preg_match('/^\d{5}(-\d{4})?$/',$_POST['zip'])) {
    print $_POST['zip'] . ' is a valid US ZIP Code';
}

// Test the value of $html against the pattern <b>[^<]+</b>
// The delimiter is @ since / occurs in the pattern
$is_bold = preg_match('@<b>[^<]+</b>@',$html);

A set of parentheses in a pattern captures what matches the part of the pattern inside the parentheses. To access these captured strings, pass an array to preg_match( ) as a third argument. The captured strings are put into the array. The first element of the array (element 0) contains the string that matches the entire pattern, and subsequent array elements contain the strings that match the parts of the pattern in each set of parentheses. Example B-3 shows how to use preg_match( ) with capturing.

Example B-3. Capturing with preg_match( )
// Test the value of $_POST['zip'] against the
// pattern ^(\d{5})(-\d{4})?$
if (preg_match('/^(\d{5})(-\d{4})?$/',$_POST['zip'],$matches)) {
    // $matches[0] contains the entire ZIP Code
    print "$matches[0] is a valid US ZIP Code\n";
    // $matches[1] contains the five-digit part inside the first
    // set of parentheses
    print "$matches[1] is the five-digit part of the ZIP Code\n";
    // If they were present in the string, the hyphen and ZIP+4 digits
    // are in $matches[2]
    if (isset($matches[2])) {
        print "The ZIP+4 is $matches[2]";
    } else {
        print "There is no ZIP+4";
    }
}

// Test the value of $html against the pattern <b>([^<]+)</b>
// The delimiter is @ since / occurs in the pattern
$is_bold = preg_match('@<b>([^<]+)</b>@',$html,$matches);
if ($is_bold) {
    // $matches[1] contains what's inside the bold tags
    print "The bold text is: $matches[1]";
}

Each bit of text that matches the parts of the pattern in each set of parentheses goes into its own element in $matches. The parentheses map to array elements in order of the opening parentheses from left to right. Example B-4 uses preg_match( ) with nested parentheses to illustrate how the captured strings are put into $matches.

Example B-4. Capturing with nested parentheses
if (preg_match('/^(\d{5})(-(\d{4}))?$/',$_POST['zip'],$matches)) {
    print "The beginning of the ZIP Code is: $matches[1]\n";
    // $matches[2] contains what's in the second set of parentheses:
    // The hyphen and the last four digits
    // $matches[3] contains just the last four digits
    if (isset($matches[2])) {
        print "The ZIP+4 is: $matches[3]";
    }
}

If $_POST['zip'] is 19096-2321, Example B-4 prints:

The beginning of the ZIP Code is: 19096
The ZIP+4 is: 2321
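
The isset( ) test matters when the optional part of the pattern doesn't take part in the match. The following sketch runs the pattern from Example B-4 against a five-digit ZIP Code. It uses a plain $zip variable with a made-up value in place of $_POST['zip'] so that it can run on its own.

```php
<?php
// Hypothetical input: a ZIP Code with no ZIP+4 part
$zip = '19096';

if (preg_match('/^(\d{5})(-(\d{4}))?$/', $zip, $matches)) {
    // The optional groups did not take part in the match, so
    // $matches holds only elements 0 and 1
    print 'Elements in $matches: ' . count($matches) . "\n";
    if (isset($matches[3])) {
        print "The ZIP+4 is: $matches[3]\n";
    } else {
        print "There is no ZIP+4\n";
    }
}
```

With this input, $matches has two elements and there is no $matches[2] or $matches[3] to read, so the else branch runs.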

A companion to preg_match( ) is preg_match_all( ). While preg_match( ) just matches a pattern against a string once, preg_match_all( ) matches a pattern against a string as many times as the pattern allows and returns the number of times it matched. Example B-5 illustrates the difference between the two functions.

Example B-5. Matching with preg_match_all( )
$html = <<<_HTML_
<ul>
<li>Beef Chow-Fun</li>
<li>Sauteed Pea Shoots</li>
<li>Soy Sauce Noodles</li>
</ul>
_HTML_;

preg_match('@<li>(.*?)</li>@',$html,$matches);
$match_count = preg_match_all('@<li>(.*?)</li>@',$html,$matches_all);

print "preg_match_all( ) matched $match_count times.\n";

print "preg_match( ) array: ";
var_dump($matches);

print "preg_match_all( ) array: ";
var_dump($matches_all);

Example B-5 prints:

preg_match_all( ) matched 3 times.
preg_match( ) array: array(2) {
  [0]=>
  string(22) "<li>Beef Chow-Fun</li>"
  [1]=>
  string(13) "Beef Chow-Fun"
}
preg_match_all( ) array: array(2) {
  [0]=>
  array(3) {
    [0]=>
    string(22) "<li>Beef Chow-Fun</li>"
    [1]=>
    string(27) "<li>Sauteed Pea Shoots</li>"
    [2]=>
    string(26) "<li>Soy Sauce Noodles</li>"
  }
  [1]=>
  array(3) {
    [0]=>
    string(13) "Beef Chow-Fun"
    [1]=>
    string(18) "Sauteed Pea Shoots"
    [2]=>
    string(17) "Soy Sauce Noodles"
  }
}

The first array printed is the $matches array populated by preg_match( ). Element 0 is the string that matches the entire pattern, and element 1 is the string that is captured by the first set of parentheses. The pattern <li>(.*?)</li> matches an item in an HTML list. With preg_match( ), this pattern just matches the first list item in $html. After finding one successful match, preg_match( ) is done.

The preg_match_all( ) function behaves differently. After matching against the first list item like preg_match( ) does, it tries to match the pattern again, starting at the character after the end of the previous match. This process repeats until preg_match_all( ) is out of characters. Element 0 of the $matches_all array populated by preg_match_all( ) contains an array of entire-pattern matches. The first time through the string, the entire pattern matched <li>Beef Chow-Fun</li>, so that's the first element of this subarray. The second time through, the entire pattern matched <li>Sauteed Pea Shoots</li>, so that's the second element of this subarray, and so on. Element 1 of the $matches_all array contains the strings captured by the first set of parentheses each time through the string: Beef Chow-Fun, Sauteed Pea Shoots, and Soy Sauce Noodles.

There are some flags you can pass to preg_match( ) and preg_match_all( ) that affect how the captured strings are stored in the $matches array. The flags are listed in the PHP Manual at http://www.php.net/preg_match and http://www.php.net/preg_match_all.
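
One of the flags described on the preg_match_all( ) manual page is PREG_SET_ORDER. As a rough sketch of what it changes, the following code passes it as a fourth argument so that each element of $matches holds everything captured by one match, instead of everything captured by one set of parentheses. The shortened $html string is an assumption made to keep the example small.

```php
<?php
// Shortened, hypothetical version of the list from Example B-5
$html = "<li>Beef Chow-Fun</li>\n<li>Sauteed Pea Shoots</li>";

// PREG_SET_ORDER groups the results by match instead of by
// set of parentheses
preg_match_all('@<li>(.*?)</li>@', $html, $matches, PREG_SET_ORDER);

foreach ($matches as $match) {
    // $match[0] is the whole match; $match[1] is the captured text
    print "$match[1]\n";
}
```

This arrangement is convenient when you want to loop over the matches one at a time, as the foreach( ) loop does here.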

Captured text can itself be part of a pattern by using backreferences. These are metacharacters within a pattern that refer to captured strings by number. A backreference is a backslash followed by the number of the captured string. Example B-6 uses a backreference to match starting and ending HTML tags.

Example B-6. Matching using backreferences
$ok_html  = "I <b>love</b> shrimp dumplings.";
$bad_html = "I <b>love</i> shrimp dumplings.";

if (preg_match('@<([bi])>.*?</\1>@',$ok_html)) {
    print "Good for you! (OK, Backreferences)\n";
}
if (preg_match('@<([bi])>.*?</\1>@',$bad_html)) {
    print "Good for you! (Bad, Backreferences)\n";
}
if (preg_match('@<[bi]>.*?</[bi]>@',$ok_html)) {
    print "Good for you! (OK, No backreferences)\n";
}
if (preg_match('@<[bi]>.*?</[bi]>@',$bad_html)) {
    print "Good for you! (Bad, No backreferences)\n";
}

Example B-6 prints:

Good for you! (OK, Backreferences)
Good for you! (OK, No backreferences)
Good for you! (Bad, No backreferences)

The backreferences in the first two patterns ensure that the closing tag matches the opening tag. The b in the opening tag has to match a /b in the closing tag. This is why the OK, Backreferences line prints, but not the Bad, Backreferences line. The $bad_html string doesn't match the backreferences pattern because its tags don't match. The patterns without backreferences match either a <b> or <i> opening tag and either a </b> or </i> closing tag, whether or not the opening and closing tags go together. So, both No backreferences lines are printed.

B.6.2 Replacing

The preg_replace( ) function looks for parts of a string that match a pattern and then replaces those matching parts with new text. Pass preg_replace( ) a pattern, replacement text, and a string to search, as shown in Example B-7. The function returns the changed string.

Example B-7. Replacing with preg_replace( )
$members=<<<TEXT

Name               E-Mail Address
------------------------------------------------
Inky T. Ghost      inky@pacman.example.com
Donkey K. Gorilla  kong@banana.example.com
Mario A. Plumber   mario@franchise.example.org
Bentley T. Bear    bb@xtal-castles.example.net
TEXT;

print preg_replace('/[^@\s]+@([-a-z0-9]+\.)+[a-z]{2,}/',
                   '[ address removed ]', $members);

Example B-7 uses the email address-matching regular expression from Section 6.4.4 to replace email addresses with the string [ address removed ]. It prints:

Name               E-Mail Address
------------------------------------------------
Inky T. Ghost      [ address removed ]
Donkey K. Gorilla  [ address removed ]
Mario A. Plumber   [ address removed ]
Bentley T. Bear    [ address removed ]

You can use backreferences to include captured text in replacement strings. Example B-8 doesn't remove email addresses entirely, but changes the @ to "at".

Example B-8. Replacing using backreferences
$members=<<<TEXT

Name               E-Mail Address
------------------------------------------------
Inky T. Ghost      inky@pacman.example.com
Donkey K. Gorilla  kong@banana.example.com
Mario A. Plumber   mario@franchise.example.org
Bentley T. Bear    bb@xtal-castles.example.net
TEXT;

print preg_replace('/([^@\s]+)@(([-a-z0-9]+\.)+[a-z]{2,})/',
                   '\1 at \2', $members);

Example B-8 prints:

Name               E-Mail Address
------------------------------------------------
Inky T. Ghost      inky at pacman.example.com
Donkey K. Gorilla  kong at banana.example.com
Mario A. Plumber   mario at franchise.example.org
Bentley T. Bear    bb at xtal-castles.example.net

B.6.3 Array Processing

The preg_split( ) function is a souped-up version of the explode( ) function from Chapter 4. With preg_split( ), the delimiter that chops up a string is a regular expression. Use preg_split( ) when you want to break a string apart based on something more complicated than a literal sequence of characters. Example B-9 uses preg_split( ) with a string containing a list of things to eat. The preg_split( ) function is necessary because the things to eat aren't all separated by the same delimiter.

Example B-9. Using preg_split( )
$sea_creatures = "cucumber;jellyfish, conger eel,shrimp, crab roe; bluefish";
// Break apart the string on a comma or semicolon
// followed by an optional space
$creature_list = preg_split('/[,;] ?/',$sea_creatures);
print "Would you like some $creature_list[2]?";

Example B-9 prints:

Would you like some conger eel?

A third argument to preg_split( ) sets a maximum number of elements in the list that gets returned. In Example B-10, $creature_list has only three elements.

Example B-10. Limiting the number of returned elements with preg_split( )
$sea_creatures = "cucumber;jellyfish, conger eel,shrimp, crab roe; bluefish";
// Break apart the string into at most three elements, splitting on
// a comma or semicolon followed by an optional space
$creature_list = preg_split('/[,;] ?/',$sea_creatures, 3);
print "The last element is $creature_list[2]";

When the number of elements is limited, preg_split( ) puts everything extra in the last element. Example B-10 prints:

The last element is conger eel,shrimp, crab roe; bluefish

If there are two successive delimiters in the string, preg_split( ) inserts an empty string into the array that it returns. Usually, you want to tell preg_split( ) not to include empty elements in the array it returns by specifying the constant PREG_SPLIT_NO_EMPTY as a fourth argument. When you do this, you either need to specify a limit as a third argument or pass -1 as the third argument to tell preg_split( ) "no limit." Example B-11 uses this feature to count the words in $text.

Example B-11. Discarding empty elements with preg_split( )
$text=<<<TEXT
"It's time to ring again," said Tom rebelliously.
"I agree! I'll help you," said Jerry resoundingly.
TEXT;

// Get each of the words in $text, but don't put the whitespace and
// punctuation into $words. The -1 for the limit argument means "no limit"
$words = preg_split('/[",.!\s]/', $text, -1, PREG_SPLIT_NO_EMPTY);

print 'There are ' . count($words) .' words in the text.';

Example B-11 prints:

There are 16 words in the text.
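
To make the effect of PREG_SPLIT_NO_EMPTY easier to see, here is a small sketch that splits the same string with and without the flag. The $dishes string is an invented sample with two commas in a row.

```php
<?php
// Hypothetical string with two successive delimiters in the middle
$dishes = 'noodles,,rice';

// Without the flag, the empty string between the two commas is kept
$with_empty = preg_split('/,/', $dishes);

// With PREG_SPLIT_NO_EMPTY (and -1 for "no limit"), it is discarded
$without_empty = preg_split('/,/', $dishes, -1, PREG_SPLIT_NO_EMPTY);

print count($with_empty) . " elements with empty strings kept\n";
print count($without_empty) . " elements with empty strings discarded\n";
```

The first call returns three elements, the middle one an empty string. The second call returns just noodles and rice.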

The preg_grep( ) function finds elements of an array whose values match a regular expression. It returns an array containing just those elements. Example B-12 uses preg_grep( ) to find all of the words from Example B-11 that contain double letters.

Example B-12. Using preg_grep( )
$text=<<<TEXT
"It's time to ring again," said Tom rebelliously.
"I agree! I'll help you," said Jerry resoundingly.
TEXT;

$words = preg_split('/[",.!\s]/', $text, -1, PREG_SPLIT_NO_EMPTY);

// Find words that contain double letters
$double_letter_words = preg_grep('/([a-z])\\1/i',$words);

foreach ($double_letter_words as $word) {
    print "$word\n";
}

Example B-12 prints:

rebelliously
agree
I'll
Jerry
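
Two details of preg_grep( ) that Example B-12 doesn't show are that the returned array keeps the keys of the original array, and that the PREG_GREP_INVERT flag, passed as a third argument, returns the elements that don't match. The following sketch illustrates both with a short, made-up $words array.

```php
<?php
// Hypothetical word list, for illustration only
$words = array('ring', 'agree', 'Tom', 'Jerry');

// preg_grep( ) keeps the original array keys of the matching elements
$double = preg_grep('/([a-z])\1/i', $words);

// PREG_GREP_INVERT returns the elements that do not match instead
$no_double = preg_grep('/([a-z])\1/i', $words, PREG_GREP_INVERT);

print 'Keys of matching words: ' . implode(',', array_keys($double)) . "\n";
print 'Keys of other words: ' . implode(',', array_keys($no_double)) . "\n";
```

Because the keys are preserved, the matching words agree and Jerry keep their original keys, 1 and 3. Pass the result to array_values( ) if you need the keys renumbered from 0.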