B.4 Character Classes

A character class lets you represent a bunch of characters (a "class") as a single item in a regular expression. Put characters in square brackets to make a character class. A character class matches any one of the characters in the class. This pattern matches a person's name or a bird's name:

^D[ao]ve$

The pattern matches Dave or Dove. The character class [ao] matches either a or o.
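One way to check this is to run the pattern against a few candidate strings. The sketch below uses Python's re module purely for illustration. The helper name matches_name and the sample strings are assumptions, not part of the original text.

```python
import re

# ^D[ao]ve$ : "D", then either "a" or "o", then "ve", anchored at both ends.
NAME_PATTERN = re.compile(r"^D[ao]ve$")

def matches_name(text):
    """Hypothetical helper: True if the pattern matches text."""
    return NAME_PATTERN.search(text) is not None

for candidate in ["Dave", "Dove", "Dive", "Daove"]:
    print(candidate, matches_name(candidate))
```

Only Dave and Dove should be reported as matches. Daove fails because a character class consumes exactly one character.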

To put a whole range of characters in a character class, just put the first and last characters in, separated by a hyphen. For instance, to match all English alphabetic characters:

[a-zA-Z]

When you use a hyphen in a character class to represent a range, the character class includes the first and last characters and every character whose ASCII value falls between them. If you want a literal hyphen inside a character class, backslash-escape it or put it first or last in the class. The character class [a-z] is the same as [abcdefghijklmnopqrstuvwxyz], but the character class [a\-z] matches only three characters: a, -, and z.
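To see the difference between a range and an escaped hyphen, you can list which characters each class accepts. This is a minimal sketch in Python. The variable names are illustrative assumptions, and the check only covers the printable ASCII characters in string.printable.

```python
import re
import string

range_class = re.compile(r"[a-z]")             # hyphen defines a range
literal_hyphen_class = re.compile(r"[a\-z]")   # escaped hyphen is a literal

# Collect the printable ASCII characters that each class accepts.
in_range = [c for c in string.printable if range_class.fullmatch(c)]
with_hyphen = [c for c in string.printable if literal_hyphen_class.fullmatch(c)]

print("".join(in_range))
print(sorted(with_hyphen))
```

The first list should contain the 26 lowercase letters, and the second only a, -, and z.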

You can also create a negated character class, which matches any character that is not in the class. To create a negated character class, begin the character class with ^:

// Match everything but letters

[^a-zA-Z]

The character class [^a-zA-Z] matches every character that isn't an English letter: digits, punctuation, whitespace, and control characters. Even though ^ is used as an anchor outside of character classes, its only special meaning inside a character class is negation, and only when it is the first character in the class. If you want to use a literal ^ inside a character class, either don't put it first in the character class or backslash-escape it. Each of these patterns matches the same strings:

[0-9][%^][0-9]

[0-9][\^%][0-9]

Each pattern matches a digit, then either % or ^, then another digit. This matches strings such as 5^5, 3%2, or 1^9.
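The following sketch, written in Python for illustration, exercises the negated class and confirms that the two caret patterns agree on a handful of inputs. The variable names and sample strings are assumptions chosen for the example.

```python
import re

non_letter = re.compile(r"[^a-zA-Z]")             # negated class
caret_last = re.compile(r"[0-9][%^][0-9]")        # ^ not first, so it is literal
caret_escaped = re.compile(r"[0-9][\^%][0-9]")    # ^ escaped, so it is literal

# Keep only the characters that are not English letters.
print([c for c in "a1 Z?" if non_letter.fullmatch(c)])

# Both caret patterns should give the same answer for every sample.
for sample in ["5^5", "3%2", "1^9", "4+4", "a^b"]:
    print(sample, bool(caret_last.search(sample)), bool(caret_escaped.search(sample)))
```

The samples 4+4 and a^b should be rejected by both patterns: the first has neither % nor ^ in the middle, and the second has no digits around the caret.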

Character classes are more efficient than alternation when choosing among single characters. Instead of s(a|o|i)p, which matches sap, sop, and sip, use s[aoi]p.
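A short Python sketch can confirm that the two forms accept the same words. It checks equivalence only and does not measure speed. The word list is an assumption made for the example.

```python
import re

alternation = re.compile(r"s(a|o|i)p")   # also creates a capturing group
char_class = re.compile(r"s[aoi]p")      # no group, one character from the set

words = ["sap", "sop", "sip", "sup", "soap"]
for word in words:
    # fullmatch requires the whole word to match the pattern.
    print(word, bool(alternation.fullmatch(word)), bool(char_class.fullmatch(word)))
```

Both patterns should accept sap, sop, and sip and reject sup and soap.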

Some commonly used character classes are also represented by dedicated metacharacters, which are more concise than specifying every character in the class. These metacharacters are shown in Table B-3.

Table B-3. Character class metacharacters

Metacharacter   Description           Equivalent class
\d              Digits                [0-9]
\D              Non-digits            [^0-9]
\w              Word characters       [a-zA-Z0-9_]
\W              Non-word characters   [^a-zA-Z0-9_]
\s              Whitespace            [ \t\n\r\f]
\S              Non-whitespace        [^ \t\n\r\f]


These metacharacters can be used just like character classes. This pattern matches valid 24-hour clock times with two-digit hours, such as 09:30 or 23:59:

([0-1]\d|2[0-3]):[0-5]\d
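Because the pattern has no anchors, it can also match a time embedded in a longer string. The Python sketch below tests whole strings with fullmatch and then shows the unanchored behavior with search. The helper name is_time is hypothetical, and the re.ASCII flag is an assumption made here so that Python's \d is limited to [0-9], as in Table B-3.

```python
import re

# re.ASCII restricts \d to the digits 0-9.
TIME_PATTERN = re.compile(r"([0-1]\d|2[0-3]):[0-5]\d", re.ASCII)

def is_time(text):
    """Hypothetical helper: True only if the whole string is a 24-hour time."""
    return TIME_PATTERN.fullmatch(text) is not None

for sample in ["00:00", "09:30", "23:59", "24:00", "12:60", "7:15"]:
    print(sample, is_time(sample))

# Without anchors, the pattern finds a time inside a longer string.
print(TIME_PATTERN.search("123:456").group(0))
```

If only complete strings should be accepted, anchoring the pattern with ^ and $ has much the same effect as fullmatch here.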

You can also include these metacharacters inside a character class with other characters. This pattern matches hexadecimal numbers:

[\da-fA-F]+
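As a rough check of the hexadecimal pattern, the Python sketch below tests a few strings. The helper name is_hex and the samples are assumptions, and re.ASCII is used so that \d means [0-9] only.

```python
import re

# \d inside the class contributes 0-9; a-f and A-F add the hex letters.
HEX_PATTERN = re.compile(r"[\da-fA-F]+", re.ASCII)

def is_hex(text):
    """Hypothetical helper: True if text consists only of hex digits."""
    return HEX_PATTERN.fullmatch(text) is not None

for sample in ["ff", "DEADbeef", "0123456789", "xyz", "12g4", ""]:
    print(repr(sample), is_hex(sample))
```

The empty string is rejected because + requires at least one character from the class.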
