Regular Expressions

One of the lesser-known but potentially most useful classes in the .NET Framework class library is Regex, which belongs to the System.Text.RegularExpressions namespace. Regex represents regular expressions, which are a language for parsing and manipulating text. (A full treatment of the language is beyond the scope of this book, but wonderful tutorials are available both in print and online.) Regex supports three basic types of operations:

Splitting strings into substrings using regular expressions to identify separators
Searching strings for substrings that match patterns in regular expressions
Performing search-and-replace operations using regular expressions to identify the text you want to replace

One very practical use for regular expressions is to validate user input. It’s trivial, for example, to use a regular expression to verify that a string entered into a credit card field conforms to a pattern that’s consistent with credit card numbers—that is, digits possibly separated by hyphens. You’ll see an example of such usage in a moment.

Another common use for regular expressions is to do screen scraping. Say you want to write an app that displays stock prices gathered from a real-time (or near real-time) data source. One approach is to send an HTTP request to a Web site such as Nasdaq.com and “screen scrape” the prices from the HTML returned in the response. Regex simplifies the task of parsing HTML. The downside to screen scraping, of course, is that your app may cease to work if the format of the data changes. (I know because I once wrote an app that used screen scraping to fetch stock prices, and the day after I published it, my data source changed the HTML format of its Web pages.) But unless you can find a data source that provides the information you want as XML, screen scraping might be your only choice.

When you create a Regex object, you pass to the class constructor the regular expression to encapsulate:

Regex regex = new Regex ("[a-z]");

In the language of regular expressions, “[a-z]” means any lowercase letter of the alphabet. You can also pass a second parameter specifying Regex options. For example, the statement

Regex regex = new Regex ("[a-z]", RegexOptions.IgnoreCase);

creates a Regex object that matches any letter of the alphabet without regard to case. If the regular expression passed to the Regex constructor is invalid, Regex throws an ArgumentException.
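As a minimal sketch of the constructor behavior just described, the following program builds the same pattern with and without RegexOptions.IgnoreCase and then passes a deliberately malformed pattern to show the exception. The class name RegexCtorDemo, the helper TryCreate, and the sample strings are hypothetical, chosen only for illustration. The sketch borrows IsMatch, which is covered later in this section.

```csharp
using System;
using System.Text.RegularExpressions;

class RegexCtorDemo
{
    // Hypothetical helper: returns a Regex, or null if the pattern is invalid
    public static Regex TryCreate (string pattern, RegexOptions options)
    {
        try {
            return new Regex (pattern, options);
        }
        catch (ArgumentException) {
            // Thrown when the pattern cannot be parsed
            return null;
        }
    }

    static void Main ()
    {
        Regex lower = TryCreate ("[a-z]", RegexOptions.None);
        Regex any = TryCreate ("[a-z]", RegexOptions.IgnoreCase);

        Console.WriteLine (lower.IsMatch ("ABC"));   // False
        Console.WriteLine (any.IsMatch ("ABC"));     // True

        // "[a-z" is missing its closing bracket, so construction fails
        Console.WriteLine (TryCreate ("[a-z", RegexOptions.None) == null); // True
    }
}
```

Catching ArgumentException close to the constructor call is one reasonable choice when the pattern comes from user input, as it does in a command-line tool.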

Once a Regex object is initialized, you call methods on it to apply the regular expression to strings of text. The following sections describe how to put Regex to work in managed applications and offer examples regarding its use.

Splitting Strings

Regex.Split splits strings into constituent parts by using a regular expression to identify separators. Here’s an example that divides a path name into drive and directory names:

Regex regex = new Regex (@"\\");
string[] parts = regex.Split (@"c:\inetpub\wwwroot\wintellect");
foreach (string part in parts)
    Console.WriteLine (part);

And here’s the output:

c:
inetpub
wwwroot
wintellect

Notice the double backslash passed to Regex’s constructor. The @ preceding the string literal prevents you from having to escape the backslash for the compiler’s sake, but because the backslash is also an escape character in regular expressions, you have to escape a backslash with a backslash to form a valid regular expression.

The fact that Split identifies separators using full-blown regular expressions makes for some interesting possibilities. For example, suppose you wanted to extract the text from the following HTML by stripping out everything in angle brackets:

<b>Every</b>good<h3>boy</h3>does<b>fine</b>

Here’s the code to do it:

Regex regex = new Regex ("<[^>]*>");
string[] parts =
    regex.Split ("<b>Every</b>good<h3>boy</h3>does<b>fine</b>");
foreach (string part in parts)
    Console.WriteLine (part);

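And here’s the output (because the input begins and ends with a tag, Split also returns an empty string at each end of the array, and the blank lines they produce are omitted here):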

Every
good
boy
does
fine

The regular expression “<[^>]*>” means anything that begins with an opening angle bracket (“<”), followed by zero or more characters that are not closing angle brackets (“[^>]*”), followed by a closing angle bracket (“>”).

With Regex.Split to lend a hand, you could simplify this chapter’s WordCount utility considerably. Rather than having the GetWords method manually parse a line of text into words, you could rewrite GetWords to split the line using a regular expression that identifies sequences of one or more nonalphanumeric characters as separators. Then you could delete the GetNextWord method altogether.
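One way such a rewrite might look is sketched below. The GetWords shown here is a hypothetical stand-in: the signature of the real WordCount method may differ, and the class name SplitWordsDemo and the sample sentence are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class SplitWordsDemo
{
    // One or more consecutive characters that are not letters or digits
    static readonly Regex separators = new Regex ("[^A-Za-z0-9]+");

    // Hypothetical stand-in for WordCount's GetWords
    public static string[] GetWords (string line)
    {
        List<string> words = new List<string> ();
        foreach (string part in separators.Split (line)) {
            // Split yields an empty string when the line begins or
            // ends with a separator, so skip those entries
            if (part.Length > 0)
                words.Add (part);
        }
        return words.ToArray ();
    }

    static void Main ()
    {
        foreach (string word in GetWords ("Every good boy, does fine."))
            Console.WriteLine (word);
    }
}
```

Filtering out empty entries matters here because most lines of prose end with punctuation, which Split treats as a trailing separator.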

Searching Strings

Perhaps the most common use for Regex is to search strings for substrings matching a specified pattern. Regex includes three methods for searching strings and identifying the matches: Match, Matches, and IsMatch.

The simplest of the three is IsMatch. It provides a simple yes or no answer revealing whether an input string contains a match for the text represented by a regular expression. Here’s a sample that checks an input string for HTML anchor tags (<a>):

Regex regex = new Regex ("<a[^>]*>", RegexOptions.IgnoreCase);
if (regex.IsMatch (input)) {
    // Input contains an anchor tag
}
else {
    // Input does NOT contain an anchor tag
}

Another use for IsMatch is to validate user input. The following method returns true if the input string contains 16 digits grouped into fours separated by hyphens, and false if it does not:

bool IsValid (string input)
{
    Regex regex = new Regex ("^[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4}$");
    return regex.IsMatch (input);
}

Strings such as “1234-5678-8765-4321” pass the test just fine; strings such as “1234567887654321” and “1234-ABCD-8765-4321” do not. The ^ and $ characters anchor the match to the beginning and end of the input string, respectively (or of each line when RegexOptions.Multiline is specified). Without these characters, strings such as “12345-5678-8765-4321” would pass, even though you didn’t intend for them to. Regular expressions such as this are often used to perform cursory validations on credit card numbers. If you’d like, you can replace “[0-9]” in a regular expression with “\d”. Thus, the expression

"^\d{4}-\d{4}-\d{4}-\d{4}$"

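is equivalent to the one above for ordinary ASCII digits. In .NET, “\d” also matches decimal digits from other Unicode scripts unless RegexOptions.ECMAScript is specified.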
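The sketch below runs the “\d” form of the expression against the sample strings discussed above. The class name CardFormatDemo is hypothetical. Note that in C# source the pattern needs either a verbatim string (the @ prefix used here) or doubled backslashes, because “\d” is not a valid escape sequence in an ordinary string literal.

```csharp
using System;
using System.Text.RegularExpressions;

class CardFormatDemo
{
    // Verbatim string (@) so that \d needs no extra escaping for the compiler
    static readonly Regex cardPattern =
        new Regex (@"^\d{4}-\d{4}-\d{4}-\d{4}$");

    public static bool IsValid (string input)
    {
        return cardPattern.IsMatch (input);
    }

    static void Main ()
    {
        string[] samples = {
            "1234-5678-8765-4321",   // passes
            "1234567887654321",      // fails: no hyphens
            "1234-ABCD-8765-4321",   // fails: letters in second group
            "12345-5678-8765-4321"   // fails: five digits in first group
        };
        foreach (string sample in samples)
            Console.WriteLine ("{0}: {1}", sample, IsValid (sample));
    }
}
```

A check like this confirms only the shape of the input. It says nothing about whether the number belongs to a real card.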

Figure 3-4 contains the source code for a grep-like utility named NetGrep that uses IsMatch to scan a file for lines containing text that matches a regular expression. Both the file name and the regular expression are entered on the command line. The following command lists all the lines in Index.html that contain anchor tags:

netgrep index.html "<a[^>]*>"

This command displays all lines in Readme.txt that contain numbers consisting of two or more digits:

netgrep readme.txt "\d{2,}"

In the source code listing, note the format specifier used in the WriteLine call. The “D5” in “{0:D5}” specifies that the line number should be formatted as a decimal value padded with leading zeros to a minimum of five digits (for example, 00001).

NetGrep.cs

using System;
using System.IO;
using System.Text.RegularExpressions;
class MyApp
{
    static void Main (string[] args)
    {
        // Make sure a file name and regular expression were entered
        if (args.Length < 2) {
            Console.WriteLine ("Syntax: NETGREP filename expression");
            return;
        }

        StreamReader reader = null;
        int linenum = 1;

        try {
            // Initialize a Regex object with the regular expression
            // entered on the command line
            Regex regex = new Regex (args[1], RegexOptions.IgnoreCase);

            // Iterate through the file a line at a time and
            // display all lines that contain a pattern matching the
            // regular expression
            reader = new StreamReader (args[0]);
            for (string line = reader.ReadLine (); line != null;
                line = reader.ReadLine (), linenum++) {
                if (regex.IsMatch (line))
                    Console.WriteLine ("{0:D5}: {1}", linenum, line);
            }
        }
        catch (Exception e) {
            Console.WriteLine (e.Message);
        }
        finally {
            if (reader != null)
                reader.Close ();

        }
    }
}

Figure 3-4

NetGrep source code.
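To see the effect of the D5 format specifier used in the listing in isolation, consider the following small sketch. FormatLine and LineNumberFormatDemo are hypothetical names. The format string itself is the one NetGrep passes to WriteLine.

```csharp
using System;

class LineNumberFormatDemo
{
    // Hypothetical helper that applies NetGrep's "{0:D5}: {1}" format string
    public static string FormatLine (int linenum, string line)
    {
        return String.Format ("{0:D5}: {1}", linenum, line);
    }

    static void Main ()
    {
        // Short numbers are padded with leading zeros
        Console.WriteLine (FormatLine (1, "first line"));      // 00001: first line

        // Numbers longer than five digits are not truncated
        Console.WriteLine (FormatLine (123456, "later line")); // 123456: later line
    }
}
```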

IsMatch tells you whether a string contains text matching a regular expression, but it doesn’t tell you where in the string the match is located or how many matches there are. That’s what the Match method is for. The following example displays all the Hrefs in Index.html that are followed by URLs enclosed in quotation marks. The metacharacter “\s” in a regular expression denotes whitespace; “\s” followed by an asterisk (“\s*”) means zero or more consecutive whitespace characters:

Regex regex = new Regex ("href\\s*=\\s*\"[^\"]*\"", RegexOptions.IgnoreCase);

StreamReader reader = new StreamReader ("Index.html");

for (string line = reader.ReadLine (); line != null;
    line = reader.ReadLine ()) {
    for (Match m = regex.Match (line); m.Success; m = m.NextMatch ()) 
        Console.WriteLine (m.Value);
}

The Match method returns a Match object indicating either that a match was found (Match.Success == true) or that no match was found (Match.Success == false). A Match object representing a successful match exposes the text that produced the match through its Value property. If Match.Success is true and the input string contains additional matches, you can iterate through the remaining matches with Match.NextMatch.

If the input string contains (or might contain) multiple matches and you want to enumerate them all, the Matches method offers a slightly more elegant way of doing it. The following example is functionally equivalent to the one above:

Regex regex = new Regex ("href\\s*=\\s*\"[^\"]*\"", RegexOptions.IgnoreCase);

StreamReader reader = new StreamReader ("Index.html");

for (string line = reader.ReadLine (); line != null;
    line = reader.ReadLine ()) {
    MatchCollection matches = regex.Matches (line);
    foreach (Match match in matches)
        Console.WriteLine (match.Value);
}

Matches returns a collection of Match objects in a MatchCollection whose contents can be iterated over with foreach. Each Match represents one match in the input string.
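The two things IsMatch can’t report, namely how many matches there are and where they are located, can be read from the MatchCollection and from each Match. The following is a minimal sketch that works on an in-memory string rather than a file. The class name MatchPositionDemo, the helper CountHrefs, and the sample line are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

class MatchPositionDemo
{
    static readonly Regex hrefPattern =
        new Regex ("href\\s*=\\s*\"[^\"]*\"", RegexOptions.IgnoreCase);

    // Hypothetical helper: number of quoted Hrefs in a line of text
    public static int CountHrefs (string line)
    {
        return hrefPattern.Matches (line).Count;
    }

    static void Main ()
    {
        string line =
            "<a href=\"a.html\">A</a> <a HREF = \"b.html\">B</a>";

        Console.WriteLine (CountHrefs (line));   // 2

        // Index and Length locate each match within the input string
        foreach (Match match in hrefPattern.Matches (line))
            Console.WriteLine ("{0} at {1}, length {2}",
                match.Value, match.Index, match.Length);
    }
}
```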

Match objects have a property named Groups that permits substrings within a match to be identified. Let’s say you want to scan an HTML file for Hrefs, and for each Href that Regex finds, you want to extract the target of that Href (for example, the dotnet.html in href="dotnet.html"). You can do that by using parentheses to define a group in the regular expression and then using the Match object’s Groups collection to access the group. Here’s an example:

Regex regex = new Regex ("href\\s*=\\s*\"([^\"]*)\"", RegexOptions.IgnoreCase);

StreamReader reader = new StreamReader ("Index.html");

for (string line = reader.ReadLine (); line != null;
    line = reader.ReadLine ()) {
    MatchCollection matches = regex.Matches (line);
    foreach (Match match in matches)
        Console.WriteLine (match.Groups[1]);
}

Notice the parentheses that now surround the part of the regular expression that corresponds to all characters between the quotation marks. That defines those characters as a group. In the Match object’s Groups collection, Groups[0] identifies the full text of the match and Groups[1] identifies the subset of the match in parentheses. Thus, if Index.html contains the following line:

<a href="help.html">Click here for help</a>

both Value and Groups[0] evaluate to the text

href="help.html"

Groups[1], however, evaluates to

help.html

Groups can even be nested, meaning that virtually any subset of the text identified by a regular expression (or subset of a subset) can be extracted following a successful match.
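Here is a minimal sketch of nested groups, assuming an Href whose target is an absolute URL. The pattern, the class name NestedGroupDemo, the helper GetGroup, and the example.com address are all hypothetical. Groups are numbered by the position of their opening parenthesis, counting from left to right, so the outer group is 1 and the group nested inside it is 2.

```csharp
using System;
using System.Text.RegularExpressions;

class NestedGroupDemo
{
    // Group 1 is the whole quoted target; group 2, nested inside it,
    // is the scheme that precedes "://"
    static readonly Regex hrefPattern = new Regex (
        "href\\s*=\\s*\"(([a-z]+)://[^\"]*)\"", RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the text of the given group,
    // or null if the line contains no match
    public static string GetGroup (string line, int group)
    {
        Match match = hrefPattern.Match (line);
        return match.Success ? match.Groups[group].Value : null;
    }

    static void Main ()
    {
        string line =
            "<a href=\"http://www.example.com/help.html\">Help</a>";

        Console.WriteLine (GetGroup (line, 0)); // href="http://www.example.com/help.html"
        Console.WriteLine (GetGroup (line, 1)); // http://www.example.com/help.html
        Console.WriteLine (GetGroup (line, 2)); // http
    }
}
```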

Replacing Strings

If you decide to embellish NetGrep with the capability to perform search-and-replace, you’ll love Regex.Replace, which replaces text matching the regular expression in the Regex object with text you pass as an input parameter. The following example replaces all occurrences of “Hello” with “Goodbye” in the string named input:

Regex regex = new Regex ("Hello");
string output = regex.Replace (input, "Goodbye");

The next example strips everything in angle brackets from the input string by replacing expressions in angle brackets with empty strings:

Regex regex = new Regex ("<[^>]*>");
string output = regex.Replace (input, "");
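Both replacements can be tried on concrete input with a sketch like the following. The class name ReplaceDemo, the helper StripTags, and the “Hello” sentence are hypothetical. The HTML is the sample used earlier with Regex.Split.

```csharp
using System;
using System.Text.RegularExpressions;

class ReplaceDemo
{
    static readonly Regex tagPattern = new Regex ("<[^>]*>");

    // Removes everything in angle brackets from the input
    public static string StripTags (string input)
    {
        return tagPattern.Replace (input, "");
    }

    static void Main ()
    {
        string html = "<b>Every</b>good<h3>boy</h3>does<b>fine</b>";
        Console.WriteLine (StripTags (html));   // Everygoodboydoesfine

        Regex hello = new Regex ("Hello");
        Console.WriteLine (
            hello.Replace ("Hello, world. Hello again.", "Goodbye"));
        // Goodbye, world. Goodbye again.
    }
}
```

Unlike Split, Replace leaves no boundary between the pieces of text that surrounded each tag, which is why the words run together in the first result.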

A basic knowledge of regular expressions (and a helping hand from Regex) can go a long way when it comes to parsing and manipulating text in .NET Framework applications.