Regular Expressions

One of the lesser-known but potentially most useful classes in the .NET Framework class library is Regex, which belongs to the System.Text.RegularExpressions namespace. Regex represents regular expressions, a language for parsing and manipulating text. (A full treatment of the language is beyond the scope of this book, but wonderful tutorials are available both in print and online.) Regex supports three basic types of operations: splitting strings into substrings, searching strings for text that matches a pattern, and replacing the text that matches a pattern.

One very practical use for regular expressions is to validate user input. It's trivial, for example, to use a regular expression to verify that a string entered into a credit card field conforms to a pattern that's consistent with credit card numbers, that is, digits possibly separated by hyphens. You'll see an example of such usage in a moment.

Another common use for regular expressions is to do screen scraping. Say you want to write an app that displays stock prices gathered from a real-time (or near real-time) data source. One approach is to send an HTTP request to a Web site such as Nasdaq.com and "screen scrape" the prices from the HTML returned in the response. Regex simplifies the task of parsing HTML. The downside to screen scraping, of course, is that your app may cease to work if the format of the data changes. (I know because I once wrote an app that used screen scraping to fetch stock prices, and the day after I published it, my data source changed the HTML format of its Web pages.) But unless you can find a data source that provides the information you want as XML, screen scraping might be your only choice.

When you create a Regex object, you pass to the class constructor the regular expression to encapsulate:

Regex regex = new Regex ("[a-z]");

In the language of regular expressions, "[a-z]" means any lowercase letter of the alphabet. You can also pass a second parameter specifying Regex options. For example, the statement

Regex regex = new Regex ("[a-z]", RegexOptions.IgnoreCase);

creates a Regex object that matches any letter of the alphabet without regard to case. If the regular expression passed to the Regex constructor is invalid, Regex throws an ArgumentException.
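
The constructor's behavior with an invalid pattern can be sketched as follows. This is a minimal illustration rather than a recommended validation routine; the class name PatternCheck and the helper IsValidPattern are hypothetical names introduced here.

```csharp
using System;
using System.Text.RegularExpressions;

static class PatternCheck
{
    // Hypothetical helper: reports whether Regex accepts the pattern.
    public static bool IsValidPattern(string pattern)
    {
        try
        {
            new Regex(pattern);
            return true;
        }
        catch (ArgumentException)
        {
            // The constructor rejects patterns it cannot parse.
            return false;
        }
    }

    static void Main()
    {
        Console.WriteLine(IsValidPattern("[a-z]")); // well-formed character class
        Console.WriteLine(IsValidPattern("[a-z"));  // missing the closing bracket
    }
}
```

Catching the exception at construction time is one way to reject a bad pattern (one typed by a user, for instance) before it is applied to any text.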

Once a Regex object is initialized, you call methods on it to apply the regular expression to strings of text. The following sections describe how to put Regex to work in managed applications and offer examples regarding its use.

Splitting Strings

Regex.Split splits strings into constituent parts by using a regular expression to identify separators. Here's an example that divides a path name into drive and directory names:

Regex regex = new Regex (@"\\");
string[] parts = regex.Split (@"c:\inetpub\wwwroot\wintellect");
foreach (string part in parts)
    Console.WriteLine (part);

And here's the output:

c:
inetpub
wwwroot
wintellect

Notice the double backslash passed to Regex's constructor. The @ preceding the string literal prevents you from having to escape the backslash for the compiler's sake, but because the backslash is also an escape character in regular expressions, you have to escape a backslash with a backslash to form a valid regular expression.
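
The two layers of escaping can be made visible with a short sketch. It relies on Regex.Escape, a static method of the Regex class that is not used elsewhere in this section; the class name EscapeDemo and the helper SplitPath are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class EscapeDemo
{
    // Hypothetical helper: splits a path on backslashes, letting
    // Regex.Escape build the escaped pattern from a literal backslash.
    public static string[] SplitPath(string path)
    {
        Regex regex = new Regex(Regex.Escape(@"\"));
        return regex.Split(path);
    }

    static void Main()
    {
        // Without the @ prefix, each backslash must be doubled again
        // for the compiler, so both literals hold the same two characters.
        Console.WriteLine(@"\\" == "\\\\");

        foreach (string part in SplitPath(@"c:\inetpub\wwwroot\wintellect"))
            Console.WriteLine(part);
    }
}
```

Letting Regex.Escape produce the pattern is one option when the separator is meant as literal text rather than as a regular expression.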

The fact that Split identifies separators using full-blown regular expressions makes for some interesting possibilities. For example, suppose you wanted to extract the text from the following HTML by stripping out everything in angle brackets:

<b>Every</b>good<h3>boy</h3>does<b>fine</b>

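Here's the code to do it:

Regex regex = new Regex ("<[^>]*>");
string[] parts =
    regex.Split ("<b>Every</b>good<h3>boy</h3>does<b>fine</b>");
foreach (string part in parts)
    Console.WriteLine (part);

And here's the output, not counting the blank lines produced by the empty strings that Split returns before the leading tag and after the trailing one: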

Every
good
boy
does
fine

The regular expression "<[^>]*>" means anything that begins with an opening angle bracket ("<"), followed by zero or more characters that are not closing angle brackets ("[^>]*"), followed by a closing angle bracket (">").
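
One detail worth seeing in code: when the input begins or ends with a separator, Regex.Split includes an empty string at that end of the returned array. The sketch below filters those entries out; the class name TagStripper and the helper ExtractText are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TagStripper
{
    // Hypothetical helper: returns only the non-empty text between tags.
    public static List<string> ExtractText(string html)
    {
        Regex regex = new Regex("<[^>]*>");
        List<string> parts = new List<string>();
        foreach (string part in regex.Split(html))
        {
            // Skip the empty strings produced by leading, trailing,
            // or adjacent tags.
            if (part.Length > 0)
                parts.Add(part);
        }
        return parts;
    }

    static void Main()
    {
        string html = "<b>Every</b>good<h3>boy</h3>does<b>fine</b>";
        foreach (string part in ExtractText(html))
            Console.WriteLine(part);
    }
}
```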

With Regex.Split to lend a hand, you could simplify this chapter's WordCount utility considerably. Rather than having the GetWords method manually parse a line of text into words, you could rewrite GetWords to split the line using a regular expression that identifies sequences of one or more nonalphanumeric characters as separators. Then you could delete the GetNextWord method altogether.
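
A rewritten GetWords might look something like the sketch below. It is an assumption about the shape of the method, not the WordCount source itself: the signature, the containing class WordSplitter, and the choice of "[^a-zA-Z0-9]+" as the separator pattern are all illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class WordSplitter
{
    // One or more nonalphanumeric characters act as a separator.
    static readonly Regex Separators = new Regex("[^a-zA-Z0-9]+");

    // Hypothetical stand-in for WordCount's GetWords method.
    public static string[] GetWords(string line)
    {
        List<string> words = new List<string>();
        foreach (string word in Separators.Split(line))
        {
            // Split yields empty strings when the line starts or ends
            // with a separator, so drop them.
            if (word.Length > 0)
                words.Add(word);
        }
        return words.ToArray();
    }

    static void Main()
    {
        foreach (string word in GetWords("The quick, brown fox!"))
            Console.WriteLine(word);
    }
}
```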

Searching Strings

Perhaps the most common use for Regex is to search strings for substrings matching a specified pattern. Regex includes three methods for searching strings and identifying the matches: Match, Matches, and IsMatch.

The simplest of the three is IsMatch. It provides a simple yes or no answer revealing whether an input string contains a match for the text represented by a regular expression. Here's a sample that checks an input string for HTML anchor tags (<a>):

Regex regex = new Regex ("<a[^>]*>", RegexOptions.IgnoreCase);
if (regex.IsMatch (input)) {
    // Input contains an anchor tag
}
else {
    // Input does NOT contain an anchor tag
}

Another use for IsMatch is to validate user input. The following method returns true if the input string contains 16 digits grouped into fours separated by hyphens, and false if it does not:

bool IsValid (string input)
{
    Regex regex = new Regex ("^[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4}$");
    return regex.IsMatch (input);
}

Strings such as "1234-5678-8765-4321" pass the test just fine; strings such as "1234567887654321" and "1234-ABCD-8765-4321" do not. The ^ and $ characters anchor the match to the beginning and end of the input string, respectively. Without them, strings such as "12345-5678-8765-4321" would pass, even though you didn't intend for them to. Regular expressions such as this are often used to perform cursory validations on credit card numbers. If you'd like, you can replace "[0-9]" in a regular expression with "\d". Thus, the expression

@"^\d{4}-\d{4}-\d{4}-\d{4}$"

is nearly equivalent to the one above. (The @ keeps the C# compiler from treating \d as a string escape sequence. By default, \d in .NET also matches decimal digits from other Unicode scripts; specify RegexOptions.ECMAScript to restrict it to 0 through 9.)
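
The effect of the ^ and $ anchors is easy to check side by side. In this sketch, the class name CardFormat and the method IsValidLoose are hypothetical; IsValidLoose exists only to show what happens when the anchors are left out.

```csharp
using System;
using System.Text.RegularExpressions;

static class CardFormat
{
    static readonly Regex Anchored =
        new Regex("^[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4}$");
    static readonly Regex Unanchored =
        new Regex("[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4}");

    // The entire input must fit the pattern.
    public static bool IsValid(string input)
    {
        return Anchored.IsMatch(input);
    }

    // Hypothetical variant: any substring that fits is enough.
    public static bool IsValidLoose(string input)
    {
        return Unanchored.IsMatch(input);
    }

    static void Main()
    {
        string[] samples = {
            "1234-5678-8765-4321",
            "1234567887654321",
            "1234-ABCD-8765-4321",
            "12345-5678-8765-4321"
        };
        foreach (string s in samples)
            Console.WriteLine("{0}: anchored={1}, unanchored={2}",
                s, IsValid(s), IsValidLoose(s));
    }
}
```

Only the last sample produces different answers: the unanchored pattern finds a match starting at the second character, while the anchored one rejects the string.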

Figure 3-4 contains the source code for a grep-like utility named NetGrep that uses IsMatch to parse a file for lines of text containing text matching a regular expression. Both the file name and the regular expression are entered on the command line. The following command lists all the lines in Index.html that contain anchor tags:

netgrep index.html "<a[^>]*>"

This command displays all lines in Readme.txt that contain numbers consisting of two or more digits:

netgrep readme.txt "\d{2,}"

In the source code listing, note the format specifier used in the WriteLine call. The "D5" in "{0:D5}" specifies that the line number should be formatted as a decimal integer with at least five digits, padded with leading zeros as needed (for example, 00001).
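
The "D5" specifier can be tried on its own with a few lines of code. The class name LineLabel and its Format method are hypothetical; the format string is the one NetGrep passes to WriteLine.

```csharp
using System;

static class LineLabel
{
    // Hypothetical helper: builds the same prefix NetGrep prints.
    public static string Format(int lineNumber, string text)
    {
        return string.Format("{0:D5}: {1}", lineNumber, text);
    }

    static void Main()
    {
        Console.WriteLine(Format(1, "first line"));            // 00001: first line
        Console.WriteLine(Format(123456, "a very late line")); // 123456: a very late line
    }
}
```

Note that the 5 is a minimum: a number with more than five digits is printed in full rather than truncated.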

NetGrep.cs
using System;
using System.IO;
using System.Text.RegularExpressions;
class MyApp
{
    static void Main (string[] args)
    {
        // Make sure a file name and regular expression were entered
        if (args.Length < 2) {
            Console.WriteLine ("Syntax: NETGREP filename expression");
            return;
        }

        StreamReader reader = null;
        int linenum = 1;

        try {
            // Initialize a Regex object with the regular expression
            // entered on the command line
            Regex regex = new Regex (args[1], RegexOptions.IgnoreCase);

            // Iterate through the file a line at a time and
            // display all lines that contain a pattern matching the
            // regular expression
            reader = new StreamReader (args[0]);
            for (string line = reader.ReadLine (); line != null;
                line = reader.ReadLine (), linenum++) {
                if (regex.IsMatch (line))
                    Console.WriteLine ("{0:D5}: {1}", linenum, line);
            }
        }
        catch (Exception e) {
            Console.WriteLine (e.Message);
        }
        finally {
            if (reader != null)
                reader.Close ();
        }
    }
}
Figure 3-4
NetGrep source code.

IsMatch tells you whether a string contains text matching a regular expression, but it doesn't tell you where in the string the match is located or how many matches there are. That's what the Match method is for. The following example displays all the Hrefs in Index.html that are followed by URLs enclosed in quotation marks. The metacharacter "\s" in a regular expression denotes whitespace; "\s" followed by an asterisk ("\s*") means zero or more consecutive whitespace characters:

Regex regex = new Regex ("href\\s*=\\s*\"[^\"]*\"", RegexOptions.IgnoreCase);

StreamReader reader = new StreamReader ("Index.html");

for (string line = reader.ReadLine (); line != null;
    line = reader.ReadLine ()) {
    for (Match m = regex.Match (line); m.Success; m = m.NextMatch ())
        Console.WriteLine (m.Value);
}

// Release the file when finished
reader.Close ();

The Match method returns a Match object indicating either that a match was found (Match.Success == true) or that no match was found (Match.Success == false). A Match object representing a successful match exposes the text that produced the match through its Value property. If Match.Success is true and the input string contains additional matches, you can iterate through the remaining matches with Match.NextMatch.

If the input string contains (or might contain) multiple matches and you want to enumerate them all, the Matches method offers a slightly more elegant way of doing it. The following example is functionally equivalent to the one above:

Regex regex = new Regex ("href\\s*=\\s*\"[^\"]*\"", RegexOptions.IgnoreCase);

StreamReader reader = new StreamReader ("Index.html");

for (string line = reader.ReadLine (); line != null;
    line = reader.ReadLine ()) {
    MatchCollection matches = regex.Matches (line);
    foreach (Match match in matches)
        Console.WriteLine (match.Value);
}

// Release the file when finished
reader.Close ();

Matches returns a collection of Match objects in a MatchCollection whose contents can be iterated over with foreach. Each Match represents one match in the input string.
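
Besides Value, each Match also records where the match was found through its Index and Length properties. The sketch below reports all three for every Href on a line; the class name HrefFinder, the helper Describe, and the sample HTML are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class HrefFinder
{
    // Hypothetical helper: describes each Href found in one line of text.
    public static List<string> Describe(string line)
    {
        Regex regex = new Regex("href\\s*=\\s*\"[^\"]*\"",
            RegexOptions.IgnoreCase);
        List<string> results = new List<string>();
        foreach (Match m in regex.Matches(line))
        {
            // Index is the zero-based offset of the match in the input;
            // Length is the number of characters matched.
            results.Add(string.Format("{0} at index {1}, length {2}",
                m.Value, m.Index, m.Length));
        }
        return results;
    }

    static void Main()
    {
        string line = "<a href=\"a.html\">A</a> <a HREF = \"b.html\">B</a>";
        foreach (string description in Describe(line))
            Console.WriteLine(description);
    }
}
```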

Match objects have a property named Groups that permits substrings within a match to be identified. Let's say you want to scan an HTML file for Hrefs, and for each Href that Regex finds, you want to extract the target of that Href (for example, the dotnet.html in href="dotnet.html"). You can do that by using parentheses to define a group in the regular expression and then using the Match object's Groups collection to access the group. Here's an example:

Regex regex = new Regex ("href\\s*=\\s*\"([^\"]*)\"", RegexOptions.IgnoreCase);

StreamReader reader = new StreamReader ("Index.html");

for (string line = reader.ReadLine (); line != null;
    line = reader.ReadLine ()) {
    MatchCollection matches = regex.Matches (line);
    foreach (Match match in matches)
        Console.WriteLine (match.Groups[1]);
}

// Release the file when finished
reader.Close ();

Notice the parentheses that now surround the part of the regular expression that corresponds to all characters between the quotation marks. That defines those characters as a group. In the Match object's Groups collection, Groups[0] identifies the full text of the match and Groups[1] identifies the subset of the match in parentheses. Thus, if Index.html contains the following line:

<a href="help.html">Click here for help</a>

both Value and Groups[0] evaluate to the text

href="help.html"

Groups[1], however, evaluates to

help.html

Groups can even be nested, meaning that virtually any subset of the text identified by a regular expression (or subset of a subset) can be extracted following a successful match.
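
Nested groups are numbered by the position of their opening parentheses, counting from left to right. The sketch below extends the Href pattern so that the whole target, the page name, and an optional fragment each get their own group; the extended pattern, the class name HrefParts, and the helper Parse are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class HrefParts
{
    // Group 1: whole target; group 2: page; group 3: optional #fragment.
    static readonly Regex Href = new Regex(
        "href\\s*=\\s*\"(([^\"#]*)(#[^\"]*)?)\"",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the three groups, or null if no match.
    public static string[] Parse(string html)
    {
        Match m = Href.Match(html);
        if (!m.Success)
            return null;
        return new string[] {
            m.Groups[1].Value,
            m.Groups[2].Value,
            m.Groups[3].Value  // empty when the fragment is absent
        };
    }

    static void Main()
    {
        string[] parts = Parse("<a href=\"help.html#install\">Install</a>");
        Console.WriteLine(parts[0]); // help.html#install
        Console.WriteLine(parts[1]); // help.html
        Console.WriteLine(parts[2]); // #install
    }
}
```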

Replacing Strings

If you decide to embellish NetGrep with the capability to perform search-and-replace, you'll love Regex.Replace, which replaces text matching the regular expression in the Regex object with text you pass as an input parameter. The following example replaces all occurrences of "Hello" with "Goodbye" in the string named input:

Regex regex = new Regex ("Hello");
string output = regex.Replace (input, "Goodbye");

The next example strips everything in angle brackets from the input string by replacing expressions in angle brackets with empty strings:

Regex regex = new Regex ("<[^>]*>");
string output = regex.Replace (input, "");
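
Both kinds of replacement can be wrapped in small helpers, as in the sketch below. The second helper goes one step beyond the examples above by using "$1" in the replacement text, which .NET substitutes with whatever the first group captured. The class name TextRewriter and both helper names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class TextRewriter
{
    // Removes everything in angle brackets.
    public static string StripTags(string input)
    {
        Regex regex = new Regex("<[^>]*>");
        return regex.Replace(input, "");
    }

    // Hypothetical example of a group substitution: rewrites <b>text</b>
    // as <strong>text</strong>, keeping the enclosed text via $1.
    public static string BoldToStrong(string input)
    {
        Regex regex = new Regex("<b>([^<]*)</b>", RegexOptions.IgnoreCase);
        return regex.Replace(input, "<strong>$1</strong>");
    }

    static void Main()
    {
        string html = "<b>Every</b>good<h3>boy</h3>does<b>fine</b>";
        Console.WriteLine(StripTags(html));    // Everygoodboydoesfine
        Console.WriteLine(BoldToStrong(html));
    }
}
```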

A basic knowledge of regular expressions (and a helping hand from Regex) can go a long way when it comes to parsing and manipulating text in .NET Framework applications.