Regular Expressions in .NET

To my mind there is nothing regular about regular expressions. Consider this regular expression: href\s*=\s*(?:"(?<1>[^"]*)"|(?<1>\S+)).^[1] Now what is so regular about this? I think that the expression "Find all the href="…" values and their locations in a string" is a lot more regular.

Although my English language expression may seem more regular and is certainly more humanly readable, the .NET Framework interprets the actual regular expression very nicely. As you know, being a programmer forces you to talk to the computer in its own language.^[2] Make no mistake, "Regular Expression" may not be a language, but it has a syntax just like any other computer language such as C# or VB.

So, what are regular expressions used for and how do you use them in .NET? Regular expressions are used to parse strings. However, this is a bit simplistic. Parsing strings brings to mind reading a line of text and extracting a substring from that text. Parsing also brings to mind extracting information from a comma-delimited file.

Regular expressions are much more than just parsing, though. Regular expressions in .NET can extract, insert, change, and delete any pattern in any string either forward or backward. This is powerful stuff.

Were regular expressions invented for .NET? No, they have been around longer than I have been programming, which is a very long time. For those real oldies who used the code editor Brief, regular expressions were a part of daily programming life. Those of you from the UNIX world dreamed in regular expression syntax.

The RegularExpressions Namespace

A whole namespace is devoted to regular expressions and their use: System.Text.RegularExpressions. In here you will find eight classes and one enumeration devoted to regular expressions. You will even find a delegate, MatchEvaluator, that you can hook into for custom processing of each match during a replace operation. I go over these classes lightly in this section just to give you an idea of what they are used for. After this I charge headlong into the geeky world of regular expression usage.

The Capture class provides a result from a regular expression's subexpression capture. This means that if you use regular expressions to extract a substring, this is where you would find the answer. This class is immutable, meaning its properties cannot be changed. You will not be able to instantiate this class, and it does not appear by magic. Instead, you get an instance of this class from the CaptureCollection class.

The CaptureCollection class is a collection of Capture objects.^[3] So, as you probably guessed, this is the entire set of substrings captured by a single capturing group during a regular expression search.

If you have a complicated regular expression that includes more than one substring search, how do you get the instances of the Capture class? You can't get them all from a CaptureCollection, because this gives you only the Capture instances for a single subexpression. What you need are groups.

The Group class represents the results from a single capturing group. Its Captures property holds that group's CaptureCollection. A group that matched once holds one Capture object. A group with a quantifier applied to it holds one Capture object for each time it matched.

Note

You definitely should know that collections could contain collections ad infinitum. If you don't, then study the collection classes for some more examples.

Of course, you cannot have a Group object without a collection of Group objects to hold it. This is the GroupCollection class. It contains a set of groups resulting from a regular expression match.

If you were to supply a regular expression to the framework to evaluate, you would naturally need an object as a result. This object is the Match class. The Match class is derived from the Group class, which is in turn derived from the Capture class. Therefore, the Match class holds all the results from a single regular expression call. It is in the returned Match object that you start digging for your results.

The last biggie in this list of classes is the Regex class. This class holds the regular expression that you need evaluated. If you call Regex.Match with a regular expression, you will get back a Match object. Check the Success property of this object for any hits.
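To see how these classes hang together, here is a minimal sketch that walks from a Match down to its Group and Capture objects. The class name CaptureTour, the helper MatchWords, and the sample text are my own inventions for illustration, not anything from the framework. The pattern repeats one capturing group, so that single group ends up holding several captures.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CaptureTour
{
    // Hypothetical helper: matches a whole line of space-separated words.
    // Group 1 sits inside a repeated non-capturing group, so it captures
    // once per word.
    public static Match MatchWords(string text)
    {
        return Regex.Match(text, @"^(?:(\w+)\s*)+$");
    }

    public static void Main()
    {
        Match m = MatchWords("one two three");
        Console.WriteLine("Success: " + m.Success);

        // Groups[0] is the whole match; Groups[1] is the capturing group.
        Console.WriteLine("Groups:  " + m.Groups.Count);

        // A Group's Value is its last capture.
        Group g = m.Groups[1];
        Console.WriteLine("Value:   " + g.Value);

        // The Captures property holds every capture the group made.
        foreach (Capture c in g.Captures)
            Console.WriteLine("  {0} at index {1}", c.Value, c.Index);
    }
}
```

The thing to notice is that Groups[1].Value only gives you the last word. You have to dig into the Captures collection to get all three.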

The Regular Expression Syntax

Before I go on, you need to know a little about the regular expression syntax. First in the list are the escape characters. An escape character is a backslash followed by a special character or set of characters. For instance, in C-derived languages the escape character \n means newline, which most of the time gets converted to a carriage return/linefeed pair. Table 8-1 contains a list of regular expression escape characters.

Table 8-1: Regular Expression Escape Characters
Character Sequence	Meaning
\a	Bell character.
\b	Word boundary. Inside brackets it means backspace.
\t	Tab.
\r	Carriage return.
\v	Vertical tab.
\f	Form feed.
\n	Newline.
\e	Escape.
\nnn	ASCII character given as an octal number of two or three digits; \040 is a space.
\xnn	ASCII character given as a two-digit hex number; \x20 is a space.
\cA	ASCII control character. This is Ctrl-A.

There are a few other characters, including back references, that are beyond the scope of this book. This next table is your first foray into the simple use of regular expressions. Table 8-2 contains a list of character matching commands.

Table 8-2: Character Matching Commands
Command	Meaning
[abcd]	Matches any one character in the brackets.
[^abcd]	Matches any one character not in the brackets.
[0-9]	The dash is used as an extender; same as [0123456789].
.	The period matches any character except newline.
\p{name}	Matches any character in the named character class.
\P{name}	Matches any character not in the named character class.
\w	Matches any word character (a letter, digit, or underscore).
\W	Matches any nonword character.
\s	Matches any white space character.
\S	Matches any non-white-space character.
\d	Matches any decimal digit; same as [0-9] for ASCII text.
\D	Matches any nondigit character; same as [^0-9] for ASCII text.
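A quick way to get a feel for Table 8-2 is to count how many characters in a string each command matches. The following is only a sketch. The class name ClassCounter, the helper Count, and the sample text "Room 42-B" are made up for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ClassCounter
{
    // Hypothetical helper: how many separate matches does the pattern find?
    public static int Count(string text, string pattern)
    {
        return Regex.Matches(text, pattern).Count;
    }

    public static void Main()
    {
        string text = "Room 42-B";
        string[] patterns = { @"\d", @"\D", @"\w", @"\W", @"\s", "[A-Z]", "[^0-9]", "." };

        // Each pattern matches one character position at a time,
        // so the count is the number of characters in that class.
        foreach (string p in patterns)
            Console.WriteLine("{0,-8} {1}", p, Count(text, p));
    }
}
```

Each command and its negation split the nine characters between them: \d finds 2 and \D finds 7, while \w finds 7 and \W finds the space and the dash.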

Is this all you need to know about the grammar? No. There is a set of commands called quantifiers that you can use to add additional information to the search pattern. The quantifiers apply only to that group or character class that precedes them. So, for example, I could have the expression [abcd]?. This means that the question mark quantifier acts on the bracket pattern. In this case, instead of matching exactly one character from abcd, it matches zero or one of them. Table 8-3 shows the list of quantifiers.

Table 8-3: Regular Expression Quantifiers
Character(s)	Meaning
*	Zero or more matches
+	One or more matches
?	Zero or one match
{n}	Exactly n matches
{n,}	At least n matches
{n,m}	At least n but no more than m matches
*?	Gets first match that consumes the fewest repeats
+?	Specifies as few repeats as possible but at least one repeat
??	Gets match using zero repeats if possible
{n}?	Same as {n}
{n,}?	Gets at least n matches with as few repeats as possible
{n,m}?	Gets at least n to m matches with as few repeats as possible

Notice the overuse of the question mark? A question mark placed after another quantifier makes that quantifier lazy. Usually the regular expression engine is greedy; it tries to match as many repeats as possible within the constraints you gave it. The lazy quantifier tells the engine to match only what is necessary to achieve a match and nothing more. In essence, a quantifier token tells the parser how many times the previous expression should be matched. Once you start working with regular expressions, you will see that loosely written quantifiers can cause major performance hits when running the regular expression engine. The trick is to be as specific as possible in your expression.
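The difference between greedy and lazy is easiest to see side by side. This is a small sketch rather than anything official. The class name LazyDemo, the helper First, and the sample markup are assumptions of mine.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyDemo
{
    // Hypothetical helper: returns the text of the first match only.
    public static string First(string text, string pattern)
    {
        return Regex.Match(text, pattern).Value;
    }

    public static void Main()
    {
        string text = "<b>bold</b>";

        // Greedy: .+ grabs as much as it can and still ends on a >
        Console.WriteLine(First(text, "<.+>"));   // <b>bold</b>

        // Lazy: .+? stops at the first > it can reach
        Console.WriteLine(First(text, "<.+?>"));  // <b>
    }
}
```

Both patterns succeed. The only thing the extra question mark changes is how much text the quantifier is willing to swallow before the rest of the pattern gets its turn.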

Here are some examples of regular expressions and their results. The text being searched is the same in every example. The only thing that changes is the quantifier. If you are unfamiliar with regular expressions, hopefully this will clarify things for you.

Make a small console application in either VB or C#. The code for this application is shown here.

using System;
using System.Text.RegularExpressions;

namespace RegX_c
{
  class Class1
  {
    [STAThread]
    static void Main(string[] args)
    {
      Regex r = new Regex("Sp[ace] [1-9]*");

      for (Match m = r.Match("Space 1999 Spac 1999 Spa 1999 Sp 1999");
           m.Success; m = m.NextMatch())
        Console.WriteLine(m.Value);

      Console.ReadLine();
    }
  }
}

Option Strict On

Imports System
Imports System.Text.RegularExpressions

Module Module1

  Sub Main()
    Dim m As Match
    Dim text As String = "Space 1999 Spac 1999 Spa 1999 Sp 1999"

    Dim r As Regex = New Regex("Sp[ace] [1-9]*")

    m = r.Match(text)
    While m.Success
      Console.WriteLine(m.Value)
      m = m.NextMatch()
    End While
    Console.ReadLine()
  End Sub

End Module

I have here a text string that consists of several variations of the title to an old TV show called Space 1999. I have also instantiated a Regex object with the regular expression that determines the strings I am looking for. Here is the result of running this example:

Spa 1999

Exciting, isn't it? The regular expression told the parser to look for any strings that matched the S, followed by the p, followed by one character from the brackets, followed by a space, followed by zero or more of the digits 1–9. Now, you're probably wondering, why didn't the parser spit back the actual text "Space 1999"?

When you put characters in brackets, it means match any one of those characters. Note that the brackets represent one character position in the string. Because I had no quantifier for the brackets, the default is to match exactly one character. So the parser needed S, p, exactly one of a, c, or e, and then a space. "Space 1999" and "Spac 1999" have more than one bracket character before the space, and "Sp 1999" has none, so only "Spa 1999" was a hit.

Change the regular expression to this:

"Sp[ace]? [1-9]*"

Now run the program and you should get the following results:

Spa 1999
Sp 1999

Why the two results? I had you add a ? quantifier to the bracketed text. Table 8-3 states that ? means find zero or one match. So the parser found one bracket character in "Spa 1999" and zero bracket characters in "Sp 1999". The matches come back in the order they appear in the text. The parser did no more than what you told it to do.

Change the regular expression to this:

"Sp[ace]+ [1-9]*"

All you are doing is swapping out the ? quantifier with a + quantifier. Here are the results:


Space 1999
Spac 1999
Spa 1999

The + quantifier tells the parser to find one or more matches. It found all three. Remember that this quantifier has to find at least one match to succeed. Now change the quantifier from + to * like this:

"Sp[ace]* [1-9]*"

This means find zero or more matches. Here are the results:

Space 1999
Spac 1999
Spa 1999
Sp 1999

The parser found everything. The "Sp 1999" answer is the result of finding zero matches. The other answers are the result of finding the "or more" matches.

The three quantifiers ?, *, and + are like wild cards. You need to be careful with them because they could take up more time than you think and force your computer to come to a grinding halt. The only one that is really safe here is ?, because it can never match more than one repeat. However, even this quantifier can return results you may not be looking for.

Exact Matching

If you have a situation where you are looking for an exact number of hits, use exact matching. Change the regular expression again to this:

"Sp[ace]{2} [1-9]*"

What you are doing here is telling the parser to find exactly two characters in a row, each of which can be any character in the brackets. Here is what it found:

Spac 1999

This does not tell you much. How about changing your search string to this:

"Space 1999 Spca 1999 Spac 1999 Spa 1999 Sp 1999"

You added an alternate spelling of "Spac" with "Spca". It uses the same two characters but swapped. Here is the result:


Spca 1999
Spac 1999

As you can see, the numerical quantifier does not care about the order of the characters in the brackets. It does not even care whether they differ; "Spaa 1999" would match too. It will ruthlessly hunt down any variation and tell you about it.
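If you would rather not edit and rerun the program for every quantifier, something like the following sketch runs them all in one go against the longer search string that includes "Spca". The class name QuantifierTour and the helper Find are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QuantifierTour
{
    // The search string that includes the swapped spelling "Spca".
    public const string Text =
        "Space 1999 Spca 1999 Spac 1999 Spa 1999 Sp 1999";

    // Hypothetical helper: every match of the pattern, joined with " | ".
    public static string Find(string pattern)
    {
        List<string> hits = new List<string>();
        for (Match m = Regex.Match(Text, pattern); m.Success; m = m.NextMatch())
            hits.Add(m.Value);
        return string.Join(" | ", hits.ToArray());
    }

    public static void Main()
    {
        // No quantifier, then ?, +, *, and the exact count {2}.
        string[] quantifiers = { "", "?", "+", "*", "{2}" };
        foreach (string q in quantifiers)
        {
            string pattern = "Sp[ace]" + q + " [1-9]*";
            Console.WriteLine("{0,-20} {1}", pattern, Find(pattern));
        }
    }
}
```

Seeing the five result lines stacked up makes it plain that the only moving part is how many bracket characters the quantifier lets through before the space.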

So, is this all there is to regular expressions? Not by a long shot. In fact, many articles and books are dedicated to the subject. I know a few people who pride themselves on inventing the most complicated-looking regular expressions you could imagine. I have a plan in mind for you regarding regular expressions, however, so I will show you only a little more.

Text Replacement

There are two more features of regular expressions you need to know about: search-and-replace and search-and-delete. They are actually the same thing, but you can treat them as different here. Search and delete is actually search and replace with an empty string.

The Regex class has several static functions. These static functions allow you to input a regular expression and get an answer without having to instantiate a Regex object first. In this case, you will be using the overloaded function called Regex.Replace.

Essentially, this static function creates a one-time use of a Regex class, uses it for the intended purpose, and then throws it away. Here is a simple replacement function using one of the overloaded versions of the Regex.Replace method.

    private static void Replace()
    {
      //Replace all instances of the word could with the word should
      string OrgString = "This could be done. It could be accomplished now. " +
                         "I couldn't get it done in time";
      string SearchPattern = "could ";
      string ReplacePattern = "should ";

      Console.WriteLine(OrgString + "\n\n");
      Console.WriteLine(Regex.Replace(OrgString, SearchPattern, ReplacePattern));

      Console.ReadLine();
    }

  Sub Replace()
    'Replace all instances of the word could with the word should
    Dim OrgString As String = "This could be done. " + _
                              "It could be accomplished now. " + _
                              "I couldn't get it done in time"
    Dim SearchPattern As String = "could "
    Dim ReplacePattern As String = "should "

    Console.WriteLine(OrgString + vbCrLf + vbCrLf)
    Console.WriteLine(Regex.Replace(OrgString, SearchPattern, ReplacePattern))

    Console.ReadLine()
  End Sub

This is about as simple as a replacement can get. I search for any instance of the string "could" and replace it with the string "should". I include the space in the search string to avoid the hit on the word "couldn't".

Suppose the first word of a sentence was capitalized? This replace expression would miss it. An easy way to fix this problem is to use the replace function twice.

    private static void Replace2()
    {
      //Replace all instances of the word could with the word should
      string OrgString      = "Could it be done? It could be done now.";

      Console.WriteLine(OrgString + "\n");
      OrgString = Regex.Replace(OrgString, "Could", "Should");
      Console.WriteLine(Regex.Replace(OrgString, "could", "should"));

      Console.ReadLine();
    }

  Sub Replace2()
    'Replace all instances of the word could with the word should
    Dim OrgString As String = "Could it be done? It could be done now."
    Console.WriteLine(OrgString + vbCrLf)
    OrgString = Regex.Replace(OrgString, "Could", "Should")
    Console.WriteLine(Regex.Replace(OrgString, "could", "should"))

    Console.ReadLine()
  End Sub

The output of this function is as follows:

Could it be done? It could be done now.
Should it be done? It should be done now.

This is simple text replacement. You can get quite a bit more complicated. If you want to know more about text replacement, I suggest the reams of information available on the Internet or in the online help.
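As one taste of "more complicated," here is a sketch that handles both capitalizations in a single pass. It uses the Regex.Replace overload that takes a MatchEvaluator delegate, plus the \b word boundary so that "couldn't" is left alone without relying on a trailing space. The class name SinglePass and the methods KeepCase and CouldToShould are hypothetical names of mine.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglePass
{
    // Called once per match; picks the replacement that keeps the
    // capitalization of the word that was found.
    private static string KeepCase(Match m)
    {
        return m.Value[0] == 'C' ? "Should" : "should";
    }

    // Hypothetical helper: replaces the whole word could/Could only.
    public static string CouldToShould(string text)
    {
        return Regex.Replace(text, @"\b[Cc]ould\b", new MatchEvaluator(KeepCase));
    }

    public static void Main()
    {
        string text = "Could it be done? It could be done now. I couldn't.";
        Console.WriteLine(text);
        Console.WriteLine(CouldToShould(text));
    }
}
```

Whether one pass with a delegate beats two plain passes is a matter of taste for a string this small. The delegate starts to pay off when the replacement depends on what was matched.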

So, why am I covering regular expressions? Validation.

Regular Expression Validation

Now you know a little about regular expressions. Take my word for it, this introduction only scratches the surface. Now what?

Remember the validation routines for TextBox input? A few of them that you have seen look for patterns of characters using conditional statements in code. What about replacing those statements with regular expressions? Here are some common things to validate for in text box input:

Accept only nonnumeric characters.
Accept only numeric characters.
Accept characters in a certain order.
Accept dates based on the culture setting.
Match a registration key that is entered in a specific format.
Allow US-style ZIP codes.
Allow only US-style phone numbers.
Allow international phone numbers.
Validate a URI.
Validate an IP address.
Accept passwords that must have at least six characters of which two are numbers.

Some of this stuff can be quite lengthy to validate using code. Much of it can be boiled down to a single line of code using a regular expression. Table 8-4 shows some common regular expressions and what they do.

Table 8-4: Common Data Validation Expressions
Expression	Meaning
[0-9]	Matches any single digit within a string
\d	Matches any single digit within a string
[^0-9]	Matches any single nonnumeric character within a string
\D	Matches any single nonnumeric character within a string
[A-Za-z]	Matches any uppercase or lowercase letter in a string
\d{5}(-\d{4})?	Matches U.S. 5-digit ZIP code or 5+4-digit ZIP code
1-[2-9]{1}\d{2}-\d{3}-\d{4}	Matches U.S.-style phone number (i.e., n-nnn-nnn-nnnn)
\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}	Matches IP address format (not actual addresses)
([\w-]+)@([\w-]+\.)+[A-Za-z]{2,3}	Matches common e-mail addresses

These are only a few of the things you can do with regular expressions. If you want to find a literal period, you need to escape it like this: \. The exception is inside a set of brackets, where you can use the period by itself. For ASCII text, the \w construct is the same as using [A-Za-z0-9_]. Notice that I used [\w-], which allows me to trap on any word character plus the dash.

In case you were wondering about the phone number expression, area codes cannot start with a 0 or 1. Therefore, I allow only 2 through 9 at the start of an area code. Note also that the phone number expression allows only one format. You can lengthen this regular expression considerably by allowing more formats such as a dash or a slash between numbers, or perhaps by making the area code optional.
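To give you an idea of what "lengthening" might look like, here is one possible sketch that accepts a dash or a slash between numbers and makes both the leading 1 and the area code optional. This is my own assumption about a reasonable format, not a standard. The names PhoneFormats and IsValidPhone are hypothetical, and the pattern happily accepts mixed separators.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PhoneFormats
{
    // (1[-/])?            optional leading 1 and a separator
    // ([2-9]\d{2}[-/])?   optional area code that cannot start with 0 or 1
    // \d{3}[-/]\d{4}      the seven-digit number itself
    private const string Pattern = @"^(1[-/])?([2-9]\d{2}[-/])?\d{3}[-/]\d{4}$";

    // Hypothetical helper: true if the string fits the looser format.
    public static bool IsValidPhone(string phone)
    {
        return Regex.IsMatch(phone, Pattern);
    }

    public static void Main()
    {
        string[] samples = { "1-800-555-1212", "800/555/1212", "555-1212", "1-055-555-1212" };
        foreach (string s in samples)
            Console.WriteLine("{0,-16} {1}", s, IsValidPhone(s));
    }
}
```

Every format you allow makes the expression longer and harder to read, so it is worth deciding which formats you really need before you start adding alternatives.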

There is one other thing to note here. If you are testing a whole string, it is best to anchor the regular expression at the beginning and at the end. Use a caret (^) as the first character in the expression and use a dollar sign ($) as the last. This forces the pattern to account for the entire string rather than just some substring of it.
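Here is a small sketch of what the anchors buy you. The class name AnchorDemo and its three helpers are made up for illustration. The last helper uses \z in place of $, because in .NET the dollar sign also matches just before a trailing newline.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorDemo
{
    // No anchors: true if digits appear anywhere in the string.
    public static bool ContainsDigits(string s)
    {
        return Regex.IsMatch(s, "[0-9]+");
    }

    // Anchored: the digits must account for the whole string.
    public static bool IsAllDigits(string s)
    {
        return Regex.IsMatch(s, "^[0-9]+$");
    }

    // \z matches only at the very end, so a trailing newline is rejected.
    public static bool IsAllDigitsStrict(string s)
    {
        return Regex.IsMatch(s, @"^[0-9]+\z");
    }

    public static void Main()
    {
        Console.WriteLine(ContainsDigits("abc123"));      // True
        Console.WriteLine(IsAllDigits("abc123"));         // False
        Console.WriteLine(IsAllDigits("123"));            // True
        Console.WriteLine(IsAllDigits("123\n"));          // True
        Console.WriteLine(IsAllDigitsStrict("123\n"));    // False
    }
}
```

For text typed into a single-line text box the trailing newline case is unlikely to come up, but it is good to know about if the input comes from a file.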

So, how do you use these regular expressions to validate something? Try these methods.

    //Matches string of consecutive numbers
    private static bool IsInteger(string number)
    {
      return(Regex.IsMatch(number, "^[+-]?[0-9]+$"));
    }

    //Matches string of consecutive letters
    private static bool IsAlpha(string str)
    {
      return(Regex.IsMatch(str, "^[A-Za-z]+$"));
    }

    //Checks for format of 5 or 5+4 zip code
    private static bool IsValidZip(string code)
    {
      return(Regex.IsMatch(code, "^\\d{5}(-\\d{4})?$"));
    }

    //Checks for format of most all email addresses
    private static bool IsValidEmail(string email)
    {
      return(Regex.IsMatch(email, "^([\\w-]+)@([\\w-]+\\.)+[A-Za-z]{2,3}$"));
    }

    //Checks for format of USA phone number
    private static bool IsValidPhone(string phone)
    {
      return(Regex.IsMatch(phone, "^1-[2-9]\\d{2}-\\d{3}-\\d{4}$"));
    }

    //Checks for format of USA date
    //separators = /-.
    //format = xx/xx/xxxx or xx/xx/xx
    //Month and day must be within correct calendar range
    //Year can be anything either 2 or 4 digits
    private static bool IsValidUSAdate(string dt)
    {
      return(Regex.IsMatch(dt, "^(0[1-9]|1[0-2])[./-]" +
                               "(0[1-9]|1[0-9]|2[0-9]|3[0-1])" +
                               "[./-](\\d{2}|\\d{4})$"));
    }

    //Checks for format of military time
    private static bool IsValidMilitaryTime(string tm)
    {
      return(Regex.IsMatch(tm, "^([0-1][0-9]|2[0-3]):[0-5][0-9]$"));
      // ([0-1][0-9]|2[0-3]) Check for 00-19 OR 20-23 as hours
      // [0-5][0-9]          Check for 00-59 as minutes
    }

    //Checks for format of password
    //format = 6-15 characters
    //         Must include 2 consecutive digits
    //         Must include at least one lowercase letter
    //         Must include at least one uppercase letter
    private static bool IsPasswordFormatValid(string Pword)
    {
      return(Regex.IsMatch(Pword,"^(?=.*\\d{2})(?=.*[a-z])(?=.*[A-Z]).{6,15}$"));
      // ?= means look ahead in the string for what follows
      // (?=.*\\d{2}) Starting at the beginning find zero or more of any
      //              character and then 2 consecutive digits in the string.
      // (?=.*[a-z])  Starting at the beginning find zero or more of any
      //              character and a lowercase letter somewhere in the string.
      // (?=.*[A-Z])  Starting at the beginning find zero or more of any
      //              character and an uppercase letter somewhere in the string.
      // .{6,15}      With all else being equal, There must be between 6 and 15
      //              characters in the string
    }

  'Matches string of consecutive numbers
  Function IsInteger(ByVal number As String) As Boolean
    Return (Regex.IsMatch(number, "^[+-]?[0-9]+$"))
  End Function

  'Matches string of consecutive letters
  Function IsAlpha(ByVal str As String) As Boolean
    Return (Regex.IsMatch(str, "^[A-Za-z]+$"))
  End Function

  'Checks for format of 5 or 5+4 zip code
  Function IsValidZip(ByVal code As String) As Boolean
    Return (Regex.IsMatch(code, "^\d{5}(-\d{4})?$"))
  End Function

  'Checks for format of most all email addresses
  Function IsValidEmail(ByVal email As String) As Boolean
    Return (Regex.IsMatch(email, "^([\w-]+)@([\w-]+\.)+[A-Za-z]{2,3}$"))
  End Function

  'Checks for format of USA phone number
  Function IsValidPhone(ByVal phone As String) As Boolean
    Return (Regex.IsMatch(phone, "^1-[2-9]\d{2}-\d{3}-\d{4}$"))
  End Function

  'Checks for format of USA date
  'separators = /-.
  'format = xx/xx/xxxx or xx/xx/xx
  'Month and day must be within correct calendar range
  'Year can be anything either 2 or 4 digits
  Function IsValidUSAdate(ByVal dt As String) As Boolean
    Return (Regex.IsMatch(dt, "^(0[1-9]|1[0-2])[./-]" + _
                              "(0[1-9]|1[0-9]|2[0-9]|3[0-1])" + _
                              "[./-](\d{2}|\d{4})$"))
  End Function

  'Checks for format of military time
  Function IsValidMilitaryTime(ByVal tm As String) As Boolean
    Return (Regex.IsMatch(tm, "^([0-1][0-9]|2[0-3]):[0-5][0-9]$"))
    ' ([0-1][0-9]|2[0-3]) Check for 00-19 OR 20-23 as hours
    ' [0-5][0-9]          Check for 00-59 as minutes
  End Function

  'Checks for format of password
  'format = 6-15 characters
  '         Must include 2 consecutive digits
  '         Must include at least one lowercase letter
  '         Must include at least one uppercase letter
  Function IsPasswordFormatValid(ByVal Pword As String) As Boolean
    Return (Regex.IsMatch(Pword, "^(?=.*\d{2})(?=.*[a-z])(?=.*[A-Z]).{6,15}$"))
    ' ?= means look ahead in the string for what follows
    ' (?=.*\d{2})  Starting at the beginning find zero or more of any
    '              character and then 2 consecutive digits in the string.
    ' (?=.*[a-z])  Starting at the beginning find zero or more of any
    '              character and a lowercase letter somewhere in the string.
    ' (?=.*[A-Z])  Starting at the beginning find zero or more of any
    '              character and an uppercase letter somewhere in the string.
    ' .{6,15}      With all else being equal, There must be between 6 and 15
    '              characters in the string
  End Function

Every single one of these methods works. Check them out. Try coding some of these regular expressions and see how much space they take up.

Probably the most obscure one here is the password checker. It uses a construct I have not explicitly covered. The regular expression parser has the capability to look forward in a string from its cursor point or look behind it. I am using the positive look-ahead search character set (?=pattern). The comments explain what the construct is doing. For further details on other subexpression patterns like this, I suggest you consult the online help.
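The key property of a look-ahead is that it checks the text without consuming any of it, which is why three of them can sit at the same starting position. The sketch below tries to make that visible. The class name LookAheadDemo and its helpers are hypothetical, and the sample passwords are invented.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookAheadDemo
{
    // Hypothetical helper: how many characters does the pattern consume?
    // Returns -1 if there is no match at all.
    public static int ConsumedLength(string text, string pattern)
    {
        Match m = Regex.Match(text, pattern);
        return m.Success ? m.Length : -1;
    }

    // The password expression from the validation methods.
    public static bool IsPasswordFormatValid(string pword)
    {
        return Regex.IsMatch(pword, @"^(?=.*\d{2})(?=.*[a-z])(?=.*[A-Z]).{6,15}$");
    }

    public static void Main()
    {
        // The look-ahead succeeds but the match is empty: nothing was consumed.
        Console.WriteLine(ConsumedLength("abcDEF12", @"^(?=.*\d{2})"));   // 0

        // Without the look-ahead, the same test eats the string up to the digits.
        Console.WriteLine(ConsumedLength("abcDEF12", @"^.*\d{2}"));       // 8

        Console.WriteLine(IsPasswordFormatValid("abcDEF12"));    // True
        Console.WriteLine(IsPasswordFormatValid("abcDEF1x2"));   // False, digits not consecutive
    }
}
```

Because each look-ahead leaves the cursor at the start of the string, the final .{6,15} still gets to measure the whole password.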

Some of these expressions can be tricky, but they have so much power to get you what you want. As a contrast, look at the following code. This is an alternate way of validating the password as opposed to the regular expression.

    private static bool LongWayPassword(string Pword)
    {
      //Check length first
      if(Pword.Length < 6 || Pword.Length > 15)
        return false;

      bool FoundUpper = false;
      bool FoundLower = false;
      bool FoundTwoDigits = false;  //two digits side by side
      bool PrevWasDigit = false;
      char[] chars = Pword.ToCharArray();
      foreach(char c in chars)
      {
        //look for at least one uppercase letter
        if(Char.IsUpper(c))
          FoundUpper = true;
        //look for at least one lowercase letter
        if(Char.IsLower(c))
          FoundLower = true;
        //look for two consecutive digits, as the regular expression does
        if(Char.IsDigit(c))
        {
          if(PrevWasDigit)
            FoundTwoDigits = true;
          PrevWasDigit = true;
        }
        else
          PrevWasDigit = false;
      }
      return FoundUpper && FoundLower && FoundTwoDigits;
    }

  Function LongWayPassword(ByVal Pword As String) As Boolean
    'Check length first
    If Pword.Length < 6 Or Pword.Length > 15 Then
      Return False
    End If

    Dim FoundUpper As Boolean = False
    Dim FoundLower As Boolean = False
    Dim FoundTwoDigits As Boolean = False  'two digits side by side
    Dim PrevWasDigit As Boolean = False

    Dim chars() As Char = Pword.ToCharArray()
    Dim c As Char
    For Each c In chars
      'look for at least one uppercase letter
      If Char.IsUpper(c) Then
        FoundUpper = True
      End If
      'look for at least one lowercase letter
      If Char.IsLower(c) Then
        FoundLower = True
      End If

      'look for two consecutive digits, as the regular expression does
      If Char.IsDigit(c) Then
        If PrevWasDigit Then
          FoundTwoDigits = True
        End If
        PrevWasDigit = True
      Else
        PrevWasDigit = False
      End If
    Next
    Return FoundUpper And FoundLower And FoundTwoDigits
  End Function

So, for the C# code I saved some 20 lines of code by using the regular expression. For the VB code I saved about 25 lines of code. I saved not only code but also bugs. As you know, every line of code entered is a potential bug.

Regular expressions are not only a powerful but also an efficient way to perform search missions, as you saw with the password example. Wouldn't it be nice to have a TextBox with a property that takes a regular expression to run the input against during validation? That's coming up toward the end of the chapter. For now, let's look at a very powerful control called the Masked Edit control.

^[1]I extracted this example from the online help.

^[2]This is not Star Trek … yet.

^[3]Did you guess that?