I l@ve RuBoard

3.15 Replacing Multiple Patterns in a Single Pass

Credit: Xavier Defrang

3.15.1 Problem

You need to perform several string substitutions on a string.

3.15.2 Solution

Sometimes regular expressions afford the fastest solution even in cases where their applicability is anything but obvious. In particular, the sub method of re objects makes regular expressions a good way to perform string substitutions. Here is how you can produce a result string from an input string where each occurrence of any key in a given dictionary is replaced by the corresponding value in the dictionary:

# requires Python 2.1 or later
from _ _future_ _ import nested_scopes

import re

# the simplest, lambda-based implementation
def multiple_replace(adict, text):
  # Create a regular expression from all of the dictionary keys
  regex = re.compile("|".join(map(re.escape, adict.keys(  ))))

  # For each match, look up the corresponding value in the dictionary
  return regex.sub(lambda match: adict[match.group(0)], text)

A more powerful and flexible approach is to wrap the dictionary into a callable object that directly supports the lookup and replacement idea, which you can use directly as the callback in the sub method. This object-oriented approach is more flexible because the callable object can keep its own state and therefore is easily extensible to other tasks. In Python 2.2 and later, you can create a class for this object by extending the dict built-in type, while in older Python versions you must fall back on UserDict.UserDict (built-in types were not subclassable in older versions). A try/except lets us easily write code that works optimally on both old and new versions of Python:

try: dict
except: from UserDict import UserDict as dict

class Xlator(dict):
    """ All-in-one multiple-string-substitution class """
    def _make_regex(self):
        """ Build re object based on the keys of the current dictionary """
        return re.compile("|".join(map(re.escape, self.keys(  ))))

    def _ _call_ _(self, match):
        """ Handler invoked for each regex match """
        return self[match.group(0)]

    def xlat(self, text):
        """ Translate text, returns the modified text. """
        return self._make_regex(  ).sub(self, text)

3.15.3 Discussion

This recipe shows how to use the Python standard re module to perform single-pass multiple-string substitution using a dictionary. Let's say you have a dictionary-based, one-to-one mapping between strings. The keys are the set of strings (or regular-expression patterns) you want to replace, and the corresponding values are the strings with which to replace them. You can perform the substitution by calling re.sub for each key/value pair in the dictionary, thus processing and creating a new copy of the whole text several times, but it is clearly better to do all the changes in a single pass, processing and creating a copy of the text only once. Fortunately, re.sub's callback facility makes this better approach quite easy.

First, we have to build a regular expression from the set of keys we want to match. Such a regular expression is a pattern of the form "a1|a2|...|an" and can easily be generated using a one-liner, as shown in the recipe. Then, instead of giving re.sub a replacement string, we call it with a callback argument. re.sub calls this object for each match, with a re.MatchObject as its only argument, and expects the replacement string as the call's result. In our case, the callback just has to look up the matched text in the dictionary and return the corresponding value.

The recipe has two implementations: one is lambda-based, and the other uses a callable, dictionary-like object. The second option is better if you want to perform additional processing on each match (e.g., build a histogram of the number of times each possible substitution is actually performed) or if you just dislike lambda. Another potential advantage of the class-based approach is performance. If you know that the translation dictionary is static, and you must apply the same translation to several input strings, you can move the _make_regex call from the xlat method, where it's currently done, to an _ _init_ _ method, to avoid repeatedly preparing and compiling the regular expression.

Here's a usage example for each half of this recipe. We would normally have it as a part of the same .py source file as the function and class shown in the recipe, so it is guarded by the traditional Python idiom that runs it if and only if the module is called as a main script:

if _ _name_ _ == "_ _main_ _":
    text = "Larry Wall is the creator of Perl"
    adict = {
      "Larry Wall" : "Guido van Rossum",
      "creator" : "Benevolent Dictator for Life",
      "Perl" : "Python",
    }

    print multiple_replace(adict, text)

    xlat = Xlator(adict)
    print xlat.xlat(text)

Substitutions such as those performed by this recipe are often intended to operate on entire words, rather than on arbitrary substrings. Regular expressions are good at picking up the beginnings and endings of words, thanks to the special sequence r'\b'. Thus, we can easily make a version of the Xlator class that is constrained to substitute only entire words:

class WordXlator(Xlator):
    """ An Xlator version to substitute only entire words """
    def _make_regex(self):
        return re.compile(
          r'\b'+r'\b|\b'.join(map(re.escape, self.keys(  )))+r'\b')

Note how much easier it is to customize Xlator than it would be to customize the multiple_replace function. Ease of customization by subclassing and overriding helps you avoid copy-and-paste coding, and this is another excellent reason to prefer object-oriented structures over simpler procedural ones. Of course, just because some functionality is packaged up as a class doesn't magically make it customizable in just the way you want. Customizability also takes some foresight in dividing the functionality into separately overridable methods that correspond to the right pieces of overall functionality. Fortunately, you don't have to get it right the first time; when code does not have the optimal internal structure for the task at hand (e.g., reuse by subclassing and selective overriding), you can and should refactor the code so that its internal structure serves your needs. Just make sure you have a suitable battery of tests ready to run to ensure that your refactoring hasn't broken anything, and then you can refactor to your heart's content. See http://www.refactoring.com for more information on the important art and practice of refactoring.

3.15.4 See Also

Documentation for the re module in the Library Reference; the Refactoring home page (http://www.refactoring.com).

I l@ve RuBoard