I l@ve RuBoard

12.11 Autodetecting XML Encoding

Credit: Paul Prescod

12.11.1 Problem

You have XML documents that may use a large variety of Unicode encodings, and you need to find out which encoding each document is using.

12.11.2 Solution

If we want complete generality, we need to code this task ourselves rather than rely on an existing package:

import codecs

""" Caller will hand this library a buffer and ask it to convert
it or autodetect the type. """

# None represents a potentially variable byte. "##" in the XML spec...
autodetect_dict={ # byte pattern            : "name"
                (0x00, 0x00, 0xFE, 0xFF) : ("ucs4_be"),
                (0xFF, 0xFE, 0x00, 0x00) : ("ucs4_le"),
                (0xFE, 0xFF, None, None) : ("utf_16_be"),
                (0xFF, 0xFE, None, None) : ("utf_16_le"),
                (0x00, 0x3C, 0x00, 0x3F) : ("utf_16_be"),
                (0x3C, 0x00, 0x3F, 0x00) : ("utf_16_le"),
                (0x3C, 0x3F, 0x78, 0x6D): ("utf_8"),
                (0x4C, 0x6F, 0xA7, 0x94): ("EBCDIC")
                 }

def autoDetectXMLEncoding(buffer):
    """ buffer -> encoding_name
    The buffer should be at least four bytes long.
    Returns "utf_8", the XML default, if no other encoding
    can be detected.
    Note that encoding_name might not have an installed
    decoder (e.g., EBCDIC)
    """
    # A more efficient implementation would not decode the whole
    # buffer at once, but then we'd have to decode a character at
    # a time looking for the quote character, and that's a pain

    encoding = "utf_8" # According to the XML spec, this is the default
                       # This code successively tries to refine the default:
                       # Whenever it fails to refine, it falls back to
                       # the last place encoding was set
    # bytearray() yields integer byte values whether buffer is a
    # Python 2 str or a Python 3 bytes object
    prefix = tuple(bytearray(buffer[0:4]))
    enc_info = autodetect_dict.get(prefix)

    if not enc_info: # Try autodetection again, removing potentially
                     # variable bytes
        prefix = prefix[0:2] + (None, None)
        enc_info = autodetect_dict.get(prefix)

    if enc_info:
        encoding = enc_info # We have a guess...these are
                            # the new defaults

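        # Try to find a more precise encoding using XML declaration
        try:
            secret_decoder_ring = codecs.lookup(encoding)[1]
        except LookupError:
            return encoding # No installed decoder (e.g., EBCDIC)
        # "replace" keeps a wrong first guess from raising on stray bytes
        decoded, length = secret_decoder_ring(buffer, "replace")
        # Drop a leading byte-order mark so the declaration test can match
        first_line = decoded.lstrip(u"\ufeff").split("\n", 1)[0]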
        if first_line and first_line.startswith(u"<?xml"):
            encoding_pos = first_line.find(u"encoding")
            if encoding_pos!=-1:
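                # Find whichever quote character opens the value, so a
                # later attribute using the other quote style is ignored
                quote_positions = [first_line.find(q, encoding_pos)
                                   for q in (u'"', u"'")]
                quote_positions = [p for p in quote_positions if p != -1]

                if quote_positions:
                    quote_pos = min(quote_positions)
                    quote_char = first_line[quote_pos]
                    rest = first_line[quote_pos+1:]
                    end_pos = rest.find(quote_char)
                    if end_pos != -1: # Ignore an unterminated value
                        encoding = rest[:end_pos]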

    return encoding

12.11.3 Discussion

The XML specification describes the outlines of an algorithm for detecting the Unicode encoding that an XML document uses. This recipe implements this algorithm and helps your XML processing programs find out which encoding is being used by a specific document.

The default encoding (unless we can determine another one specifically) must be UTF-8, as the XML specification requires. Certain byte patterns in the first four bytes of the text, or sometimes just the first two, let us identify a different encoding. For example, if the text starts with the 2 bytes 0xFF, 0xFE, we can be certain that this is a byte-order mark identifying the byte order as little-endian (low byte before high byte in each character) and the encoding as UTF-16 (or the 32-bits-per-character UCS-4 if the next 2 bytes in the text are 0, 0).
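The byte-order-mark test described here can also be sketched with the BOM constants in Python's codecs module. This is only an illustration, assuming Python 3 and a bytes input; the helper name sniff_bom is hypothetical and not part of the recipe, and the names it returns are Python codec names rather than the names used in the recipe's dictionary.

```python
import codecs

# Longer marks come first: the UTF-32 little-endian BOM begins with the
# same two bytes as the UTF-16 little-endian BOM.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf_32_be"),
    (codecs.BOM_UTF32_LE, "utf_32_le"),
    (codecs.BOM_UTF16_BE, "utf_16_be"),
    (codecs.BOM_UTF16_LE, "utf_16_le"),
]

def sniff_bom(data):
    """Return a codec name if data starts with a known BOM, else None.

    sniff_bom is a hypothetical helper for illustration only.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

The ordering of the list matters: testing the two-byte marks first would report a UCS-4 document as UTF-16.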

Once a byte pattern has given us a first guess, we must also examine the first line of the text: we decode the byte string into Unicode with the encoding determined so far and split it at the first line-end '\n' character. If the first line begins with u'<?xml', it's an XML declaration and may explicitly specify an encoding through its encoding attribute. The nested if statements in the recipe check for that, and if they find an encoding thus specified, the recipe returns it as the encoding it has determined. This step is crucial: any text starting with the single-byte, ASCII-like representation of the XML declaration, <?xml, would otherwise be identified as UTF-8, while its explicit encoding attribute may specify, for example, one of the ISO-8859 standard encodings.
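As an alternative to the nested string searches, the declaration check can be sketched with a regular expression. This swaps in a different technique from the one the recipe uses; the helper name declared_encoding and the pattern are assumptions made for illustration, and the input is assumed to be the already-decoded first line.

```python
import re

# Matches the encoding attribute inside an XML declaration.
# The backreference \1 requires the closing quote to match the opening one.
_DECL_ENCODING = re.compile(
    r"""^<\?xml[^>]*?\bencoding\s*=\s*(["'])([A-Za-z][A-Za-z0-9._-]*)\1"""
)

def declared_encoding(first_line):
    """Return the declared encoding name, or None if none is declared.

    declared_encoding is a hypothetical helper; first_line is assumed
    to be already-decoded text, as in the recipe.
    """
    match = _DECL_ENCODING.match(first_line)
    if match:
        return match.group(2)
    return None
```

Because the pattern insists on the same quote character at both ends of the value, a later attribute written with the other quote style cannot be mistaken for the encoding.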

This code detects a variety of encodings, including some that are not yet supported by Python's Unicode decoders. So the fact that you can decipher the encoding does not guarantee that you can decipher the document itself!
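Before trying to decode a document, a caller may want to check whether the detected name has a decoder at all. A minimal sketch, assuming only the standard codecs module; the helper name has_decoder is hypothetical:

```python
import codecs

def has_decoder(encoding_name):
    """Return True if Python can look up a codec for encoding_name.

    has_decoder is a hypothetical helper for illustration only.
    """
    try:
        codecs.lookup(encoding_name)
    except LookupError:
        # codecs.lookup raises LookupError for unknown encoding names
        return False
    return True
```

A caller could pass the detected name to has_decoder and fall back to reporting an unsupported encoding when it returns False.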

12.11.4 See Also

Recipe 3.18 and Recipe 3.19; Unicode is a huge topic, but a recommended book is Unicode: A Primer, by Tony Graham (Hungry Minds, Inc.); details are available at http://www.menteith.com/unicode/primer/.
