We're back after a server migration that caused effbot.org to fall over a bit harder than expected. Expect some glitches.

The htmlentitydefs module

This module contains a dictionary with lots of ISO Latin 1 character entities used by HTML.

Example: Using the htmlentitydefs module
# File: htmlentitydefs-example-1.py

import htmlentitydefs

entities = htmlentitydefs.entitydefs

for entity in "amp", "quot", "copy", "yen":
    print entity, "=", entities[entity]

$ python htmlentitydefs-example-1.py
amp = &
quot = "
copy = ©
yen = ¥

The following example shows how to combine regular expressions with this dictionary to translate entities in a string (the opposite of cgi.escape):

Example: Using the htmlentitydefs module to translate entities
# File: htmlentitydefs-example-2.py

import htmlentitydefs
import re
import cgi

pattern = re.compile("&(\w+?);")

def descape_entity(m, defs=htmlentitydefs.entitydefs):
    # callback: translate one entity to its ISO Latin value
    try:
        return defs[m.group(1)]
    except KeyError:
        return m.group(0) # use as is

def descape(string):
    return pattern.sub(descape_entity, string)

print descape("<spam&eggs>")
print descape(cgi.escape("<spam&eggs>"))

$ python htmlentitydefs-example-2.py
<spam&eggs>
<spam&eggs>

Finally, the following example shows how to use translate reserved XML characters and ISO Latin 1 characters to an XML string. This is similar to cgi.escape, but it also replaces non-ASCII characters.

Example: Escaping ISO Latin 1 entities
# File: htmlentitydefs-example-3.py

import htmlentitydefs
import re, string

# this pattern matches substrings of reserved and non-ASCII characters
pattern = re.compile(r"[&<>\"\x80-\xff]+")

# create character map
entity_map = {}

for i in range(256):
    entity_map[chr(i)] = "&#%d;" % i

for entity, char in htmlentitydefs.entitydefs.items():
    if entity_map.has_key(char):
        entity_map[char] = "&%s;" % entity

def escape_entity(m, get=entity_map.get):
    return string.join(map(get, m.group()), "")

def escape(string):
    return pattern.sub(escape_entity, string)

print escape("<spam&eggs>")
print escape("å i åa ä e ö")
$ python htmlentitydefs-example-3.py
&lt;spam&amp;eggs&gt;
&aring; i &aring;a &auml; e &ouml;