ElementTree Tidy HTML Tree Builder

July 6, 2003 | Fredrik Lundh

The TidyHTMLTreeBuilder parser can read (almost) arbitrary HTML files, and turn them into well-formed element trees. This parser uses a library version of Dave Raggett’s HTML Tidy utility to fix any problems with the HTML before converting it to XHTML (the XML version of HTML).

Note: If you don’t want to (or cannot) install binary Python extensions, you can use the TidyTools module in the standard ElementTree distribution. That module uses the command-line version of Tidy, which is available for many different platforms.

This tree builder requires the _elementtidy extension, which is based on the tidylib library. Note that this extension is not included in the current elementtree releases, but you can download a separate elementtidy package from the effbot.org downloads site.

Usage

Loading HTML Files

To load an HTML file into an XHTML tree, import the TidyHTMLTreeBuilder module and call the parse method:

    from elementtidy import TidyHTMLTreeBuilder

    tree = TidyHTMLTreeBuilder.parse("myfile.htm")

Note: In the experimental alpha releases, the tree builder is installed in the elementtidy package. If you’re using a version shipped with the ElementTree library, import the module from the elementtree package instead.
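If the same script has to run against either layout, one option is to try both import locations in turn. The following is only a sketch: the helper names import_first and load_html are made up for this example, and the two module paths are simply the ones the note above describes.

```python
import importlib


def import_first(candidates):
    """Return the first module in `candidates` that can be imported."""
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError(
        "none of these modules could be imported: %s" % ", ".join(candidates)
    )


# Candidate locations, taken from the note above (not verified here).
TIDY_BUILDER_MODULES = [
    "elementtidy.TidyHTMLTreeBuilder",
    "elementtree.TidyHTMLTreeBuilder",
]


def load_html(path):
    """Parse `path` with whichever TidyHTMLTreeBuilder module is installed."""
    builder = import_first(TIDY_BUILDER_MODULES)
    return builder.parse(path)
```

Calling load_html("myfile.htm") would then behave like the parse call shown earlier, provided one of the two modules is installed.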

Converting XHTML to HTML

The ElementTree interfaces convert the HTML to the XML version of HTML, called XHTML. In this format, all HTML tags live in the {http://www.w3.org/1999/xhtml} namespace. The following code snippet shows how to ‘normalize’ the tree, turning it into standard HTML:

    XHTML = "{http://www.w3.org/1999/xhtml}"

    for elem in tree.getiterator():
        if elem.tag.startswith(XHTML):
            elem.tag = elem.tag[len(XHTML):]
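To try the same normalization without installing Tidy, here is a small self-contained sketch. It assumes the standard library's xml.etree.ElementTree as a stand-in for the tree builder (so it uses iter() rather than getiterator()), and the function name strip_xhtml_namespace and the sample document are invented for the example.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"


def strip_xhtml_namespace(root):
    """Remove the XHTML namespace prefix from every tag, in place."""
    for elem in root.iter():
        # Comments and processing instructions have non-string tags; skip them.
        if isinstance(elem.tag, str) and elem.tag.startswith(XHTML):
            elem.tag = elem.tag[len(XHTML):]
    return root


# A tiny XHTML document standing in for Tidy's output.
doc = ET.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Demo</title></head>"
    "<body><p>Hello</p></body></html>"
)
strip_xhtml_namespace(doc)
tags = [elem.tag for elem in doc.iter()]
print(tags)
```

After the call, the tags are plain names such as html and body, so the serialized output no longer carries a namespace declaration.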

Saving HTML Files

To save a plain HTML file, just write out the tree.
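A minimal sketch of that step, assuming the tree's tags have already been stripped of the XHTML namespace. It builds a small tree with the standard library's xml.etree.ElementTree instead of Tidy, and writes to a temporary directory so that it can be run anywhere; the file name outfile.htm is only an example.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Build a small namespace-free tree standing in for the normalized document.
html = ET.Element("html")
body = ET.SubElement(html, "body")
ET.SubElement(body, "p").text = "Hello"
tree = ET.ElementTree(html)

# Write the whole tree to a file in one call.
out_path = os.path.join(tempfile.mkdtemp(), "outfile.htm")
tree.write(out_path)

# Read it back to see what was saved.
with open(out_path) as f:
    saved = f.read()
print(saved)
```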


This works well, as long as the file doesn’t contain any embedded SCRIPT or STYLE tags.

If you want, you can add a DTD reference to the beginning of the file:

    file = open("outfile.htm", "w")
    file.write(DTD + "\n")  # DTD: a string holding your DOCTYPE declaration
    tree.write(file)
    file.close()
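For a version that runs on its own, the sketch below fills in the pieces the snippet leaves open. The DTD value (the HTML 4.01 Strict declaration) is just one possible choice, the helper name write_with_dtd is hypothetical, and the standard library's xml.etree.ElementTree stands in for the Tidy-built tree. It writes to an in-memory stream; a file opened in text mode would work the same way.

```python
import io
import xml.etree.ElementTree as ET

# Example DOCTYPE; pick whichever declaration matches your document.
DTD = (
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
    '"http://www.w3.org/TR/html4/strict.dtd">'
)


def write_with_dtd(tree, stream, dtd=DTD):
    """Write the DTD reference first, then the serialized tree."""
    stream.write(dtd + "\n")
    # encoding="unicode" makes write() emit text, matching a text-mode stream.
    tree.write(stream, encoding="unicode")


html = ET.Element("html")
ET.SubElement(html, "body")

buf = io.StringIO()
write_with_dtd(ET.ElementTree(html), buf)
print(buf.getvalue())
```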

Saving XHTML Files

If you save an XHTML file (where each tag lives in the XHTML namespace), the write method will add a namespace declaration to the html element, and place every tag in an explicit namespace. Some browsers can’t handle this, and may fail to render your document properly.
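The effect is easy to see on a small tree. This sketch uses the standard library's xml.etree.ElementTree rather than the tree builder itself, and the serialize helper is invented for the example. It also shows the default_namespace option of the standard library's write method, which is an assumption about your environment: it may not exist in the elementtree version you are using, and it raises an error if the tree contains unqualified names.

```python
import io
import xml.etree.ElementTree as ET

NS = "http://www.w3.org/1999/xhtml"

# Build a small tree whose tags all live in the XHTML namespace.
html = ET.Element("{%s}html" % NS)
body = ET.SubElement(html, "{%s}body" % NS)
ET.SubElement(body, "{%s}p" % NS).text = "Hello"
tree = ET.ElementTree(html)


def serialize(tree, **options):
    """Return the output of tree.write() as a string."""
    buf = io.StringIO()
    tree.write(buf, encoding="unicode", **options)
    return buf.getvalue()


# Default behaviour: every tag gets an explicit namespace prefix.
prefixed = serialize(tree)

# With a default namespace, tags are written without prefixes.
unprefixed = serialize(tree, default_namespace=NS)

print(prefixed)
print(unprefixed)
```

The first form is the one that some browsers may fail to render; the second keeps the namespace declaration on the html element but leaves the tag names unprefixed.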

