Using ElementTrees for Pull-Style Parsing

April 25, 2003 | Fredrik Lundh

Note: In recent versions of ElementTree, the iterparse interface provides a more convenient (and faster) way to do this. See The ElementTree iterparse Function for more information and examples.

Do you need all the data DOMmed at once? You may be able to keep just one DOM tree in memory at a time, dropping it and loading the next one every time you switch files.

An alternative is to use an incremental tree builder, and process interesting subtrees as they arrive.

Here’s an example, using the elementtree module:

 
import sys

from elementtree import ElementTree

class MyBuilder(ElementTree.TreeBuilder):

    def end(self, tag):
        # let the standard builder close the element, then inspect it
        elem = ElementTree.TreeBuilder.end(self, tag)
        if elem.tag == "SCENE":
            # process(elem) in some way, and write it out, e.g.
            # ElementTree.ElementTree(elem).write(sys.stdout)
            elem.clear() # we're done with it
        return elem # like the base class, return the closed element

parser = ElementTree.XMLTreeBuilder()
parser._target = MyBuilder() # plug in a custom builder!

tree = ElementTree.parse(filename, parser)

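The above example overrides the tree builder’s end method, looking for SCENE elements. Each scene is handled as soon as its end tag arrives, and is then cleared so that its contents can be released.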
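For readers without the standalone elementtree package, here is a minimal sketch of the same idea using the xml.etree.ElementTree module from the Python standard library. It is an adaptation, not the code used for the measurements in this article: the SceneBuilder class, the scene_titles list, and the tiny sample document are made up for illustration, and the builder is passed through the parser's target argument instead of being assigned to _target.

```python
import io
import xml.etree.ElementTree as ET


class SceneBuilder(ET.TreeBuilder):
    """Tree builder that handles each SCENE as soon as it is complete."""

    def __init__(self):
        super().__init__()
        self.scene_titles = []  # hypothetical: something to collect per scene

    def end(self, tag):
        elem = super().end(tag)  # the standard builder closes the element
        if elem.tag == "SCENE":
            # "process" the scene; here we only remember its title
            self.scene_titles.append(elem.findtext("TITLE"))
            elem.clear()  # we're done with it, drop its children and text
        return elem


# Tiny stand-in for a large file; PLAYS is an assumed wrapper element.
sample = (
    "<PLAYS>"
    "<SCENE><TITLE>SCENE I</TITLE><LINE>Who's there?</LINE></SCENE>"
    "<SCENE><TITLE>SCENE II</TITLE><LINE>Though yet of Hamlet</LINE></SCENE>"
    "</PLAYS>"
)

builder = SceneBuilder()
parser = ET.XMLParser(target=builder)  # plug in the custom builder
tree = ET.parse(io.StringIO(sample), parser)
```

After parsing, builder.scene_titles holds one title per scene, while the SCENE elements left in the tree are empty shells.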

I’ve tested this with a 10 megabyte XML file created by concatenating Jon Bosak’s Hamlet XML file over and over again, and wrapping it all in a single document element.

The resulting file contains 720 scenes (about 15k each, on average).
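A test file of this kind could be produced roughly as follows. This is only a sketch of the concatenate-and-wrap step described above, not the script actually used: the wrap_copies function and the PLAYS root tag are hypothetical names, and the prolog stripping assumes a simple XML declaration and a DOCTYPE without an internal subset.

```python
import re
import xml.etree.ElementTree as ET


def wrap_copies(document, copies, root_tag="PLAYS"):
    """Repeat one XML document `copies` times inside a single root element."""
    # Drop the XML declaration and a simple DOCTYPE, since neither may
    # appear in the middle of a document.
    body = re.sub(r"<\?xml[^>]*\?>|<!DOCTYPE[^>]*>", "", document).strip()
    repeated = "\n".join([body] * copies)
    return "<%s>\n%s\n</%s>" % (root_tag, repeated, root_tag)


# Small stand-in for the play file.
small = '<?xml version="1.0"?>\n<PLAY><SCENE><TITLE>SCENE I</TITLE></SCENE></PLAY>'
big = wrap_copies(small, 3)
root = ET.fromstring(big)  # check that the result is still well-formed
```

For a real run, the input would be read from the play file and the result written to disk; the number of copies decides the final size.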

The above script requires about 4.5 megabytes to run to completion, and about 2 minutes of processing time (on a really slow machine).

If I comment out the elem.clear() call, the script requires about 75 megabytes, and about 15 minutes (13 of which were spent on swapping; I ran the test on a machine with 96 megabytes of RAM and slow disks… ;-)
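For comparison, the iterparse interface mentioned in the note at the top expresses the same process-then-clear pattern without a custom builder. The following is a minimal sketch using xml.etree.ElementTree from the standard library; the sample data and the scene_count variable are made up for illustration, and no timings are claimed for it.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a large file; PLAYS is an assumed wrapper element.
sample = (
    b"<PLAYS>"
    b"<SCENE><TITLE>SCENE I</TITLE></SCENE>"
    b"<SCENE><TITLE>SCENE II</TITLE></SCENE>"
    b"</PLAYS>"
)

scene_count = 0
# "end" events fire when an element has been read completely
for event, elem in ET.iterparse(io.BytesIO(sample), events=("end",)):
    if elem.tag == "SCENE":
        scene_count += 1  # process(elem) in some way
        elem.clear()      # we're done with it
```

As with the builder version, the cleared SCENE elements stay attached to the root as empty shells, so memory use still grows slightly with the number of scenes.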

 
