The cElementTree Module
January 30, 2005 | Fredrik Lundh
The cElementTree module is a C implementation of the ElementTree API, optimized for fast parsing and low memory use. On typical documents, cElementTree is 15-20 times faster than the Python version of ElementTree, and uses 2-5 times less memory. On modern hardware, that means that documents in the 50-100 megabyte range can be manipulated in memory, and that documents in the 0-1 megabyte range load in zero time (0.0 seconds). This allows you to drastically simplify many kinds of XML applications.
Download and install
cElementTree is included with Python 2.5 and later, as xml.etree.cElementTree.
If you’re using Linux or BSD systems, check your favourite package repository for python-celementtree or py-celementtree packages. Note that some distributors have included cElementTree in their ElementTree package. Mac OS X users may want to check the Fink repository.
To install binary distributions from effbot.org, download and run the installer, and follow the instructions on the screen. If the installer cannot find your Python interpreter, see this page.
To install from sources, simply unpack the distribution archive, change to the distribution directory, and run the setup.py script as follows:
$ python setup.py install
See the README and CHANGES files for more on installation, licensing (BSD style), changes since the last version, etc.
cElementTree is designed to work with Python 2.1 and newer. The iterparse mechanism is currently only supported for Python 2.2 and later. Earlier Python versions are not supported (let me know if you need support for 2.0 or 1.5.2). For best performance, use Python 2.4.
Note: Mandriva Linux ships with broken Python configuration files, and cannot be used to build Python extensions that rely on distutils feature tests. For a workaround, see this thread.
The cElementTree module is designed to replace the ElementTree module from the standard elementtree package. In theory, you should be able to simply change:
from elementtree import ElementTree
to:
import cElementTree as ElementTree
in existing code, and run your programs without any problems (note that cElementTree is a module, not a package). (Let me know if you find that something you rely on doesn’t work as expected.)
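The import swap above can be made tolerant of interpreters where the C module is missing. The following is a minimal sketch, assuming the standard-library module paths rather than the standalone packages: xml.etree.cElementTree exists from Python 2.5 on, but was removed again in Python 3.9, where xml.etree.ElementTree picks up the C accelerator by itself. The sample document and the item_text name are made up for illustration.

```python
# Prefer the C implementation when it is present, otherwise fall back
# to the module that every current Python ships with.
try:
    import xml.etree.cElementTree as ElementTree  # Python 2.5 through 3.8
except ImportError:
    import xml.etree.ElementTree as ElementTree   # Python 3.9 and later

# Hypothetical sample document, just to show that the API is the same
# whichever import succeeded.
root = ElementTree.fromstring("<doc><item>spam</item></doc>")
item_text = root.find("item").text
print(item_text)
```

Code written against the ElementTree name keeps working whichever branch was taken, which is the point of the drop-in design.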
cElementTree contains one new function, iterparse, which works like parse, but lets you track changes to the tree while it is being built. You can also modify and remove elements during the parse, as in this example, which processes “record” elements as they arrive, and then removes their contents from the tree.
    import cElementTree

    for event, elem in cElementTree.iterparse(file):
        if elem.tag == "record":
            ... process record element ...
            elem.clear()
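For readers who want to try the pattern without the standalone module, here is a self-contained sketch. It assumes a current Python, where xml.etree.ElementTree provides the same iterparse function; the sample data, the process_records helper, and the choice to collect (id, text) pairs are all hypothetical stand-ins for the "process record element" step.

```python
import io
import xml.etree.ElementTree as ET  # assumption: stdlib stand-in for cElementTree

# Hypothetical sample data; a real application would pass a file or filename.
SAMPLE = b"<log><record id='1'>first</record><record id='2'>second</record></log>"

def process_records(source):
    """Collect (id, text) for each record, clearing each element once handled."""
    seen = []
    # By default iterparse reports only "end" events, so every element
    # handed to the loop body is complete.
    for event, elem in ET.iterparse(source):
        if elem.tag == "record":
            seen.append((elem.get("id"), elem.text))
            elem.clear()  # drop text, attributes and children to save memory
    return seen

records = process_records(io.BytesIO(SAMPLE))
print(records)
```

Note that the values are read before elem.clear() is called, since clearing also removes the element's attributes and text.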
For more information about the ElementTree module, see Elements and Element Trees.
For more information about the iterparse interface, see The ElementTree iterparse Function.
Older versions only support a small number of standard encodings. For a workaround, see Using Non-Standard Encodings in cElementTree.
Here are some benchmark figures, using a number of popular XML toolkits to parse a 3405k document-style XML file, from disk to memory:
library                        time      space    notes
xml.dom.minidom (Python 2.1)   6.3 s     80000k   (1)
gnosis.objectify               2.0 s     22000k   (5)
xml.dom.minidom (Python 2.4)   1.4 s     53000k   (1)
ElementTree 1.2                1.6 s     14500k
ElementTree 1.2.4/1.3          1.1 s     14500k
cDomlette (C extension)        0.540 s   20500k   (1)
PyRXPU (C extension)           0.175 s   10850k   (2)
lxml.etree (C extension)       (4)       (4)      (3)
libxml2 (C extension)          0.098 s   16000k   (3)
readlines (read as utf-8)      0.093 s   8850k
cElementTree (C extension)     0.047 s   4900k
readlines (read as ascii)      0.032 s   5050k
The figures may of course vary somewhat depending on Python version, compiler, and platform. The above was measured with Python 2.4, using prebuilt Windows installers (as published by the maintainers) for all C extensions. If you want further details about the tests, drop me a line.
Several other toolkits were tested, but failed to parse the test file (which uses both non-ASCII characters and namespaces). Toolkits that parse namespaces but don’t handle them properly are included, though (see notes 2 and 5, below).
For comparison, here are some benchmarks for event-based parsers (using the same file as above, and enough dummy handlers to be able to handle complete elements and their character data contents):
library                   time      throughput
xml.sax (Python 2.1)      0.330 s   10300 k/s
xml.sax (Python 2.4)      0.292 s   11700 k/s
xml.parsers.expat         0.184 s   18500 k/s
cElementTree XMLParser    0.124 s   27500 k/s
sgmlop                    0.092 s   37000 k/s
cElementTree iterparse    0.071 s   48000 k/s
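The handlers used for the event-based figures are not shown here. As a rough illustration of what "dummy handlers" for complete elements and character data might look like, the sketch below uses the target interface of XMLParser in today's standard-library xml.etree.ElementTree. The CountingTarget class and the sample document are hypothetical, and this is not the benchmark code itself.

```python
import xml.etree.ElementTree as ET

class CountingTarget:
    """Hypothetical dummy handler: counts elements and character data."""

    def __init__(self):
        self.elements = 0
        self.chars = 0

    def start(self, tag, attrib):
        pass  # called for every opening tag; nothing to do in this sketch

    def end(self, tag):
        self.elements += 1  # one complete element seen

    def data(self, text):
        self.chars += len(text)  # may be called several times per text node

    def close(self):
        return self.elements, self.chars

# The parser forwards events to the target instead of building a tree.
parser = ET.XMLParser(target=CountingTarget())
parser.feed("<doc><a>one</a><b>two</b></doc>")
elements, chars = parser.close()  # returns whatever the target's close() returns
print(elements, chars)
```

Because no tree is built, memory use stays flat regardless of document size, which is why event-based parsers are listed by throughput rather than space.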
Note 1) For these toolkits, the looping variant of my benchmark behaves very badly, resulting in unexpected memory growth and wildly varying parsing times (typically 150-300% of the values in the table). Strategic use of forced garbage collection (gc.collect()) will usually make things better. Be careful.
Note 2) Even with namespace handling enabled, PyRXPU returns namespace prefixes instead of namespace URIs, which makes it pretty much useless for namespace-aware XML processing. I’ve included it anyway, since it’s often put forth as the fastest XML parser you can get for Python.
Note 3) Tests on other platforms indicate that libxml2 is closer to cElementTree than this benchmark indicates. This is most likely a compiler-related issue (I’m using “official” Windows binaries for this benchmark, but so will most other users).
Note 4) There are no Windows binaries for lxml.etree yet, but it uses libxml2’s parser and object model, so the timings for this test should be very close to those for libxml2.
Note 5) An undocumented function (config_nspace_sep) must be called to enable namespace parsing. With that in place, the library parses the file without problems, but the resulting data structure depends on the namespace prefixes used in the file, rather than the namespace URIs (also see note 2).
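Note 1 mentions forced garbage collection between runs of a looping benchmark. The sketch below shows one way such a loop could be written, assuming a current Python and the standard-library xml.etree.ElementTree; the time_parse helper, the repeat count, and the generated sample document are hypothetical, and this is not the script that produced the tables above.

```python
import gc
import io
import time
import xml.etree.ElementTree as ET

def time_parse(data, repeat=5):
    """Return the best wall-clock time in seconds over `repeat` parses."""
    best = None
    for _ in range(repeat):
        gc.collect()  # clear leftovers from the previous run before timing
        start = time.perf_counter()
        tree = ET.parse(io.BytesIO(data))
        elapsed = time.perf_counter() - start
        del tree  # release the tree so it does not inflate the next run
        if best is None or elapsed < best:
            best = elapsed
    return best

# Hypothetical sample input; a real benchmark would read a file from disk.
SAMPLE = b"<doc>" + b"<item>x</item>" * 1000 + b"</doc>"
best = time_parse(SAMPLE)
print("best of 5: %.6f s" % best)
```

Taking the best of several runs, rather than the mean, is a design choice that reduces the influence of unrelated system activity on the reported figure.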