We're back after a server migration that caused effbot.org to fall over a bit harder than expected. Expect some glitches.

Sorting XML Records with ElementTree

Fredrik Lundh | December 2007

The ElementTree library makes it really easy to sort records in record-oriented XML files. Just locate the record container element, extract the record elements and the sort keys into a list, sort the list based on the keys, and then use slice assignment to update the tree.

Let’s look at an example. First, we need an XML file that has a record structure. Here’s a simple phonebook, which contains a number of name/number pairs inside entry elements.

<phonebook>
  <entries>
    <entry>
      <name>Ned</name>
      <number>555-8904</number>
    </entry>
    <entry>
      <name>John</name>
      <number>555-5782</number>
    </entry>
    <entry>
      <name>Julius</name>
      <number>555-3642</number>
    </entry>
  </entries>
</phonebook>

Here’s how to sort this file on phone numbers:

import xml.etree.ElementTree as ET

tree = ET.parse("data.xml")

# this element holds the phonebook entries
container = tree.find("entries")

data = []
for elem in container:
    key = elem.findtext("number")
    data.append((key, elem))

data.sort()

# insert the last item from each tuple
container[:] = [item[-1] for item in data]

tree.write("new-data.xml")

This is very efficient, since all you’re doing is moving references around. For normal files, the slowest part is reading and writing the files; sorting takes “no time at all” (and you can parse in no time at all too, by using cElementTree instead of ElementTree).

To sort on multiple keys (e.g. first on number, and then on name), you can use the fact that Python compares tuples item by item:

for elem in container:
    number = elem.findtext("number")
    name = elem.findtext("name")
    data.append((number, name, elem))

The above code uses the exact values in the file, falling back to None if a field is not present. In practice, you probably want to process the fields somewhat to get them sorted in the right way; e.g.

import re

for elem in container:

    # sort numbers by digit runs
    number = elem.findtext("number", "999-99999")
    number = tuple(map(int, re.findall("\d+", number)))

    # sort names by lower case
    name = elem.findtext("name", "").lower()

    data.append((number, name, elem))

To round off, let’s look at a first script again. In that script, we used an explicit for-in loop to extract the keys and the elements from the phonebook container element:

data = []
for elem in container:
    key = elem.findtext("number")
    data.append((key, elem))

data.sort()

container[:] = [item[-1] for item in data]

In recent versions of Python, you can replace the for-in loop with a call to the sorted function. To extract the keys, pass in a key callback:

def getkey(elem):
    return elem.findtext("number")

container[:] = sorted(container, key=getkey)

The sorted function calls the getkey helper once for each item in the container, then sorts the elements according to the key value, and finally returns the sorted elements as a list object. Here’s the resulting script:

import xml.etree.ElementTree as ET

tree = ET.parse("data.xml")

def getkey(elem):
    return elem.findtext("number")

container = tree.find("entries")

container[:] = sorted(container, key=getkey)

tree.write("new-data.xml")

To sort on multiple keys, simply return a tuple from the helper function.