
Implementing Pull-Style XML Parsers

May 4, 2000 | Fredrik Lundh

This xml-dev posting inspired Paul Prescod to implement the xml.dom.pulldom module, which was later added to Python’s standard library. A similar technique is used in ElementTree’s iterparse API.
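
For readers who want to see what those two APIs look like in use, here is a minimal sketch using the standard library's `xml.dom.pulldom` and `xml.etree.ElementTree.iterparse`. The sample document and the names `XML`, `items`, and `texts` are made up for illustration; they are not part of the original posting.

```python
import io
from xml.dom import pulldom
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for this illustration.
XML = b"<doc><item id='1'>hello</item><item id='2'>world</item></doc>"

# pulldom: the caller pulls (event, node) pairs from the event stream.
items = []
for event, node in pulldom.parse(io.BytesIO(XML)):
    if event == pulldom.START_ELEMENT and node.tagName == "item":
        items.append(node.getAttribute("id"))

# iterparse: the caller pulls (event, element) pairs while the tree is built.
texts = []
for event, elem in ET.iterparse(io.BytesIO(XML), events=("end",)):
    if elem.tag == "item":
        texts.append(elem.text)
        elem.clear()  # drop the element's contents once it has been handled

print(items)
print(texts)
```

In both cases the application drives the loop and asks for the next event, instead of registering callbacks that the parser invokes.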

Q. I’m not sure how we would actually implement this. The only XML parser we have that supports a pull-style interface is RXP, and I’m not sure if we can convert the other interfaces to pull-style interfaces in a sensible way (at least not on a level as low as SAX) without storing the entire sequence of events.

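Assuming that a pull-style parser is what I think it is, here’s how to convert any incremental parser (xmllib, sgmlop, expat, etc.) to a pull-style parser: let the parser callbacks append events to a queue, and feed the parser another chunk of data only when the queue is empty. That way only the events produced by the most recent chunk are stored, not the entire sequence: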

import xmllib

START, DATA, END = "start", "data", "end"

class XMLPuller(xmllib.XMLParser):

    def __init__(self, stream):
        xmllib.XMLParser.__init__(self)
        self.__stream = stream
        self.__tokens = []

    def get(self):
        while not self.__tokens:
            data = self.__stream.read(10000)
            if not data:
                self.close()
                break
            self.feed(data)
        if self.__tokens:
            return self.__tokens.pop(0)
        return None # end of stream

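    # each callback queues one event as a tuple; list.append takes
    # exactly one argument, so the tuple needs its own parentheses

    def unknown_starttag(self, tag, attr):
        self.__tokens.append((START, tag, attr))

    def handle_data(self, data):
        self.__tokens.append((DATA, data))

    def unknown_endtag(self, tag):
        self.__tokens.append((END, tag))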

puller = XMLPuller(open("myfile.xml"))

while 1:
    token = puller.get() # avoid shadowing the next() builtin
    if token is None:
        break # end of stream
    print(token)
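
The xmllib module is no longer available in Python 3, so the following is a rough sketch of the same idea on top of expat (`xml.parsers.expat`), one of the incremental parsers mentioned above. The class name `ExpatPuller`, the `chunk_size` parameter, and the sample document are assumptions made for this illustration, not part of the original posting.

```python
import io
from collections import deque
from xml.parsers import expat

START, DATA, END = "start", "data", "end"

class ExpatPuller:
    """Pull-style wrapper around expat's callback-based parser (a sketch)."""

    def __init__(self, stream, chunk_size=10000):
        self._stream = stream          # binary file-like object
        self._chunk_size = chunk_size  # hypothetical tuning knob
        self._tokens = deque()         # events produced but not yet consumed
        self._done = False
        self._parser = expat.ParserCreate()
        self._parser.StartElementHandler = self._start
        self._parser.CharacterDataHandler = self._data
        self._parser.EndElementHandler = self._end

    # Callbacks: queue one event tuple per parser event.
    def _start(self, tag, attrs):
        self._tokens.append((START, tag, attrs))

    def _data(self, data):
        self._tokens.append((DATA, data))

    def _end(self, tag):
        self._tokens.append((END, tag))

    def get(self):
        # Feed the parser only until at least one event is queued.
        while not self._tokens and not self._done:
            data = self._stream.read(self._chunk_size)
            if data:
                self._parser.Parse(data, False)
            else:
                self._parser.Parse(b"", True)  # signal end of input
                self._done = True
        if self._tokens:
            return self._tokens.popleft()
        return None  # end of stream

# Usage with a small in-memory document (hypothetical sample data).
puller = ExpatPuller(io.BytesIO(b"<doc><item id='1'>hello</item></doc>"))
while True:
    token = puller.get()
    if token is None:
        break
    print(token)
```

A deque is used here instead of a list so that removing the oldest event stays cheap; apart from that, the structure mirrors the xmllib version. Note that expat may deliver a single run of character data as several DATA events, for example when it straddles a chunk boundary.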
