EffNews Part 4: Parsing More RSS Files

September 29, 2002 | Updated September 30, 2002 | Fredrik Lundh

This is the fourth article covering the effnews project, a simple RSS newsreader written in Python. Other articles in this series are available via this page.

Parsing RSS 2.0 Files With Undocumented Namespaces

In general, an implementation must be conservative in its sending behavior, and liberal in its receiving behavior. That is, it must be careful to send well-formed datagrams, but must accept any datagram that it can interpret (e.g., not object to technical errors where the meaning is still clear).

RFC 791, Internet Protocol, September 1981, Jon Postel (ed).

The RSS 2.0 standard (where RSS stands for “Really Simple Syndication”) is an attempt to upgrade the RSS 0.9x version of the format. It adds a number of new fields, designed with interactive RSS aggregators in mind, and also adds support for RSS extensions through custom namespaces.

As I’ve mentioned earlier, RSS 2.0 files come in several flavours. Some providers strictly adhere to the RSS 2.0 specification and produce feeds where all the core elements (rss, item, title, etc.) live in XML’s standard namespace. This is a good thing, since it allows us to parse them with the same parser as we used for 0.9x; even if the version attribute on the rss element says 2.0, the parser will see an undecorated item tag, and will call the start_item handler.

Unfortunately, some tools generate RSS 2.0 feeds with all elements moved into a namespace. What’s worse, the RSS 2.0 specification doesn’t mention this namespace at all, and provides very few clues as to how to deal with the presence or non-presence of namespaces on the core RSS elements.

For example, the scripting news feed contains the following declarations at the top:

<rss version="2.0"
    xmlns="http://backend.userland.com/rss2"
    xmlns:blogChannel="http://backend.userland.com/blogChannelModule">
  <channel>
    <title>Scripting News</title>
    <link>http://www.scripting.com/</link>
    ...

The xmlns="http://backend.userland.com/rss2" attribute provides a default namespace. This means that unless you specify otherwise, the rss element and all its children will be assumed to belong to the http://backend.userland.com/rss2 namespace. Since our code looks for elements that belong to the standard namespace, the parser won’t find a thing.

(For more about how namespaces really work, I recommend James Clark’s XML Namespaces tutorial.)
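To watch a default namespace change the element names, here's a minimal sketch. It assumes Python 3 and uses the standard library's expat parser rather than the parser used in this series; with namespace_separator set to a space, expat reports each qualified name as the namespace URI and the local name joined by a space. The element_names helper and the SAMPLE and PLAIN strings are made up for this illustration:

```python
import xml.parsers.expat

# A cut-down version of the declarations shown above.
SAMPLE = """<rss version="2.0"
    xmlns="http://backend.userland.com/rss2"
    xmlns:blogChannel="http://backend.userland.com/blogChannelModule">
  <channel>
    <title>Scripting News</title>
  </channel>
</rss>"""

# The same document without the default namespace declaration.
PLAIN = SAMPLE.replace('xmlns="http://backend.userland.com/rss2"', "")


def element_names(xml_text):
    """Return the name of every start tag, as reported by the parser."""
    names = []
    parser = xml.parsers.expat.ParserCreate(namespace_separator=" ")
    parser.StartElementHandler = lambda name, attrs: names.append(name)
    parser.Parse(xml_text, True)
    return names


print(element_names(PLAIN))   # undecorated names
print(element_names(SAMPLE))  # every name carries the namespace URI
```

Code that compares the reported name against a bare "rss" or "channel" will match the first document, but not the second.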

Ignoring all namespaces

The xmllib library provides a fallback mechanism that can be used to deal with unknown elements. When the parser finds an element for which there is no start handler, it calls the unknown_starttag method with the element tag and a dictionary containing the attributes. Likewise, when an element ends and there’s no end handler, the parser calls the unknown_endtag method.

To see this in action, you can add stub versions of these methods to the rss_parser class, and run it on an RSS 2.0 feed:

class rss_parser(xmllib.XMLParser):

    ...

    def unknown_starttag(self, tag, attrib):
        print "START", repr(tag)
        if attrib:
            print attrib

    def unknown_endtag(self, tag):
        print "END", repr(tag)

    ...

Running this on the scripting news feed results in something like:

START 'http://backend.userland.com/rss2 rss'
{'http://backend.userland.com/rss2 version': '2.0'}
START 'http://backend.userland.com/rss2 channel'
START 'http://backend.userland.com/rss2 title'
END 'http://backend.userland.com/rss2 title'
START 'http://backend.userland.com/rss2 link'
END 'http://backend.userland.com/rss2 link'
START 'http://backend.userland.com/rss2 description'
END 'http://backend.userland.com/rss2 description'
...
START 'http://backend.userland.com/rss2 item'
START 'http://backend.userland.com/rss2 description'
END 'http://backend.userland.com/rss2 description'
START 'http://backend.userland.com/rss2 pubDate'
END 'http://backend.userland.com/rss2 pubDate'
START 'http://backend.userland.com/rss2 guid'
END 'http://backend.userland.com/rss2 guid'
END 'http://backend.userland.com/rss2 item'
...

As you can see, the xmllib parser combines the namespace string with the tag name into a single string, using a single space to separate the two parts.

One easy way to deal with the RSS 2.0 confusion is to ignore all namespaces. In the unknown handlers, you can simply split the tag name into two parts, and use the last part (known as the local part) to select the right method. Something like this could work:

    def unknown_starttag(self, tag, attrib):
        try:
            namespace, tag = tag.split()
        except ValueError:
            pass # ignore this tag
        else:
            if tag == "rss":
                self.start_rss(attrib)
            elif tag == "channel":
                self.start_channel(attrib)
            # ... and so on for the remaining elements

    def unknown_endtag(self, tag):
        try:
            namespace, tag = tag.split()
        except ValueError:
            pass # ignore this tag
        else:
            if tag == "rss":
                self.end_rss()
            elif tag == "channel":
                self.end_channel()
            # ... and so on for the remaining elements

To simplify the code, you can reuse portions of xmllib’s existing tag dispatcher. To get the standard handler for a tag name, all you have to do is to look it up in the elements dictionary. This dictionary maps tag names to (start handler, end handler) tuples. By adding the following methods to the parser class, you get a parser that ignores the namespace for all elements:

    def unknown_starttag(self, tag, attrib):
        start, end = self._gethandlers(tag)
        if start:
            start(attrib)

    def unknown_endtag(self, tag):
        start, end = self._gethandlers(tag)
        if end:
            end()

    def _gethandlers(self, tag):
        try:
            namespace, tag = tag.split()
        except ValueError:
            pass # ignore this tag
        else:
            methods = self.elements.get(tag)
            if methods:
                return methods
        return None, None
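The same idea can be tried out on a current Python. The following is only a sketch under a few assumptions: it targets Python 3, it uses expat from the standard library instead of xmllib (which is not part of Python 3), and it looks handlers up with getattr rather than through an elements dictionary. The LocalNameParser class and its events list are invented for the example:

```python
import xml.parsers.expat


class LocalNameParser:
    """Dispatch on the local part of each element name, ignoring namespaces."""

    def __init__(self):
        self.events = []
        # with a space as separator, expat reports "namespace localname"
        self._parser = xml.parsers.expat.ParserCreate(namespace_separator=" ")
        self._parser.StartElementHandler = self._start
        self._parser.EndElementHandler = self._end

    def feed(self, text):
        self._parser.Parse(text, True)

    @staticmethod
    def _local(name):
        # the local part is whatever follows the last space (if any)
        return name.split(" ")[-1]

    def _start(self, name, attrs):
        handler = getattr(self, "start_" + self._local(name), None)
        if handler:
            handler(attrs)

    def _end(self, name):
        handler = getattr(self, "end_" + self._local(name), None)
        if handler:
            handler()

    # handlers for the elements we care about; everything else is skipped

    def start_channel(self, attrs):
        self.events.append("start channel")

    def end_channel(self):
        self.events.append("end channel")


NAMESPACED = (
    '<rss version="2.0" xmlns="http://backend.userland.com/rss2">'
    "<channel><title>Scripting News</title></channel></rss>"
)
PLAIN = '<rss version="2.0"><channel><title>Scripting News</title></channel></rss>'

for document in (NAMESPACED, PLAIN):
    parser = LocalNameParser()
    parser.feed(document)
    print(parser.events)
```

Both documents produce the same events, which is exactly what "ignoring all namespaces" means here.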

This is almost enough to read the scripting news feed, but if you try it out, you’ll find that the parser raises a ParseError exception (not a valid RSS 0.9x file). A little more digging reveals that this exception is raised by the start_channel method, if the rss_version attribute is not set:

    def start_rss(self, attr):
        self.rss_version = attr.get("version")

    def start_channel(self, attr):
        if self.rss_version is None:
            raise ParseError("not a valid RSS 0.9x file")
        self.current = {}
        self.channel = self.current

If you look at the output from the stub version, you’ll notice that the attribute dictionary contains something called “http://backend.userland.com/rss2 version” instead of the version attribute we expected.

This is actually a bug in some versions of xmllib; it applies the default namespace not only to unqualified element names, but also to unqualified attribute names. When dealing with more complex formats, this bug can really get in our way, but we’re ignoring namespaces anyway in this case, so we can simply look for any attribute that has the right local part:

    def start_rss(self, attr):
        self.rss_version = attr.get("version")
        if self.rss_version is None:
            # no undecorated version attribute.  as a work-around,
            # just look at the local part
            for k, v in attr.items():
                if k.endswith(" version"):
                    self.rss_version = v
                    break
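If more attributes need the same treatment, the work-around can be moved into a small helper. This is a sketch only; get_local_attr is a hypothetical name, and the code assumes that decorated attribute names use the "namespace localname" form shown in the stub output above:

```python
def get_local_attr(attrs, name, default=None):
    """Look up an attribute by name, falling back to its local part."""
    if name in attrs:
        # undecorated attribute, as a well-behaved parser reports it
        return attrs[name]
    suffix = " " + name
    for key, value in attrs.items():
        if key.endswith(suffix):
            # decorated attribute; ignore the namespace part
            return value
    return default


print(get_local_attr({"version": "0.91"}, "version"))
print(get_local_attr({"http://backend.userland.com/rss2 version": "2.0"}, "version"))
```

Like the loop in start_rss, this picks the first attribute with a matching local part, so it shares the same blind spot: it cannot tell two attributes apart if they only differ in namespace.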

With these changes in place, we can use effnews.py to read the scripting news feed.

Almost, that is.

Compared to the other feeds, it doesn’t look quite right. Instead of a list of nice title/description items in the content pane, we get something far less friendly:

Reuters: <a href="http://www.cnn.com/2002/WORLD/meast/09/28/turkey.uranium.reut/">
Turkey seizes weapons-grade uranium</a>.

<a href="http://doc.weblogs.com/discuss/msgReader$2489?mode=day">
Phil Wolff</a>: “What would you be willing to do as a journalist
to improve your chances of getting your story listed on Google’s
front page for a prime time hour?”

There are no titles, and it looks as if the feed generator is putting HTML source code in the description, instead of the plain text description other feeds are using.

Obviously, you need to add some way to filter out the HTML elements from the description field, and possibly some way to generate a title line based on other information in the feed. This is a nice topic for a later article…

Ignoring only the backend.userland.com namespace

A problem with the current namespace workaround is that we don’t really care what namespace an element is using; every item element is assumed to be an RSS item, every title element is assumed to be an RSS title, and so on. But the RSS 2.0 specification explicitly allows RSS providers to use custom namespaces to add extra information, and nothing prevents them from reusing local names already in use by the RSS 2.0 specification.

Ignoring all namespace information might work for the moment, but it’s clearly not a future-proof solution.

Luckily, all you have to do to solve this is to add a single line to the _gethandlers method:

    def _gethandlers(self, tag):
        try:
            namespace, tag = tag.split()
        except ValueError:
            pass # ignore this tag
        else:
            if namespace == "http://backend.userland.com/rss2":
                methods = self.elements.get(tag)
                if methods:
                    return methods
        return None, None

With this test in place, the parser will treat RSS 0.9x elements, RSS 2.0 elements without a namespace, and RSS 2.0 elements in the http://backend.userland.com/rss2 namespace as the same thing. All other elements will be ignored.
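To see what difference the namespace test makes, consider a feed where an extension module reuses the local name title. The sketch below assumes Python 3 and the standard expat parser (not xmllib); collect_titles, the ex prefix, and the http://example.com/extension namespace are all made up for the illustration:

```python
import xml.parsers.expat

RSS2_NAMESPACE = "http://backend.userland.com/rss2"

FEED = (
    '<rss version="2.0" xmlns="http://backend.userland.com/rss2"'
    ' xmlns:ex="http://example.com/extension">'
    "<channel><title>Scripting News</title>"
    "<item><ex:title>Not an RSS title</ex:title></item>"
    "</channel></rss>"
)


def collect_titles(xml_text, allowed=None):
    """Return the text of all title elements.

    If allowed is given, only titles whose namespace is in that
    collection are returned; otherwise the namespace is ignored.
    """
    titles = []
    buffer = None  # list of text chunks while inside a matching title

    def matches(name):
        namespace, _, local = name.rpartition(" ")
        if local != "title":
            return False
        return allowed is None or namespace in allowed

    def start(name, attrs):
        nonlocal buffer
        if matches(name):
            buffer = []

    def end(name):
        nonlocal buffer
        if matches(name) and buffer is not None:
            titles.append("".join(buffer))
            buffer = None

    def data(text):
        if buffer is not None:
            buffer.append(text)

    parser = xml.parsers.expat.ParserCreate(namespace_separator=" ")
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = data
    parser.Parse(xml_text, True)
    return titles


print(collect_titles(FEED))                    # namespace ignored
print(collect_titles(FEED, {RSS2_NAMESPACE}))  # core namespace only
```

Without the test, the extension element is mistaken for an RSS title; with it, only the channel title is picked up.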

Allowing arbitrary namespaces for the core elements

The RSS 2.0 specification/sample mismatch could in fact be interpreted to mean that RSS 2.0 allows producers to use an arbitrary namespace for the RSS 2.0 elements. If I want to use http://effbot.org/schema/rss2, who can stop me?

To deal with this case, you can look at the namespace for the toplevel rss element, and allow other elements to have that namespace. Something like this might work:

    rss_namespace = None

    def _gethandlers(self, tag):
        try:
            namespace, tag = tag.split()
            if tag == "rss" and not self.rss_namespace:
                self.rss_namespace = namespace
        except ValueError:
            pass # ignore this tag
        else:
            if namespace == self.rss_namespace:
                methods = self.elements.get(tag)
                if methods:
                    return methods
        return None, None
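Here's a stand-alone sketch of the same rule, for experimenting outside the parser class. It assumes Python 3 and the expat parser from the standard library; core_element_names, the ex prefix, and the http://example.com/extension namespace are hypothetical and exist only for this example:

```python
import xml.parsers.expat


def core_element_names(xml_text):
    """Return the local names of all elements that share the namespace
    of the rss element (whatever that namespace happens to be)."""
    rss_namespace = None
    names = []

    def start(name, attrs):
        nonlocal rss_namespace
        namespace, _, local = name.rpartition(" ")
        if rss_namespace is None and local == "rss":
            # remember the namespace used by the toplevel rss element
            rss_namespace = namespace
        if namespace == rss_namespace:
            names.append(local)

    parser = xml.parsers.expat.ParserCreate(namespace_separator=" ")
    parser.StartElementHandler = start
    parser.Parse(xml_text, True)
    return names


FEED = (
    '<rss version="2.0" xmlns="http://effbot.org/schema/rss2"'
    ' xmlns:ex="http://example.com/extension">'
    "<channel><title>effbot</title><ex:title>extra</ex:title></channel>"
    "</rss>"
)

print(core_element_names(FEED))
```

The element in the extension namespace is left out, even though the code has never heard of the namespace used for the core elements.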

To quote a leading XML expert, requiring people to implement things like this would be “silly indeed”, so let’s hope that the RSS 2.0 crowd sorts this one out some day, before feed providers start doing really silly things…

Parsing RSS 1.0 Files

While we’re at it, let’s look at the third version of the RSS format. In RSS 1.0, the RSS stands for “RDF Site Summary”, where RDF stands for “Resource Description Framework”. RDF is a building block in something called the Semantic Web, which is a research project that’s likely to impact your future life in pretty much the same way as AI research has done over the last 30-40 years. But I digress.

An RSS 1.0 file is cleverly designed to look like a valid RDF file to RDF tools, and like an RSS 0.91 file to (some) RSS tools. In practice, as a feed provider, this means that people can read your feed in dozens of different RSS viewers, and use it to draw mostly meaningless graphs consisting of circles and arrows. But I digress.

Here’s an excerpt from Mark Pilgrim’s RSS 1.0 feed (which contains the same data as his 2.0 feed that we used earlier):

<rdf:RDF
    xmlns="http://purl.org/rss/1.0/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    ...>
  <channel rdf:about="http://diveintomark.org/">
    <title>dive into mark</title>
    <link>http://diveintomark.org/</link>
    ...
  </channel>
  <item rdf:about="http://diveintomark.org/archives/2002/09/27.html#advanced_css_lists">
    <title>Advanced CSS lists</title>
    <description>Mark Newhouse: CSS Design: Taming Lists. ... </description>
    <link>http://diveintomark.org/archives/2002/09/27.html#advanced_css_lists</link>
    <dc:subject>CSS</dc:subject>
    <dc:date>2002-09-27T23:22:56-05:00</dc:date>
  </item>
  ...
</rdf:RDF>

This feed also uses a default namespace for all core elements. However, this is a documented namespace; all RSS 1.0 files are supposed to use this namespace. If it’s not there, it’s not an RSS 1.0 file.

Update 2002-11-16: Mark just told me that he’s no longer generating RSS 1.0 feeds, so the above link will take you to an RSS 2.0 feed.

Checking for multiple namespaces isn’t that much harder than checking for a single namespace. Here’s one way to do it:

# namespaces used for standard elements by different RSS formats
RSS_NAMESPACES = (
    "http://purl.org/rss/1.0/", # RSS 1.0
    "http://backend.userland.com/rss2", # RSS 2.0 (sometimes)
    )

class rss_parser(xmllib.XMLParser):

    ...

    def _gethandlers(self, tag):
        try:
            namespace, tag = tag.split()
        except ValueError:
            pass # ignore this tag
        else:
            if namespace in RSS_NAMESPACES:
                methods = self.elements.get(tag)
                if methods:
                    return methods
        return None, None

However, if you run this on an RSS 1.0 feed, you’ll get the same ParseError exception (not a valid RSS 0.9x file) as you got when tinkering with the 2.0 feeds, and for the same reason: the rss_version attribute is never set in the start_rss method.

If you look carefully at the RSS 1.0 sample, you’ll notice that there simply is no rss tag in the RSS 1.0 format. The root element is called RDF and lives in a www.w3.org namespace, so the start_rss handler will never be called.

There are several ways to fix this; the most obvious way is to look for the RDF start tag in the unknown_starttag handler, and set the rss_version attribute to something suitable. The downside is that if someone passes in an RDF file that doesn’t contain RSS 1.0 data, he’ll end up with an empty channel.

Another problem is that the effnews.py main application is using an end_rss handler to find out when we’re done parsing, so we have to change the parser interface as well.

And is it really a good idea to use the same code base for two radically different formats? Strictly speaking, RSS 1.0 files are RDF files, not XML files. Maybe we should use an RDF library to parse them, and extract the RSS information from the RDF data model? (This would also allow us to deal with feeds stored in alternative RDF representations.) But I digress.

To minimise the work, let’s settle for a compromise: we’ll keep the existing parser, and tweak it to generate the same events for an RSS 1.0 feed as it would generate for a corresponding RSS 0.9x or 2.0 feed. Turns out that this is really simple: just pretend that the RDF tag is really an rss tag without a version number, and check for some characteristic RSS 1.0 feature later on. The following example does the RDF-to-rss mapping in the _gethandlers method, and looks for the rdf:about attribute in the start_channel handler, if the version attribute wasn’t set by start_rss:

    def _gethandlers(self, tag):
        # check if the tag lives in a known RSS namespace
        if tag == "http://www.w3.org/1999/02/22-rdf-syntax-ns# RDF":
            # this appears to be an RDF file.  to simplify processing,
            # map this element to an "rss" element
            return self.elements.get("rss")
        try:
            namespace, tag = tag.split()
        except ValueError:
            pass # ignore
        else:
            if namespace in RSS_NAMESPACES:
                methods = self.elements.get(tag)
                if methods:
                    return methods
        return None, None

    ...

    def start_channel(self, attr):
        if self.rss_version is None:
            # no version attribute; it might still be an RSS 1.0 file.
            # check if this element has an rdf:about attribute
            if attr.get("http://www.w3.org/1999/02/22-rdf-syntax-ns# about"):
                self.rss_version = "1.0"
            else:
                raise ParseError("cannot read this RSS file")
        self.current = {}
        self.channel = self.current

Parsing RSS 0.9 Files

(Added September 30, 2002)

There’s actually one more RSS version out in the wild: the original RSS 0.9 format that Netscape created for their my.netscape.com portal. The portal still exists, but it hasn’t supported RSS feeds in a long time, and the RSS 0.9 specification is no longer available on the net. But some providers are still using this format.

Like 1.0, the RSS 0.9 format is based on RDF, but it uses a much simpler XML structure. Here’s an example:

<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns="http://my.netscape.com/rdf/simple/0.9/">
  <channel>
    <title>Slashdot</title>
    <link>http://slashdot.org/</link>
    <description>News for nerds, stuff that matters</description>
  </channel>
  ...
  <item>
    <title>Undelete In Linux</title>
    <link>http://slashdot.org/article.pl?sid=02/09/30/1233220</link>
  </item>
  ...

Just like RSS 1.0, this format uses a toplevel RDF tag, and all the other tags live in a namespace. But the rest of the file looks just like your usual 0.91 feed, with titles, links, and (optional) descriptions.

(The RDF connection was removed by Netscape in a later revision, RSS 0.91.)

To add support for this format, you need to add the RSS 0.9 namespace to the RSS_NAMESPACES list. You also need to set the rss_version variable somewhere; there’s no version attribute on the root element, and the channel element doesn’t contain an rdf:about attribute. The simplest solution is to look for the Netscape namespace in the _gethandlers method:

# namespaces used for standard elements by different RSS formats
RSS_NAMESPACES = (
    "http://my.netscape.com/rdf/simple/0.9/", # RSS 0.9
    "http://purl.org/rss/1.0/", # RSS 1.0
    "http://backend.userland.com/rss2", # RSS 2.0 (sometimes)
    )

class rss_parser(xmllib.XMLParser):

    ...

    def _gethandlers(self, tag):
        if tag == "http://my.netscape.com/rdf/simple/0.9/ channel":
            # this appears to be a my.netscape.com 0.9 file
            self.rss_version = "0.9"
        try:
            namespace, tag = tag.split()
        except ValueError:
            pass # ignore this tag
        else:
            if namespace in RSS_NAMESPACES:
                methods = self.elements.get(tag)
                if methods:
                    return methods
        return None, None
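The version checks for the different formats can also be tried on their own. The sketch below is an illustration, not part of effnews: it assumes Python 3 and the standard expat parser instead of xmllib, and sniff_rss_version is a made-up helper. It is slightly stricter than the parser above, since it only reports 1.0 when the channel element both lives in the RSS 1.0 namespace and carries an rdf:about attribute:

```python
import xml.parsers.expat

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RSS09_NS = "http://my.netscape.com/rdf/simple/0.9/"
RSS10_NS = "http://purl.org/rss/1.0/"


def sniff_rss_version(xml_text):
    """Return a version string such as "0.9", "1.0" or "2.0", or None."""
    version = None
    is_rdf = False

    def start(name, attrs):
        nonlocal version, is_rdf
        namespace, _, local = name.rpartition(" ")
        if local == "rss":
            # RSS 0.9x and 2.0: trust the version attribute
            version = version or attrs.get("version")
        elif namespace == RDF_NS and local == "RDF":
            is_rdf = True
        elif is_rdf and local == "channel" and version is None:
            if namespace == RSS09_NS:
                # my.netscape.com format: no version, no rdf:about
                version = "0.9"
            elif namespace == RSS10_NS and attrs.get(RDF_NS + " about"):
                version = "1.0"

    parser = xml.parsers.expat.ParserCreate(namespace_separator=" ")
    parser.StartElementHandler = start
    parser.Parse(xml_text, True)
    return version


RSS09 = (
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"'
    ' xmlns="http://my.netscape.com/rdf/simple/0.9/">'
    "<channel><title>Slashdot</title></channel></rdf:RDF>"
)
RSS10 = (
    '<rdf:RDF xmlns="http://purl.org/rss/1.0/"'
    ' xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">'
    '<channel rdf:about="http://diveintomark.org/">'
    "<title>dive into mark</title></channel></rdf:RDF>"
)
RSS20 = (
    '<rss version="2.0" xmlns="http://backend.userland.com/rss2">'
    "<channel><title>Scripting News</title></channel></rss>"
)

for document in (RSS09, RSS10, RSS20):
    print(sniff_rss_version(document))
```

An RDF file that contains neither kind of channel element gives None, which corresponds to the ParseError raised by the real parser.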

Putting It All Together

For your convenience, here’s the updated parser, with additions in bold type. Just drop it in over the one from the second article, and you’ll be able to read most 0.9, 1.0 and 2.0 feeds:

Example: a slightly improved RSS parser (File: rss_parser.py)

import xmllib

# namespaces used for standard elements by different RSS formats
RSS_NAMESPACES = (
    "http://my.netscape.com/rdf/simple/0.9/", # RSS 0.9
    "http://purl.org/rss/1.0/", # RSS 1.0
    "http://backend.userland.com/rss2", # RSS 2.0 (sometimes)
    )

class ParseError(Exception):
    pass

class rss_parser(xmllib.XMLParser):

    def __init__(self):
        xmllib.XMLParser.__init__(self)
        self.rss_version = None
        self.channel = None
        self.current = None
        self.data_tag = None
        self.data = None
        self.items = []

    def _gethandlers(self, tag):
        # check if the tag lives in a known RSS namespace
        if tag == "http://www.w3.org/1999/02/22-rdf-syntax-ns# RDF":
            # this appears to be an RDF file.  to simplify processing,
            # map this element to an "rss" element
            return self.elements.get("rss")
        if tag == "http://my.netscape.com/rdf/simple/0.9/ channel":
            # this appears to be a my.netscape.com 0.9 file
            self.rss_version = "0.9"
        try:
            namespace, tag = tag.split()
        except ValueError:
            pass # ignore this element
        else:
            if namespace in RSS_NAMESPACES:
                methods = self.elements.get(tag)
                if methods:
                    return methods
        return None, None

    def unknown_starttag(self, tag, attrib):
        start, end = self._gethandlers(tag)
        if start:
            start(attrib)

    def unknown_endtag(self, tag):
        start, end = self._gethandlers(tag)
        if end:
            end()

    # stuff to deal with text elements.

    def _start_data(self, tag):
        if self.current is None:
            raise ParseError("%s tag not in channel or item element" % tag)
        self.data_tag = tag
        self.data = ""

    def handle_data(self, data):
        if self.data is not None:
            self.data = self.data + data

    handle_cdata = handle_data

    def _end_data(self):
        if self.data_tag:
            self.current[self.data_tag] = self.data or ""

    # main rss structure

    def start_rss(self, attr):
        self.rss_version = attr.get("version")
        if self.rss_version is None:
            # no undecorated version attribute.  as a work-around,
            # just look at the local names
            for k, v in attr.items():
                if k.endswith(" version"):
                    self.rss_version = v
                    break

    def start_channel(self, attr):
        if self.rss_version is None:
            # no version attribute; it might still be an RSS 1.0 file.
            # check if this element has an rdf:about attribute
            if attr.get("http://www.w3.org/1999/02/22-rdf-syntax-ns# about"):
                self.rss_version = "1.0"
            else:
                raise ParseError("cannot read this RSS file")
        self.current = {}
        self.channel = self.current

    def start_item(self, attr):
        if self.rss_version is None:
            raise ParseError("cannot read this RSS file")
        self.current = {}
        self.items.append(self.current)

    # content elements

    def start_title(self, attr):
        self._start_data("title")
    end_title = _end_data

    def start_link(self, attr):
        self._start_data("link")
    end_link = _end_data

    def start_description(self, attr):
        self._start_data("description")
    end_description = _end_data