
EffNews Part 2: Fetching and Parsing RSS Data

September 10, 2002 | Fredrik Lundh

This is the second article covering the effnews project, a simple RSS newsreader written in Python. Other articles in this series are available via this page.

Intermission: Did Anyone Spot The Error Message?

As some of you may have noticed, if you add the last code snippet from the previous article to the test program, a couple of strange-looking lines of text appear among the ok/failed messages:

online.effbot.org done
www.bbc.co.uk done
www.example.com failed
error: uncaptured python exception, closing channel <async_http
connected at 8eb07c> (exceptions.AttributeError:file_consumer
instance has no attribute 'file' [C:\py21\lib\asyncore.py|poll|95]
[C:\py21\lib\asyncore.py|handle_read_event|383]
[http_client.py|handle_read|77] [my-test-program.py|feed|15])
www.scripting.com done

(Directory names and line numbers may vary.)

The error: uncaptured python exception message is generated by asyncore's default error handler when a callback raises a Python exception. This message is actually a compact rendition of a standard Python traceback, printed on a single line. Here's the deciphered version:

www.bbc.co.uk done
www.example.com
Traceback (most recent call last):
  File C:\py21\lib\asyncore.py, line 95, in poll:
  File C:\py21\lib\asyncore.py, line 383, in handle_read_event:
  File http_client.py, line 77, in handle_read:
  File my-test-program.py, line 15, in feed:
AttributeError:file_consumer instance has no attribute 'file'
online.effbot.org done
www.scripting.com done

So what’s causing this error?

Note that the AttributeError occurs in the feed method, which appears to be called despite the fact that the consumer closed the socket in the http_header method.

The http_client code is supposed to deal with this, by checking the connected flag attribute after calling the consumer's http_header method. That flag was cleared by the close method in earlier versions of asyncore, but this behaviour changed somewhere on the way from Python 1.5.2 to Python 2.1.
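
In other words, http_client contains something like the following right after the call to the consumer's http_header method (a simplified sketch, not the exact code from the first article):

self.consumer.http_header(self)

if not self.connected:
    # the consumer called close(); stop processing this response.
    # under Python 2.1, asyncore's close no longer clears the
    # connected flag, so this test never triggers -- which is
    # the bug we're seeing above.
    return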

(And the reason I didn’t notice was sloppy testing: my test script contained enough debugging print statements to make me miss the error message. Sorry for that.)

Closing the Channel From the Consumer, Revisited

The obvious workaround is of course to explicitly clear the connected attribute in the consumer's http_header method:

class file_consumer:
    def http_header(self, client):
        if (client.status[1] != "200" or
            client.header["content-type"] != "text/xml"):
            print client.host, "failed"
            client.close() # bail out
            client.connected = 0
            return
        self.host = client.host
        self.file = None
    ...

However, the connected flag is undocumented, and may (in theory) disappear in future versions of asyncore.

To make your code more future-proof, it's better to use a return value or an exception to indicate that the channel should be closed.

The following example uses a custom CloseConnection exception for this purpose:

class file_consumer:
    def http_header(self, client):
        if (client.status[1] != "200" or
            client.header["content-type"] != "text/xml"):
            print client.host, "failed"
            raise http_client.CloseConnection
        self.host = client.host
        self.file = None

Here are the necessary additions to the http_client module:

class CloseConnection(Exception):
    pass

...

try:
    self.consumer.http_header(self)
except CloseConnection:
    self.close()
    return
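
The return-value alternative works the same way; one possible convention (just a sketch, not part of the http_client module as developed in these articles) is to let http_header return a true value when the channel should be closed:

class file_consumer:
    def http_header(self, client):
        if (client.status[1] != "200" or
            client.header["content-type"] != "text/xml"):
            print client.host, "failed"
            return 1 # non-zero return value means "close the channel"
        self.host = client.host
        self.file = None

with the corresponding check in http_client, where the consumer is called:

if self.consumer.http_header(self):
    self.close()
    return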

Overriding Asyncore’s Error Handling

The error message is printed by a method called handle_error. To change the look of the error message, you can override this in your dispatcher subclass. For example, here’s a version that prints a traditional traceback:

import traceback

class my_channel(asyncore.dispatcher_with_send):
    ...
    def handle_error(self):
        traceback.print_exc()
        self.close()
    ...

With the above lines added to the async_http class, you’ll get the following message instead:

www.bbc.co.uk done
www.example.com failed
Traceback (most recent call last):
  File "C:\py21\lib\asyncore.py", line 95, in poll
    obj.handle_read_event()
  File "C:\py21\lib\asyncore.py", line 383, in handle_read_event
    self.handle_read()
  File "http_client.py", line 77, in handle_read
    self.consumer.feed(data)
  File "my-test-program.py", line 15, in feed
    if self.file is None:
AttributeError: file_consumer instance has no attribute 'file'
online.effbot.org done
www.scripting.com done

Parsing RSS Files

As shown in the first article, an RSS file contains summary information about a (portion of a) site, including a list of current news items.

For both the channel itself and the items, the RSS file can contain a title, a link to an HTML page, and a description field:

<rss version="0.91">
  <channel>
    <title>the eff-bot online</title>
    <link>http://online.effbot.org</link>
    <description>Fredrik Lundh's clipbook.</description>
    <language>en-us</language>
    ...
    <item>
      <title>spam, spam, spam</title>
      <link>http://online.effbot.org#85292735</link>
      <description>for the first seven months of 2002, the spam
      filters watching fredrik@pythonware.com has</description>
    </item>
    ...
  </channel>
</rss>

Note that the item elements are stored as child elements of the channel element. Both the channel element and the individual item elements may contain additional subelements, including the language element present in this example. We'll look at some of these in a later article; for now, we're only interested in the three basic elements.

XML Parsers

To parse an XML-based format like RSS, you need an XML parser. Python provides several ways to parse XML data, including the standard xmllib module, which is a simple event-driven XML parser; the pyexpat parser and other components provided in the standard xml package; the PyXML extension library; and many others.

For the first version of the RSS parser, we'll use the xmllib parser. You can plug in another parser if you need more features or better performance (and as you'll see, chances are that you'll need more, or at least different, features; more on this in a later article).

The xmllib parser works pretty much like the asyncore dispatcher; the module provides a parser base class that processes incoming data, and calls methods for different “XML events”. To handle the events, you should subclass the parser class, and implement methods for the events you need to deal with.

For the RSS parser, you need to implement the following methods:

  • start_TAG is called when the start tag (<TAG …>) for an element called TAG is found. The handler is called with a single argument, which is a dictionary containing the element attributes, if any.

  • end_TAG is called when the end tag (</TAG>) for an element called TAG is found.

  • handle_data is called for text between the elements (so-called character data). This handler is called with a single argument, a string containing the text. This method may be called more than once for any given character data segment.

For example, when parsing this XML fragment…

"<title>Odds &amp; Ends</title>\n"

…the xmllib parser will call the following methods:

self.start_title({})
self.handle_data("Odds ")
self.handle_data("&")
self.handle_data(" Ends")
self.end_title()
self.handle_data("\n")

Note that standard XML character entities like &amp; are decoded by the parser, and are passed to the handle_data method as ordinary character data.

If start or end handlers are missing for elements that appear in the XML document, the corresponding start or end tags are silently ignored by the parser (but character data inside the element is still passed to handle_data).

Here’s a minimal test program that implements a character data handler, and start and end tag handlers for the three RSS elements we’re interested in:

import xmllib

class rss_parser(xmllib.XMLParser):

    data = ""

    def start_title(self, attr):
        self.data = ""

    def end_title(self):
        print "TITLE", repr(self.data)

    def start_link(self, attr):
        self.data = ""

    def end_link(self):
        print "LINK", repr(self.data)

    def start_description(self, attr):
        self.data = ""

    def end_description(self):
        print "DESCRIPTION", repr(self.data)

    def handle_data(self, data):
        self.data = self.data + data

import sys

file = open(sys.argv[1])

parser = rss_parser()
parser.feed(file.read())
parser.close()

Note that the start methods set the data member to an empty string, the handle_data method adds text to that string, and the end handlers print out the string.

Also note that you pass the raw RSS data to the parser's feed method, and call the close method when you're done.

Here’s some sample output from this script (using the BBC newsfeed we downloaded earlier):

$ python rss-test.py www.bbc.co.uk.rss
TITLE 'BBC News | Front Page'
LINK 'http://news.bbc.co.uk/go/rss/-/1/hi/default.stm'
DESCRIPTION 'Updated every minute of every day'
TITLE 'BBC News Online'
LINK 'http://news.bbc.co.uk'
TITLE 'Blair and Bush talk tough on Iraq\r\n'
LINK 'http://news.bbc.co.uk/go/rss/-/1/hi/world/middle_east/2243684.stm
DESCRIPTION 'British PM Tony Blair says he has a "shared strategy" ...
TITLE "Al-Qaeda 'plotted nuclear attacks'"
LINK 'http://news.bbc.co.uk/go/rss/-/1/hi/world/middle_east/2244146.stm
DESCRIPTION 'Two alleged masterminds of the 11 September attacks ...
TITLE "Rix: 'Scum' will profit from Tube"
LINK 'http://news.bbc.co.uk/go/rss/-/1/hi/uk_politics/2244076.stm'
DESCRIPTION 'Train drivers\' union leader Mick Rix says profits ...
TITLE 'Ex-arms inspector defends Baghdad'
LINK 'http://news.bbc.co.uk/go/rss/-/1/hi/world/middle_east/2243627.stm
DESCRIPTION 'Scott Ritter\xb8 once head of UN inspectors in Iraq\xb8 ...
TITLE 'Police warning as flash floods hit city'
LINK 'http://news.bbc.co.uk/go/rss/-/1/hi/scotland/2244003.stm'
DESCRIPTION 'People are advised not to travel to Inverness after ...

The first title/link/description combination contains information about the site, the others contain information about individual items.

(Note that there are extra title and link values in the first section. If you look in the source RSS file, you'll notice that they come from an extra image element, which we can safely ignore for the moment.)
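
That image element looks something like this (an illustrative fragment; the exact contents vary from feed to feed):

<image>
  <title>BBC News Online</title>
  <url>...</url>
  <link>http://news.bbc.co.uk</link>
</image>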

To get a usable RSS parser, all you have to do is add some logic that keeps track of where in the file we are, and adds element values to the right data structure.

In the following example, the element handlers update a common current dictionary attribute, which is set to point to either the channel information dictionary, or a dictionary for each item (stored in the items list). This version also does some very basic syntax checking.

Example: a simple RSS parser (File: rss_parser.py)
import xmllib

class ParseError(Exception):
    pass

class rss_parser(xmllib.XMLParser):

    def __init__(self):
        xmllib.XMLParser.__init__(self)
        self.rss_version = None
        self.channel = None
        self.current = None
        self.data_tag = None
        self.data = None
        self.items = []

    # stuff to deal with text elements

    def _start_data(self, tag):
        if self.current is None:
            raise ParseError("%s tag not in channel or item element" % tag)
        self.data_tag = tag
        self.data = ""

    def handle_data(self, data):
        if self.data is not None:
            self.data = self.data + data

    # cdata sections are handled as any other character data
    handle_cdata = handle_data

    def _end_data(self):
        if self.data_tag:
            self.current[self.data_tag] = self.data or ""

    # main rss structure

    def start_rss(self, attr):
        self.rss_version = attr.get("version")

    def start_channel(self, attr):
        if self.rss_version is None:
            raise ParseError("not a valid RSS 0.9x file")
        self.current = {}
        self.channel = self.current

    def start_item(self, attr):
        if self.rss_version is None:
            raise ParseError("not a valid RSS 0.9x file")
        self.current = {}
        self.items.append(self.current)

    # content elements

    def start_title(self, attr):
        self._start_data("title")
    end_title = _end_data

    def start_link(self, attr):
        self._start_data("link")
    end_link = _end_data

    def start_description(self, attr):
        self._start_data("description")
    end_description = _end_data

The _start_data and _end_data methods are used to switch on and off character data processing in handle_data.
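
Before wiring the parser up to anything else, a quick sanity check is to run it over one of the files we downloaded earlier, and inspect the resulting channel dictionary and items list (the filename is just an example):

import rss_parser

parser = rss_parser.rss_parser()
parser.feed(open("www.bbc.co.uk.rss").read())
parser.close()

print "rss version", repr(parser.rss_version)
print "channel title", repr(parser.channel.get("title"))
print len(parser.items), "items"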

Here’s a test script, which prints each item to standard output (via the end_item method).

import rss_parser, string, sys

class my_rss_parser(rss_parser.rss_parser):

    def end_item(self):
        item = self.items[-1]
        print string.strip(item.get("title") or "")
        print item.get("link")
        print item.get("description")
        print

for filename in sys.argv[1:]:
    file = open(filename)
    try:
        parser = my_rss_parser()
        parser.feed(file.read())
        parser.close()
    except:
        print "=== cannot parse %s:" % filename
        print "===", sys.exc_type, sys.exc_value

Incremental Parsing

The above example reads the entire XML document from disk, and passes it to the parser in one go. The xmllib library also supports incremental parsing, allowing you to pass in XML fragments as you receive them. Just keep calling the feed method, and make sure to call close when you’re done. The parser framework will take care of the rest.
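
As a simple illustration, here's a sketch that feeds a local file to the parser in small chunks instead of all at once (the chunk size is arbitrary):

import rss_parser

parser = rss_parser.rss_parser()

file = open("www.bbc.co.uk.rss")
while 1:
    chunk = file.read(512)
    if not chunk:
        break
    parser.feed(chunk) # parse as much as possible
parser.close() # flush any remaining data

print len(parser.items), "items"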

This feature is of course a perfect match for the http_client module we developed in the first article; by plugging in a parser instance as the consumer, you can parse RSS items as they arrive over the network.

The following script provides an http_rss_parser class that adds the required http_header and http_failed methods to the parser, and uses an end_item handler to print incoming items:

import http_client, rss_parser, string

class http_rss_parser(rss_parser.rss_parser):

    def http_header(self, client):
        if (client.status[1] != "200" or
            client.header["content-type"] != "text/xml"):
            raise http_client.CloseConnection
        self.host = client.host

    def http_failed(self, client):
        pass

    def end_item(self):
        item = self.items[-1]
        print "   ", string.strip(item.get("title") or ""),
        print "[%s]" % self.host
        print "   ", string.strip(item.get("link") or "")
        print
        print item.get("description")
        print

Here's a driver script that reads a list of URLs from a text file named channels.txt, and fires up one asynchronous client for each channel.

import asyncore, http_client

file = open("channels.txt")

for url in file.readlines():
    url = url.strip()
    if url:
        http_client.do_request(url, http_rss_parser())

asyncore.loop()
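
The channels.txt file is just a plain text file with one RSS URL per line, along these lines (the URLs shown here are only placeholders; use the feeds you've been working with):

http://www.example.com/rss.xml
http://feeds.example.org/frontpage.rss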

The output is a list of titles, links, and descriptions. Here’s an excerpt:

    Blair defiant over Iraq [www.bbc.co.uk]
    http://news.bbc.co.uk/go/rss/-/1/hi/uk_politics/2247366.stm

Prime Minister Tony Blair confronts his trade union critics ...

    arrgh! [online.effbot.org]
    http://online.effbot.org#85432883

"Kom kom nu hit min vän, för glädjen blir större när man delar ...

    Buffet killers [www.kottke.org]
    http://www.kottke.org/02/09/020910buffet_kille.html

We're in Las Vegas and it's buffet time. It's always buffet ...

Note: As I write this, the www.scripting.com channel has just switched to something that appears to be an experimental version of Dave Winer's RSS 2.0, which moves all RSS tags into a default namespace. The xmllib parser always takes the namespace into account, so it won't find a single thing in that channel. Hopefully, this will be fixed in the not too distant future.


That’s all for today.

In the next article, we’ll look at what happens if you add dozens or hundreds of channels to the channels.txt file, and discuss how to deal with that. We’ll also build a simple RSS viewer using the Tkinter library.

In the meantime, if you’re running Unix, and are using a modern mail client that highlights URLs embedded in text mails, you can mail yourself the output from this program and let your mail reader do the rest:

$ python getchannels.py | mail -s effnews yourself