EffNews Part 3: Displaying RSS Data

September 26, 2002 | Fredrik Lundh

This is the third article covering the effnews project, a simple RSS newsreader written in Python. Other articles in this series are available via this page.

Storing Channel Lists

In the previous article, we ended up creating a simple utility that downloads a number of channels, parses their content, and writes titles, links and descriptions to the screen as plain text. The list of channels to read is stored in a text file, channels.txt.

Other RSS tools use a variety of file formats to store channel lists. One popular format is OPML (Outline Processor Markup Language), which is a simple XML-based format. An OPML file contains a head element that stores information about the OPML file itself, and a body element that holds a number of outline elements.

Each outline element can have any number of attributes. Common attributes include type (how to interpret other attributes) and text (what to display for this node in an outline viewer). Outline elements can be nested.

When storing RSS channels, the type attribute is set to rss, and channel information is stored in the title and xmlUrl attributes. Here’s an example:

<opml version="1.0">
<body>
<outline type="rss" title="bbc news"
  xmlUrl="http://www.bbc.co.uk/syndication/feeds/news/ukfs_news/front_page/rss091.xml" />
<outline type="rss" title="effbot.org"
  xmlUrl="http://online.effbot.org/rss.xml" />
<outline type="rss" title="scripting news"
  xmlUrl="http://www.scripting.com/rss.xml" />
<outline type="rss" title="mark pilgrim"
  xmlUrl="http://diveintomark.org/xml/rss2.xml" />
<outline type="rss" title="jason kottke"
  xmlUrl="http://www.kottke.org/index.xml" />
<outline type="rss" title="example"
  xmlUrl="http://www.example.com/rss.xml" />
</body>
</opml>

Parsing OPML

You can use the xmllib library to extract channel information from OPML files. The following parser class looks for outline tags, and collects titles and channel URLs from the attributes. Note that the parser looks for both xmlUrl and xmlurl attributes; both names are used in the documentation and samples I’ve seen.

Example: a simple OPML bookmark parser (File: opml_parser.py)
import xmllib

class ParseError(Exception):
    pass

class opml_parser(xmllib.XMLParser):

    def __init__(self):
        xmllib.XMLParser.__init__(self)
        self.channels = []

    def start_opml(self, attr):
        if attr.get("version", "1.0") != "1.0":
            raise ParseError("unknown OPML version")

    def start_outline(self, attr):
        channel = attr.get("xmlUrl") or attr.get("xmlurl")
        if channel:
            self.add_channel(attr.get("title"), channel)

    def add_channel(self, title, channel):
        # can be overridden
        self.channels.append((title, channel))


def load(file):

    # read the entire OPML file, and make sure it is closed afterwards
    f = open(file)
    try:
        data = f.read()
    finally:
        f.close()

    parser = opml_parser()
    parser.feed(data)
    parser.close()

    return parser.channels

The load function feeds the content of an OPML file through the parser, and returns a list of (title, channel URL) pairs.
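
The xmllib module is no longer part of Python 3. As a rough sketch, assuming a current Python 3 and the standard xml.etree.ElementTree module (which the original code does not use), the same extraction could look something like this. The load_opml_text name and the SAMPLE string are made up for illustration:

```python
import xml.etree.ElementTree as ET

class ParseError(Exception):
    pass

def load_opml_text(text):
    """Return a list of (title, channel URL) pairs from OPML source text."""
    root = ET.fromstring(text)
    if root.get("version", "1.0") != "1.0":
        raise ParseError("unknown OPML version")
    channels = []
    # iter() also visits nested outline elements
    for outline in root.iter("outline"):
        # accept both spellings of the attribute name
        url = outline.get("xmlUrl") or outline.get("xmlurl")
        if url:
            channels.append((outline.get("title"), url))
    return channels

# hypothetical sample data, including one nested outline
SAMPLE = """<opml version="1.0">
<body>
<outline type="rss" title="effbot.org"
  xmlUrl="http://online.effbot.org/rss.xml" />
<outline text="folder">
  <outline type="rss" title="example"
    xmlurl="http://www.example.com/rss.xml" />
</outline>
</body>
</opml>"""

for title, url in load_opml_text(SAMPLE):
    print(title, url)
```

Like the xmllib version, this returns the channels in document order, and silently skips outline elements that have no channel URL.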

Here’s a simple script that uses the http_rss_parser class from the second article to fetch and render all channels listed in the channels.opml file:

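import asyncore, http_client, opml_parser

# note: http_rss_parser is the consumer class from the second article;
# it must be defined (or imported) before the loop below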

channels = opml_parser.load("channels.opml")

for title, uri in channels:
    http_client.do_request(uri, http_rss_parser())

asyncore.loop()

Managing Downloads

You can find RSS channel collections in various places on the web, such as NewsIsFree and Syndic8. These sites have links to thousands of RSS channels from a wide variety of sources.

Most real people probably use a dozen feeds or so, but someone like the pirate Pugg (“For I am not your usual uncouth pirate, but refined and with a Ph.D., and therefore extremely high-strung”) would most likely want to subscribe to every feed under the sun. What would happen if he tried?

If you pass an OPML file containing a thousand feeds to the previous script, it will happily issue a thousand socket requests. Exactly what happens depends on your operating system, but it’s likely that it will run out of resources at some point (if you decide to try this out on your favourite platform, let me know what happens).

To avoid this problem, you can add requests to a queue, and make sure you never create more sockets than your computer can handle (leaving some room for other applications is also a nice thing to do).

Limiting the number of simultaneous connections

Here’s a simple manager class that never creates more than a given number of sockets:

Example: An HTTP connection manager class (File: http_manager.py)
import asyncore, http_client

class http_manager:

    max_connections = 4

    def __init__(self):
        self._queue = []

    def request(self, uri, consumer):
        self._queue.append((uri, consumer))

    def poll(self, timeout=0.1):
        # activate up to max_connections channels
        while self._queue and len(asyncore.socket_map) < self.max_connections:
            http_client.do_request(*self._queue.pop(0))
        # keep the network running
        asyncore.poll(timeout)
        # return non-zero if we should keep polling
        return len(self._queue) or len(asyncore.socket_map)

In this class, the request method adds URLs and consumer instances to an internal queue. The poll method activates queued requests until at most max_connections asyncore objects are active (asyncore keeps references to active sockets in the socket_map variable), and returns a non-zero value as long as there is still work to do.
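
To see the queueing logic in isolation, here is a minimal sketch that does not depend on asyncore or http_client at all. It assumes Python 3, keeps track of active jobs in a plain set instead of asyncore.socket_map, and uses made-up names (BoundedLauncher, start, finish) that do not appear in the manager class:

```python
class BoundedLauncher:
    """Start queued jobs, but never keep more than max_active running."""

    def __init__(self, start, max_active=4):
        self._start = start        # hypothetical callback that starts a job
        self.max_active = max_active
        self._queue = []
        self.active = set()

    def request(self, job):
        # just remember the job; nothing is started yet
        self._queue.append(job)

    def finish(self, job):
        # called when a job is done, to free up a slot
        self.active.discard(job)

    def poll(self):
        # start queued jobs until all slots are taken
        while self._queue and len(self.active) < self.max_active:
            job = self._queue.pop(0)
            self.active.add(job)
            self._start(job)
        # non-zero as long as something is queued or running
        return len(self._queue) + len(self.active)


launcher = BoundedLauncher(lambda job: print("starting", job), max_active=2)
for name in ["one", "two", "three"]:
    launcher.request(name)

launcher.poll()          # starts "one" and "two"; "three" has to wait
launcher.finish("one")
launcher.poll()          # a slot is free, so "three" is started
```

In the real manager, the finish step is implicit: asyncore removes a socket from socket_map when it is closed, which frees a slot for the next poll call.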

To use the manager, all you have to do is to create an instance of the http_manager class, call the request method for each channel you want to fetch, and keep calling the poll method over and over again to keep the network traffic going:

manager = http_manager.http_manager()

manager.request(url, consumer)

while manager.poll(1):
    pass

Limiting the size of an RSS file

You can also use the manager for other purposes. For example, to prevent denial-of-service attacks from malicious (or confused) RSS providers, you can use the http client’s byte counters, and simply kill the socket if it has processed more than a given number of bytes:

    max_size = 1000000 # bytes

    for channel in asyncore.socket_map.values():
        if channel.bytes_in > self.max_size:
            channel.close()

Timeouts

Another useful feature is a time limit; instead of checking the byte counter, you can check the timestamp variable, and compare it to the current time:

    max_time = 30 # seconds

    now = time.time()
    for channel in asyncore.socket_map.values():
        if now - channel.timestamp > self.max_time:
            channel.close()

And of course, nothing stops you from checking both the size and the elapsed time in the same loop:

    now = time.time()
    for channel in asyncore.socket_map.values():
        # close the channel once, no matter which limit it exceeded
        if (channel.bytes_in > self.max_size or
            now - channel.timestamp > self.max_time):
            channel.close()
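
If you want to experiment with the limits without a network connection, the following sketch runs the same checks against stand-in objects. It assumes Python 3; FakeChannel and enforce_limits are hypothetical names, and only the bytes_in and timestamp attributes mirror the http client from the previous article:

```python
import time

class FakeChannel:
    """Stand-in for an http client object; only what the checks need."""

    def __init__(self, bytes_in, timestamp):
        self.bytes_in = bytes_in
        self.timestamp = timestamp
        self.closed = False

    def close(self):
        self.closed = True

def enforce_limits(channels, max_size=1000000, max_time=30, now=None):
    """Close channels that are too large or too slow; return those closed."""
    if now is None:
        now = time.time()
    closed = []
    # iterate over a copy, since closing may modify the real container
    for channel in list(channels):
        too_big = channel.bytes_in > max_size
        too_slow = now - channel.timestamp > max_time
        if too_big or too_slow:
            channel.close()
            closed.append(channel)
    return closed

now = time.time()
channels = [
    FakeChannel(bytes_in=500, timestamp=now - 1),      # within both limits
    FakeChannel(bytes_in=2000000, timestamp=now - 1),  # too large
    FakeChannel(bytes_in=500, timestamp=now - 60),     # too slow
]
print(len(enforce_limits(channels, now=now)), "channels closed")
```

Passing the current time in as an argument makes the function easy to test with fixed timestamps.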

Building a Simple User Interface

Okay, enough infrastructure. It’s time to start working on something that ordinary humans might be willing to use: a nice, welcoming, easy-to-use graphical front-end.

Introducing Tkinter

The Tkinter library provides a number of portable building blocks for graphical user interfaces. Code written for Tkinter runs, usually without any changes, on systems based on Windows, Unix (and Linux), as well as on the Macintosh.

The most important building blocks provided by Tkinter are the standard widgets. The term widget is used both for a piece of code that may control a region of the screen (a widget class) and a specific region controlled by that code (a widget instance). Tkinter provides about a dozen standard widgets, such as labels, input fields, and list boxes, and it’s also relatively easy to create new custom widgets.

In Tkinter, each widget is represented by a Python class. When you create an instance of that class, the Tkinter layer will create a corresponding widget and display it on the screen.

Each Tkinter widget must have a parent widget, which “owns” the widget. When the parent is moved, the child widget also moves. When the parent is destroyed, the child widget is destroyed as well.

Here’s an example:

from Tkinter import *

root = Tk()
root.title("example")

widget = Label(root, text="this is an example")
widget.pack()

mainloop()

This script creates a root window by calling the Tk widget constructor. It then calls the title method to set the window title, and uses the Label widget constructor to add a text label to the window. Note that the parent widget is passed in as the first argument, and that keyword arguments are used to specify the text.

The script then calls the pack method. This is a special method that tells Tkinter to display the label widget inside its parent (the root window, in this case), and to make the parent large enough to hold the label.

Finally, the script calls the mainloop function. This function starts an event loop that looks for events from the window system. This includes events like key presses, mouse actions, and drawing requests, which are passed on to the widget implementation.

For more information on Tkinter, see An Introduction to Tkinter and the other documentation available from python.org.

Prototyping the EffNews application window

For the first prototype, let’s use a standard two-panel interface, with a list of channels to the left, and the contents of the selected channel in a larger panel to the right.

The Tkinter library provides a standard Listbox widget that can be used for the channel list. This widget displays a number of text strings, and lets you select one item from the list (or many, depending on how the widget is configured).

To render the contents, it would be nice if we could render the title on a line in a distinct font, followed by the description in a more neutral style. Something like this:

High hopes for new Wembley
FA chief Adam Crozier says the new Wembley will be the best stadium in the world.

Archer moved from open prison
Lord Archer is being moved from his open prison after breaking its rules by attending a lunch party during a home visit.

For this purpose, you can use the Text widget. This widget allows you to display text in various styles, and it takes care of things like word wrapping and scrolling. (The Text widget can also be used as a full-fledged text editor, but that’s outside the scope of this series. At least right now.)

Before you start creating widgets, the newsreader script will need to do some preparations. The first part imports Tkinter and a few other modules, creates a download manager instance, and parses an OPML file to get the list of channels to load:

from Tkinter import *

import sys
import http_manager, opml_parser

manager = http_manager.http_manager()

if len(sys.argv) > 1:
    channels = opml_parser.load(sys.argv[1])
else:
    channels = opml_parser.load("channels.opml")

Note that you can pass in the name of an OPML file on the command line (sys.argv[0] is the name of the program, sys.argv[1] the first argument). If you leave out the file name, the script loads the channels.opml file.

The next step is to create the root window. At the top of the window, add a Frame widget that will act like a toolbar. The frame is an empty widget, which may have a background colour and a border, but no content of its own. Frames are mostly used to organize other widgets, like the buttons on the toolbar.

root = Tk()
root.title("effnews")

toolbar = Frame(root)
toolbar.pack(side=TOP, fill=X)

The toolbar is packed towards the top of the parent widget (the root window). The fill option tells the packer to make the widget as wide as its parent (instead of X, you can use Y to make it as high as the parent, and BOTH to fill in both directions).

For now, the only thing we’ll have in the toolbar is a reload button. When you click this button, the schedule_reloading function adds all channels to the manager queue.

def schedule_reloading():
    for title, channel in channels:
        manager.request(channel, http_rss_parser(channel))

b = Button(toolbar, text="reload", command=schedule_reloading)
b.pack(side=LEFT)

Here, the button is packed against the left side of the parent widget (the toolbar, not the root window). The command option is used to call a Python function when the button is pressed.

The http_rss_parser class used here is a variant of the consumer class with the same name that you’ve used earlier. It should parse RSS data, and store the incoming items somewhere. We’ll get to the code for this class in a moment.

Next, we’ll add a Tkinter Listbox widget, and fill it with channel titles. The listbox is packed against the left side of the parent widget, under the toolbar (which was packed before the listbox).

channel_listbox = Listbox(root, background="white")
channel_listbox.pack(side=LEFT, fill=Y)

for title, channel in channels:
    # load listbox
    channel_listbox.insert(END, title)

def select_channel(event):
    selection = channel_listbox.curselection()
    if selection:
        selection = int(selection[0])
        title, channel = channels[selection]
        update_content(channel)

channel_listbox.bind("<Double-Button-1>", select_channel)

The select_channel function is used to display the contents of a channel in the Text widget. The curselection method returns the indexes of all selected items. The indexes work like Python list indexes, but they are returned as strings. If the list is not empty (that is, if at least one item is selected), the first index is converted to an integer, and used to get the channel URL from the channels list. The update_content function displays that channel in the text widget; we’ll get back to this function later in this article.

The bind call, finally, sets things up so that the select_channel function is called when the user double-clicks on an item in the listbox.
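
The index handling in select_channel can be tried out without creating any widgets. Here is a small sketch, assuming Python 3; the channel_for_selection name is made up for this example, and the selection argument stands in for whatever curselection returns:

```python
def channel_for_selection(selection, channels):
    """Map a curselection-style result to a channel URL, or None."""
    if not selection:
        return None                # nothing selected
    index = int(selection[0])      # accepts "1" as well as 1
    title, channel = channels[index]
    return channel

channels = [
    ("bbc news", "http://www.bbc.co.uk/syndication/feeds/news/ukfs_news/front_page/rss091.xml"),
    ("effbot.org", "http://online.effbot.org/rss.xml"),
]

print(channel_for_selection(("1",), channels))
print(channel_for_selection((), channels))
```

Calling int on the index means the function does not care whether the indexes arrive as strings or as integers.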

To complete the user interface, create a text widget for the channel contents. The widget is packed against the top of the remaining space in the parent widget (it ends up under the toolbar, and to the right of the listbox). The fill option is used to make it fill the entire space, and the expand option tells Tkinter that if the user resizes the application window, the text widget gets any extra space.

content_pane = Text(root, wrap=WORD)
content_pane.pack(side=TOP, fill=BOTH, expand=1)

content_pane.tag_config("head", font="helvetica 12 bold", foreground="blue")
content_pane.tag_config("body", font="helvetica 10")

mainloop()

The tag_config method is used to define styles to use in the text widget. Here, we define two styles; text using the head style is drawn in a 12-point bold Helvetica font, and coloured blue; text using the body style is drawn in a smaller Helvetica font, using the default colour.

That’s it.

Almost. You also need to implement the http_rss_parser parser and the update_content function.

Let’s start with the parser.

Storing the channel items

You can reuse the http_rss_parser class from the previous article pretty much right away. All you have to do is to put the channel items somewhere, so they can be found by the update_content function.

The following example adds a channel identifier (the URL) as an object attribute, and uses it to store the collected items in a global dictionary when it reaches the end of the file. If the identifier matches the current_channel variable, it also calls the update_content function.

items = {}

class http_rss_parser(rss_parser.rss_parser):

    def __init__(self, channel):
        rss_parser.rss_parser.__init__(self)
        self._channel = channel

    def http_header(self, client):
        if (client.status[1] != "200" or
            client.header["content-type"] != "text/xml"):
            raise http_client.CloseConnection

    def http_failure(self, client):
        pass

    def end_rss(self):
        items[self._channel] = self.items
        if self._channel == current_channel:
            update_content(self._channel) # update display

Displaying channel items

The next piece of the puzzle is the update_content function. This function takes a channel identifier (the URL), and displays the items in the text window.

current_channel = None

def update_content(channel):

    global current_channel

    current_channel = channel

    # clear the text widget
    content_pane.delete(1.0, END)

    if not items.has_key(channel):
        content_pane.insert(END, "channel not loaded")
        return

    # add newsitems to the text widget
    for item in items[channel]:
        title = item.get("title")
        if title:
            content_pane.insert(END, title.strip() + "\n", "head")
        description = item.get("description")
        if description:
            content_pane.insert(END, description.strip() + "\n", "body")
        content_pane.insert(END, "\n")

The global current_channel variable keeps track of what’s currently displayed in the text widget. It is used by the parser class to update the widget, if the channel is being displayed.

Data may be missing from the items dictionary, either because the parser hasn’t finished yet, or because the channel could not be read or parsed. In this case, the function displays the text channel not loaded and returns. Otherwise, it loops over the items, and adds the titles and descriptions to the text widget. The third argument to insert is the style name.
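
One way to check the formatting logic without a Text widget is to separate it from the drawing code. The following sketch, assuming Python 3, builds a list of (text, style name) pairs instead of inserting them; render_segments is a hypothetical helper, not part of the script above:

```python
def render_segments(items, channel):
    """Return (text, style) pairs for a channel; style is None for plain text."""
    if channel not in items:
        return [("channel not loaded", None)]
    segments = []
    for item in items[channel]:
        title = item.get("title")
        if title:
            segments.append((title.strip() + "\n", "head"))
        description = item.get("description")
        if description:
            segments.append((description.strip() + "\n", "body"))
        segments.append(("\n", None))   # blank line between items
    return segments

items = {
    "http://www.example.com/rss.xml": [
        {"title": " High hopes for new Wembley\n",
         "description": "FA chief Adam Crozier says the new Wembley "
                        "will be the best stadium in the world."},
    ],
}

for text, style in render_segments(items, "http://www.example.com/rss.xml"):
    print(repr(text), style)
```

To display the result, loop over the pairs and call the insert method for each one, passing the style name along only when it isn’t None.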

Keeping the network traffic going

If you put the pieces together, you’ll find that the program is almost working. It creates the widgets and displays them, loads the channels into the listbox, and schedules a number of http requests. But that’s all that happens; the requests never finish.

To fix this, you need to keep calling the poll method of the download manager at regular intervals. The Tkinter library contains a convenient timer mechanism that you can use for this purpose; the after method is used to register a callback that will be called after a given period of time (given in milliseconds).

The following code sets things up so that the network will be polled about 10 times a second. It also schedules all channels for loading when the application is started, and selects the first item in the listbox before entering the Tkinter mainloop.

import traceback

# schedule all channels for loading
schedule_reloading()

def poll_network(root):
    try:
        manager.poll(0.1)
    except:
        traceback.print_exc()
    root.after(100, poll_network, root)

# start polling the network
poll_network(root)

# display the first channel, if there is one
if channels:
    channel_listbox.select_set(0)
    update_content(channels[0][1])

# start the user interface
mainloop()
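
Note that poll_network registers itself again on every call, even if the poll method raised an exception; otherwise, a single network error would stop all traffic. The following sketch, assuming Python 3, illustrates the pattern with stand-ins for both the root window and the manager. FakeRoot and FlakyManager are made up for this example, and FakeRoot.after only records the callback instead of waiting:

```python
class FakeRoot:
    """Hypothetical stand-in for a Tk root; records after() calls."""

    def __init__(self):
        self.pending = []

    def after(self, ms, func, *args):
        self.pending.append((ms, func, args))

    def run_next(self):
        # pretend the delay has passed, and run the oldest callback
        ms, func, args = self.pending.pop(0)
        func(*args)

class FlakyManager:
    """Stand-in manager whose second poll call fails."""

    def __init__(self):
        self.calls = 0

    def poll(self, timeout):
        self.calls += 1
        if self.calls == 2:
            raise IOError("simulated network error")

manager = FlakyManager()
errors = []

def poll_network(root):
    try:
        manager.poll(0.1)
    except Exception as exc:
        errors.append(exc)               # the script prints a traceback here
    root.after(100, poll_network, root)  # always reschedule

root = FakeRoot()
poll_network(root)
for i in range(3):
    root.run_next()

print("polls:", manager.calls, "errors:", len(errors))
```

After the simulated error, the callback is still registered, so polling simply continues on the next timer tick.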

Putting it all together

For your convenience, here’s the final script:

Example: The first user-interface prototype (File: effnews.py)
from Tkinter import *

import http_client, http_manager
import opml_parser
import rss_parser

import sys, traceback

#
# item database

items = {}

#
# parse channels, and store item lists in the global items dictionary

class http_rss_parser(rss_parser.rss_parser):

    def __init__(self, channel):
        rss_parser.rss_parser.__init__(self)
        self._channel = channel

    def http_header(self, client):
        if (client.status[1] != "200" or
            client.header["content-type"] != "text/xml"):
            raise http_client.CloseConnection

    def http_failure(self, client):
        pass

    def end_rss(self):
        items[self._channel] = self.items
        if self._channel == current_channel:
            update_content(self._channel) # update display

#
# globals

manager = http_manager.http_manager()

if len(sys.argv) > 1:
    channels = opml_parser.load(sys.argv[1])
else:
    channels = opml_parser.load("channels.opml")

#
# create the user interface

root = Tk()
root.title("effnews")

#
# toolbar

toolbar = Frame(root)
toolbar.pack(side=TOP, fill=X)

def schedule_reloading():
    for title, channel in channels:
        manager.request(channel, http_rss_parser(channel))

b = Button(toolbar, text="reload", command=schedule_reloading)
b.pack(side=LEFT)

#
# channels

channel_listbox = Listbox(root, background="white")
channel_listbox.pack(side=LEFT, fill=Y)

def select_channel(event):
    selection = channel_listbox.curselection()
    if selection:
        selection = int(selection[0])
        title, channel = channels[selection]
        update_content(channel)

channel_listbox.bind("<Double-Button-1>", select_channel)

for title, channel in channels:
    channel_listbox.insert(END, title)

#
# content panel

content_pane = Text(root, wrap=WORD)
content_pane.pack(side=TOP, fill=BOTH, expand=1)

content_pane.tag_config("head", font="helvetica 12 bold", foreground="blue")
content_pane.tag_config("body", font="helvetica 10")

current_channel = None

def update_content(channel):

    global current_channel

    current_channel = channel

    # clear the text widget
    content_pane.delete(1.0, END)

    if not items.has_key(channel):
        content_pane.insert(END, "channel not loaded")
        return

    # add newsitems to the text widget
    for item in items[channel]:
        title = item.get("title")
        if title:
            content_pane.insert(END, title.strip() + "\n", "head")
        description = item.get("description")
        if description:
            content_pane.insert(END, description.strip() + "\n", "body")
        content_pane.insert(END, "\n")

# get going

schedule_reloading()

def poll_network(root):
    try:
        manager.poll(0.1)
    except:
        traceback.print_exc()
    root.after(100, poll_network, root)

poll_network(root)

if channels:
    channel_listbox.select_set(0)
    update_content(channels[0][1]) # display first channel

mainloop()

If you run this script on the sample channels.opml file from the beginning of this article, you’ll get a window with the reload button at the top, the channel list to the left, and the channel contents to the right.

The first channel is selected, and if everything goes well, the channel contents will appear in the window after a second or so. To display any other channel, double-click on the channel title in the listbox.

If the text won’t fit in the text widget, you can scroll the text by pressing the mouse pointer inside the widget and dragging up or down. (We’ll add scrollbars in the next article.)

To refresh the contents, click the reload button. All channels will be loaded from the servers, and the items listing in the text widget will be updated.

About the sample channels

The sample channels.opml file contains six channels. Only three of them are properly rendered by the current prototype.

The bbc news, effbot.org, and kottke channels all use the RSS 0.9x file format. However, as you may notice, the bbc news channel is the only one that works flawlessly.

The effbot.org channel is generated by the Blogger Pro tool, which has a tendency to mess up on non-US character encodings. Since some articles are written in Swedish, using ISO Latin-1 characters, you may find that the XML parser chokes on the contents. Blogger is also known to generate bad output if the source uses XML character entities. To deal with broken feeds like this, you need a more robust RSS parser.

The kottke channel is in better shape (possibly because he’s not using odd European characters), but you may find that the description contains strange line endings and strange little boxes. The line endings are probably copied verbatim from the site’s source code; web browsers usually don’t care about line endings. And the boxes are carriage return characters that are also copied as is from the source code. Getting rid of the line feeds and the bogus whitespace characters should be straightforward.
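
As a minimal sketch of such a cleanup, assuming Python 3, you can split the text on any run of whitespace (which includes line feeds and carriage returns) and join the pieces with single spaces. The normalize_whitespace name is made up for this example:

```python
def normalize_whitespace(text):
    """Collapse runs of whitespace, including CR and LF, to single spaces."""
    # split() without arguments splits on any whitespace and drops empty parts
    return " ".join(text.split())

print(normalize_whitespace("Lord Archer is\r\nbeing moved\r\n   from his open prison. "))
```

This also strips leading and trailing whitespace, so it could replace the strip calls in update_content.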

The pilgrim feed uses the new RSS 2.0 format. RSS 2.0 is an extension to the 0.9x format that’s supposed to be fully backwards compatible, and the feed renders just fine in the current prototype.

The scripting news feed also uses the RSS 2.0 format, but it places all tags in an undocumented default namespace (http://backend.userland.com/rss2). As a result, the current prototype parser won’t find a single thing in that feed. (And as expected, all attempts to find out if this is a problem with the feed or with the documentation have failed. But that’s another story.)
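
To illustrate the problem, here is a sketch that simply ignores namespaces when looking for items. It assumes Python 3 and the standard xml.etree.ElementTree module rather than the xmllib-based parser used in this series; ElementTree reports namespaced tags as {uri}name, so the local_name helper (a made-up name, like collect_items and SAMPLE) strips the prefix:

```python
import xml.etree.ElementTree as ET

def local_name(tag):
    """Strip a leading {namespace} from an ElementTree tag name."""
    if tag.startswith("{"):
        return tag.split("}", 1)[1]
    return tag

def collect_items(text):
    """Return a list of item dictionaries, whatever namespace is used."""
    items = []
    for elem in ET.fromstring(text).iter():
        if local_name(elem.tag) == "item":
            item = {}
            for child in elem:
                item[local_name(child.tag)] = child.text or ""
            items.append(item)
    return items

# hypothetical feed that puts all tags in a default namespace
SAMPLE = """<rss version="2.0" xmlns="http://backend.userland.com/rss2">
<channel>
<item><title>hello</title><description>world</description></item>
</channel>
</rss>"""

print(collect_items(SAMPLE))
```

Ignoring namespaces altogether is a blunt tool, since it would also pick up item elements from unrelated vocabularies, but it shows why a parser that only looks for plain tag names comes up empty.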

The example channel, finally, contains a bogus URL, and results in a channel not loaded message. This is of course exactly what’s supposed to happen.


In the next article, we’ll continue working on the prototype, trying to turn it into a more useful and more robust application. We’ll look at ways to deal with possibly broken channels, such as the effbot.org and scripting news feeds.