Grabbing posts with Python and RSS

Fredrik Lundh | July 2007

In Grabbing posts with Python, I used JSON to fetch recent posts from the link management site.


Here’s another approach, which uses their RSS interface instead, and a simple RSS 1.0 parser built on ElementTree:

import urllib
import xml.etree.ElementTree as ET # Python 2.5
# import elementtree.ElementTree as ET

def RSS(tag): return "{http://purl.org/rss/1.0/}" + tag
def DC(tag): return "{http://purl.org/dc/elements/1.1/}" + tag

class Post(object):
    def __init__(self, item):
        self.link = item.findtext(RSS("link"))
        self.title = item.findtext(RSS("title"))
        self.description = item.findtext(RSS("description"))
        self.pubdate = item.findtext(DC("date"))
        self.tags = item.findtext(DC("subject"), "").split()

def getposts(user, tag=""):
    if isinstance(tag, tuple):
        tag = "+".join(tag)
    uri = "" % (user, tag)
    tree = ET.parse(urllib.urlopen(uri))
    return map(Post, tree.getiterator(RSS("item")))

Note that the parser has a limited understanding of the RSS format; it just locates all RSS 1.0 item elements in the document, and pulls out the relevant subelements using findtext.
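To see the item/findtext approach in isolation, here is a minimal offline sketch. It is written for Python 3 (the listing above targets Python 2), and the SAMPLE feed, the example.com addresses, and the parse_posts helper are made up for illustration; they are not part of the original code. The namespace URIs are the standard RSS 1.0 and Dublin Core 1.1 ones.

```python
import io
import xml.etree.ElementTree as ET

def RSS(tag): return "{http://purl.org/rss/1.0/}" + tag
def DC(tag): return "{http://purl.org/dc/elements/1.1/}" + tag

# Hypothetical RSS 1.0 document, invented for this example.
SAMPLE = b"""<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel rdf:about="http://example.com/feed">
    <title>sample</title>
    <link>http://example.com/</link>
    <description>sample feed</description>
  </channel>
  <item rdf:about="http://example.com/a">
    <title>First</title>
    <link>http://example.com/a</link>
    <dc:date>2007-07-01T12:00:00Z</dc:date>
    <dc:subject>python xml</dc:subject>
  </item>
  <item rdf:about="http://example.com/b">
    <title>Second</title>
    <link>http://example.com/b</link>
  </item>
</rdf:RDF>
"""

class Post(object):
    def __init__(self, item):
        # findtext returns None for a missing subelement unless a default is given
        self.link = item.findtext(RSS("link"))
        self.title = item.findtext(RSS("title"))
        self.description = item.findtext(RSS("description"))
        self.pubdate = item.findtext(DC("date"))
        self.tags = item.findtext(DC("subject"), "").split()

def parse_posts(source):
    # source is any file-like object; only RSS 1.0 item elements are picked up,
    # so the channel's own title and link are ignored
    tree = ET.parse(source)
    return [Post(item) for item in tree.iter(RSS("item"))]

posts = parse_posts(io.BytesIO(SAMPLE))
for post in posts:
    print(post.link, post.tags)
```

The second item has no dc:subject element, so its tag list comes out empty instead of raising an error; that is what the "" default passed to findtext is for.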

To try it out, you can do something like:

for post in getposts("effbot"):
    print post.link, post.tags


Which, when I write this, gives me something like (showing just the tag lists):

['humor', 'training']
['fun', 'music']
['berglin']
['berglin']
['animals']

See the previous article for more tips and tricks.

Lazy Parsing

The version of getposts used above pulls in the entire RSS document, and then uses getiterator to locate all item elements. Another, somewhat more elegant approach is to use ET’s iterparse interface to parse the document as it arrives, and yield populated Post objects as they’re being created:

def getposts(user, tag=""):
    if isinstance(tag, tuple):
        tag = "+".join(tag)
    uri = "" % (user, tag)
    for event, elem in ET.iterparse(urllib.urlopen(uri)):
        if elem.tag == RSS("item"):
            yield Post(elem)

This version has lower latency and uses less memory than the first version.
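One caveat: iterparse still builds the element tree as it goes, so most of the memory saving only shows up if each item is cleared once it has been used. The sketch below illustrates that pattern in Python 3 on an in-memory document; SAMPLE, the example.com addresses, and the iterlinks helper are hypothetical names invented for this example, and it yields plain link strings rather than Post objects to stay short.

```python
import io
import xml.etree.ElementTree as ET

RSS_ITEM = "{http://purl.org/rss/1.0/}item"
RSS_LINK = "{http://purl.org/rss/1.0/}link"

# Hypothetical RSS 1.0 document, invented for this example.
SAMPLE = b"""<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <item rdf:about="http://example.com/a">
    <title>First</title>
    <link>http://example.com/a</link>
  </item>
  <item rdf:about="http://example.com/b">
    <title>Second</title>
    <link>http://example.com/b</link>
  </item>
</rdf:RDF>
"""

def iterlinks(source):
    # iterparse reports "end" events by default, so an item is complete
    # (all subelements parsed) by the time it shows up here
    for event, elem in ET.iterparse(source):
        if elem.tag == RSS_ITEM:
            yield elem.findtext(RSS_LINK)
            # runs when the caller asks for the next value: drop the
            # item's subelements and text, which are no longer needed
            elem.clear()

links = iterlinks(io.BytesIO(SAMPLE))
first = next(links)   # only the first item has been parsed at this point
rest = list(links)    # parse the remainder of the document
print(first, rest)
```

The value has to be extracted before the call to clear, since clearing an element removes the very subelements that findtext reads.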

