
EffNews Part 1: Fetching RSS Files

September 5, 2002 | Updated September 8, 2002 | Fredrik Lundh

RSS Files

RSS is an XML-based file format that provides a “site summary”, that is, a brief summary of the information published on a site. It’s usually used to provide a machine-readable version of the contents of a news site or weblog.

Depending on who you talk to, RSS means “Rich Site Summary”, “RDF Site Summary”, or “Really Simple Syndication” (or perhaps “Really Small Something”). It was originally created by Netscape for use on their my.netscape.com site, and was later developed into two similar but slightly different versions, RSS 0.9x/2.0 and RSS 1.0.

An RSS 0.9x file might look something like this:

<?xml version="1.0"?>
<rss version="0.91">
  <channel>
    <title>the eff-bot online</title>
    <link>http://online.effbot.org</link>
    <description>Fredrik Lundh's clipbook.</description>
    <language>en-us</language>
    ...
    <item>
      <title>spam, spam, spam</title>
      <link>http://online.effbot.org#85292735</link>
      <description>for the first seven months of 2002, the spam
      filters watching fredrik@pythonware.com has</description>
    </item>
    ...
  </channel>
</rss>

The content consists of some descriptive information (the site’s title, a link to an HTML rendering of the content, etc.) and a number of item elements, each of which contains an item title, a link, and a (usually brief) description.

We’ll look into RSS parsing and other RSS formats in later articles. For now, we’re more interested in getting our hands on some RSS files to parse…

Using HTTP to Download Files

Like all other resources on the web, an RSS file is identified by a uniform resource identifier (URI). A typical RSS URI might look something like this:

http://online.effbot.org/rss.xml

To fetch this RSS file, the aggregator connects to the computer named online.effbot.org and issues an HTTP request, asking the server to return the document identified as /rss.xml.

Here’s a minimal HTTP request message that does exactly this:

GET /rss.xml HTTP/1.0
Host: online.effbot.org

The message should be followed by an empty line.

If everything goes well, the HTTP server responds with a status line, followed by a number of header lines, an empty line, and the RSS file itself:

HTTP/1.1 200 OK
Last-Modified: Tue, 03 Sep 2002 11:04:09 GMT
ETag: "1e49dc-dfa-3d749729"
Content-Length: 3578
Content-Type: text/xml
Connection: close

...RSS data...

Sending an HTTP request

Python makes it easy to issue HTTP requests. Here’s an example that uses the socket module, which is a low-level interface for network communication:

HOST = "online.effbot.org"
PATH = "/rss.xml"

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

sock.connect((HOST, 80))

sock.send("GET %s HTTP/1.0\r\nHost: %s\r\n\r\n" % (PATH, HOST))

while 1:
    text = sock.recv(2048)
    if not text:
        break
    print "read", len(text), "bytes"

s.close()

The socket.socket call creates a socket for the INET (internet) network, and of the STREAM (reliable byte stream) type. This is more commonly known as a TCP connection.

The connect method is used to connect to a remote computer. The method takes a tuple containing two values: the computer name and the port number to use on that computer. In this example, we’re using port 80, which is the standard port for HTTP.

The send method is used to send the HTTP request to the server. Note that lines are separated by both a carriage return (\r) and a newline (\n), and that there’s an extra empty line at the end of the request.

The recv method, finally, is used to read data from the socket. Like the standard read method, it returns an empty string when there’s no more data to read.

Using an HTTP support library

In addition to the low-level socket module, Python’s standard library comes with modules that support common network protocols, including HTTP. The most obvious choice, httplib, is an intermediate-level module that provides only a thin layer on top of the socket library.
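
For comparison, here’s a minimal sketch of what the same request might look like with httplib (this module isn’t used in the rest of this series; HOST and PATH are as defined above):

import httplib

# connect to the server and issue the request
http = httplib.HTTPConnection(HOST)
http.request("GET", PATH)

# read the status line and the rest of the response
response = http.getresponse()
print response.status, response.reason
text = response.read()
http.close()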

The urllib module provides a higher-level interface. It takes a URL, generates a full HTTP request, parses the response header, and returns a file-like object that can be used to read the rest of the response right off the server:

import urllib

file = urllib.urlopen("http://" + HOST + PATH)
text = file.read()

print "read", len(text), "bytes"

Asynchronous HTTP

A problem with both the low-level socket library and urllib is that you can only read data from one site at a time. If you use sockets, the connect and recv calls may block, waiting for the server to respond. If you use urllib, both the urlopen and the read methods may block for the same reason.

If the task here was to create some kind of batch RSS aggregator, the easiest solution would probably be to ignore this problem, and read one site at a time. Who cares if it takes one second or ten minutes to check all channels; it would take much longer to visit all the sites by hand anyway.

However, in an interactive application, it’s rather bad style to block for an unknown amount of time. The application must be able to download things in the background, without locking up the user interface.

There are a number of ways to address this (including background processes and threads), but in this project, we’ll use something called asynchronous sockets, as provided by Python’s asyncore module.

The asyncore module provides “reactive” sockets, meaning that instead of creating socket objects, and calling methods on them to do things, your code is called by the socket framework when something can be done. This approach is known as event-driven programming.

The asyncore module contains a basic dispatcher class that represents a reactive socket. There’s also an extension to that class called dispatcher_with_send, which adds buffered output.

For the HTTP client, all you have to do is subclass the dispatcher_with_send class and implement the following methods:

  • handle_connect is called when a connection is successfully established.

  • handle_expt is called when a connection fails (Windows only; on most other platforms, connection failures are indicated by errors when writing to, or reading from, the socket).

  • handle_read is called when there is data waiting to be read from the socket. The callback should call the recv method to get the data.

  • handle_close is called when the socket is closed or reset.

Here’s a first version:

Example: a minimal asynchronous HTTP client (File: minimal_http_client.py)
import asyncore
import string, socket

class async_http(asyncore.dispatcher_with_send):
    # asynchronous http client

    def __init__(self, host, path):
        asyncore.dispatcher_with_send.__init__(self)

        self.host = host
        self.path = path

        self.header = None

        self.data = ""

        self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
        self.connect((host, 80))

    def handle_connect(self):
        # connection succeeded; send request
        self.send(
            "GET %s HTTP/1.0\r\nHost: %s\r\n\r\n" %
                (self.path, self.host)
            )

    def handle_expt(self):
        # connection failed
        self.close()

    def handle_read(self):
        # deal with incoming data
        data = self.recv(2048)

        if not self.header:
            # check if we have a full header
            self.data = self.data + data
            try:
                i = string.index(self.data, "\r\n\r\n")
            except ValueError:
                return # no empty line; continue
            self.header = self.data[:i+2]
            print self.host, "HEADER"
            print
            print self.header
            data = self.data[i+4:]
            self.data = ""

        if data:
            print self.host, "DATA", len(data)

    def handle_close(self):
        self.close()

The constructor creates a socket, and issues a connection request. Unlike ordinary sockets, the asynchronous connect method returns immediately; the framework calls the handle_connect method once it’s finished. When this method is called, our class immediately issues an HTTP request for the given RSS file. The framework makes sure that the request is sent as soon as the network is ready.

When the remote computer gets the request, it returns a response message. As data arrives, the handle_read method is called over and over again, until there’s no more data to read. Our handle_read method starts by looking for the header section (or rather, the empty line that identifies the end of the header). After that, it simply prints DATA messages to standard output.
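
Here’s roughly what that header scanning does, performed by hand in an interactive session (the data string is made up):

>>> import string
>>> data = "HTTP/1.1 200 OK\r\nContent-Type: text/xml\r\n\r\n<rss>..."
>>> i = string.index(data, "\r\n\r\n")
>>> data[:i]   # the header section
'HTTP/1.1 200 OK\r\nContent-Type: text/xml'
>>> data[i+4:] # everything after the empty line
'<rss>...'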

Let’s try this one out on a real site:

$ python
>>> from minimal_http_client import async_http
>>> async_http("online.effbot.org", "/rss.xml")
<async_http at 880294>
>>> import asyncore
>>> asyncore.loop()
online.effbot.org HEADER

HTTP/1.1 200 OK
Server: Apache/1.3.22 (Unix)
Last-Modified: Tue, 03 Sep 2002 11:04:09 GMT
ETag: "1e49dc-dfa-3d749729"
Content-Length: 3578
Content-Type: text/xml
Connection: close

online.effbot.org DATA 1139
online.effbot.org DATA 2048
online.effbot.org DATA 391

To issue a request, just create an instance of the async_http class. The instance registers itself with the asyncore framework, and all you have to do to run it is to call the asyncore.loop function.

The real advantage here is that you can issue multiple requests at once…

>>> async_http("www.scripting.com", "/rss.xml")
<async_http at 8da7a4>
>>> async_http("online.effbot.org", "/rss.xml")
<async_http at 8daf34>
>>> async_http("www.bbc.co.uk",
...     "/syndication/feeds/news/ukfs_news/front_page/rss091.xml")
<async_http at 8db364>
>>> asyncore.loop()

…and have the framework process all requests in parallel:

online.effbot.org HEADER
...
online.effbot.org DATA 1139
online.effbot.org DATA 2048
online.effbot.org DATA 391
www.scripting.com HEADER
...
www.scripting.com DATA 1189
www.scripting.com DATA 1460
www.bbc.co.uk HEADER
...
www.bbc.co.uk DATA 1766
www.bbc.co.uk DATA 712
www.scripting.com DATA 1460
www.scripting.com DATA 1460
www.scripting.com DATA 1158

(Actual headers omitted.)

The actual output may vary depending on your network connection, the servers, and the phase of the moon.

To get a bit more variation, put the above statements in a script and run the script a couple of times.
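
For example, such a script might look like this (a hypothetical file name; the statements are the same as in the interactive session above):

# File: fetch_feeds.py
# fetch several RSS files in parallel

import asyncore
from minimal_http_client import async_http

async_http("www.scripting.com", "/rss.xml")
async_http("online.effbot.org", "/rss.xml")
async_http("www.bbc.co.uk",
    "/syndication/feeds/news/ukfs_news/front_page/rss091.xml")

asyncore.loop()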

Storing the RSS Data

The code we’ve used this far simply prints information to the screen. Before moving on to parsing and display issues, let’s add some code to store the RSS data on disk.

The following version adds support for a consumer object, which is called when we’ve read the header, when data is arriving, and when there is no more data. A consumer should implement the following methods:

  • http_header(client) is called when we’ve read the HTTP header. It’s called with a reference to the client object, and can use attributes like status and header to inspect the response header.

  • http_failed(client) is similar to http_header, but is called if the framework fails to connect to the remote computer.

  • feed(data) is called when a chunk of data has been read from the remote computer, after the header has been read.

  • close() is called when there is no more data.

In addition to consumer support, the following code uses the mimetools module to parse the header into a dictionary-like structure, adds counters for incoming and outgoing data, and provides a factory function that knows how to pull a URL apart.
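
Here’s a quick interactive sketch of those two helpers in action (the URL and the header data are made up):

>>> import urlparse
>>> urlparse.urlparse("http://www.example.com:8080/rss.xml?page=2")
('http', 'www.example.com:8080', '/rss.xml', '', 'page=2', '')
>>> import mimetools, StringIO
>>> fp = StringIO.StringIO("Content-Type: text/xml\r\nContent-Length: 3578\r\n")
>>> header = mimetools.Message(fp)
>>> header["content-type"]
'text/xml'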

Example: an asynchronous HTTP client with consumer support (File: http_client.py)
import asyncore
import socket, time
import StringIO
import mimetools, urlparse

class async_http(asyncore.dispatcher_with_send):
    # asynchronous http client

    def __init__(self, host, port, path, consumer):
        asyncore.dispatcher_with_send.__init__(self)

        self.host = host
        self.port = port
        self.path = path

        self.consumer = consumer

        self.status = None
        self.header = None

        self.bytes_in = 0
        self.bytes_out = 0

        self.data = ""

        self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
        self.connect((host, port))

    def handle_connect(self):
        # connection succeeded
        text = "GET %s HTTP/1.0\r\nHost: %s\r\n\r\n" % (self.path, self.host)
        self.send(text)
        self.bytes_out = self.bytes_out + len(text)

    def handle_expt(self):
        # connection failed; notify consumer
        self.close()
        self.consumer.http_failed(self)

    def handle_read(self):

        data = self.recv(2048)
        self.bytes_in = self.bytes_in + len(data)

        if not self.header:
            # check if we've seen a full header

            self.data = self.data + data

            header = self.data.split("\r\n\r\n", 1)
            if len(header) <= 1:
                return
            header, data = header

            # parse header
            fp = StringIO.StringIO(header)
            self.status = fp.readline().split(" ", 2)
            self.header = mimetools.Message(fp)

            self.data = ""

            self.consumer.http_header(self)

            if not self.connected:
                return # channel was closed by consumer

        if data:
            self.consumer.feed(data)

    def handle_close(self):
        self.consumer.close()
        self.close()

def do_request(uri, consumer):

    # turn the uri into a valid request
    scheme, host, path, params, query, fragment = urlparse.urlparse(uri)
    assert scheme == "http", "only supports HTTP requests"
    try:
        host, port = host.split(":", 1)
        port = int(port)
    except (TypeError, ValueError):
        port = 80 # default port
    if not path:
        path = "/"
    if params:
        path = path + ";" + params
    if query:
        path = path + "?" + query

    return async_http(host, port, path, consumer)

Here’s a small test program that uses the enhanced client and a “dummy” consumer class:

import http_client, asyncore

class dummy_consumer:
    def http_header(self, client):
        self.host = client.host
        print self.host, repr(client.status)
    def http_failed(self, client):
        print self.host, "failed"
    def feed(self, data):
        print self.host, len(data)
    def close(self):
        print self.host, "CLOSE"

URLS = (
    "http://online.effbot.org/rss.xml",
    "http://www.scripting.com/rss.xml",
    "http://www.bbc.co.uk/syndication/feeds" +
        "/news/ukfs_news/front_page/rss091.xml",
    "http://www.example.com/rss.xml",
    )

for url in URLS:
    http_client.do_request(url, dummy_consumer())

asyncore.loop()

Here’s some sample output from this test program. Note the 404 error code from the example.com site.

online.effbot.org ['HTTP/1.1', '200', 'OK\r\n']
online.effbot.org 1139
online.effbot.org 1460
online.effbot.org 979
online.effbot.org CLOSE
www.bbc.co.uk ['HTTP/1.1', '200', 'OK\r\n']
www.bbc.co.uk 1766
www.bbc.co.uk 711
www.scripting.com ['HTTP/1.1', '200', 'OK\r\n']
www.scripting.com 1189
www.bbc.co.uk CLOSE
www.scripting.com 1460
www.example.com ['HTTP/1.1', '404', 'Not Found\r\n']
www.example.com 269
www.example.com CLOSE
www.scripting.com 1460
www.scripting.com 1460
www.scripting.com 1158
www.scripting.com CLOSE

To store things on disk, replace the dummy with a version that writes data to a file:

class file_consumer:

    # class-level defaults, in case the connection is
    # closed before we've seen a complete header
    host = None
    file = None

    def http_header(self, client):
        self.host = client.host
        self.file = None

    def http_failed(self, client):
        pass

    def feed(self, data):
        if self.file is None:
            self.file = open(self.host + ".rss", "w")
        self.file.write(data)

    def close(self):
        if self.file is not None:
            print self.host + ".rss ok"
            self.file.close()
        self.file = None

If you modify the test program to use this consumer instead of the dummy version, it’ll print something like this:

online.effbot.org.rss ok
www.example.com.rss ok
www.bbc.co.uk.rss ok
www.scripting.com.rss ok

Three of the four files contain current RSS data. The fourth (from example.com) contains an HTML error message. To avoid storing error messages, it’s probably a good idea to let the consumer check the status field as well as the Content-Type header field. You can do this in the http_header method:

class file_consumer:

    def http_header(self, client):
        if (client.status[1] != "200" or
            client.header["content-type"] != "text/xml"):
            print client.host, "failed"
            client.close() # bail out
            client.connected = 0
            return
        self.host = client.host
        self.file = None

    ...

Note that the consumer can simply call the client’s close method to shut down the connection. The client contains code that checks that it’s still connected after the http_header call, and avoids calling other consumer methods if it’s not.

Update 2002-09-08: not all versions of asyncore clear the connected attribute when the socket is closed. For example, the version shipped with Python 1.5.2 does, but the version shipped with 2.1 doesn’t. To be on the safe side, you have to clear the flag yourself in the consumer, as the code above does.


That’s all for today. In the next article, we’ll look at how to parse at least some variants of the RSS format into a more useful data structure.

While waiting, feel free to play with the code we’ve produced this far. Also, don’t forget to take a look at the RSS data files we just downloaded. Mark Nottingham’s RSS tutorial contains links to more information on various RSS formats.