EffNews Part 1: Fetching RSS Files
September 5, 2002 | Updated September 8, 2002 | Fredrik Lundh
The RSS file format is an XML-based file format that provides a “site summary”, that is, a brief summary of information published on a site. It’s usually used to provide a machine-readable version of the contents of a news site or a weblog.
Depending on who you talk to, RSS means “Rich Site Summary”, “RDF Site Summary” or “Really Simple Syndication” (or perhaps “Really Small Something”). It was originally created by Netscape for use on their my.netscape.com site, and was later developed into two similar but slightly differing versions, RSS 0.9x/2.0 and RSS 1.0.
An RSS 0.9x file might look something like this:
```xml
<?xml version="1.0"?>
<rss version="0.91">
  <channel>
    <title>the eff-bot online</title>
    <link>http://online.effbot.org</link>
    <description>Fredrik Lundh's clipbook.</description>
    <language>en-us</language>
    ...
    <item>
      <title>spam, spam, spam</title>
      <link>http://online.effbot.org#85292735</link>
      <description>for the first seven months of 2002, the spam
      filters watching firstname.lastname@example.org has</description>
    </item>
    ...
  </channel>
</rss>
```
The content consists of some descriptive information (the site’s title, a link to an HTML rendering of the content, etc) and a number of item elements, each of which contains an item title, a link, and a (usually brief) description.
Using HTTP to Download Files
Like all other resources on the web, an RSS file is identified by a uniform resource identifier (URI). A typical RSS URI might look something like:

```
http://online.effbot.org/rss.xml
```
To fetch this RSS file, the aggregator connects to the computer named online.effbot.org and issues an HTTP request, asking the server to return the document identified as /rss.xml.
Here’s a minimal HTTP request message that does exactly this:
```
GET /rss.xml HTTP/1.0
Host: online.effbot.org
```
The message should be followed by an empty line.
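In code, the easiest way to get the line endings right is to spell them out explicitly. Here’s a minimal sketch that builds the request string shown above (adjacent string literals are concatenated, which keeps each CRLF visible):

```python
host = "online.effbot.org"
path = "/rss.xml"

# each line ends with a CRLF pair; the extra "\r\n" at the end
# produces the empty line that terminates the request
request = (
    "GET %s HTTP/1.0\r\n"
    "Host: %s\r\n"
    "\r\n"
) % (path, host)
```

Forgetting that final empty line is a common mistake; many servers will simply hang, waiting for the rest of the header.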
If everything goes well, the HTTP server responds with a status line, followed by a number of header lines, an empty line, and the RSS file itself:
```
HTTP/1.1 200 OK
Last-Modified: Tue, 03 Sep 2002 11:04:09 GMT
ETag: "1e49dc-dfa-3d749729"
Content-Length: 3578
Content-Type: text/xml
Connection: close

...RSS data...
```
Sending an HTTP request
Python makes it easy to issue HTTP requests. Here’s an example that uses the socket module, which is a low-level interface for network communication:
```python
HOST = "online.effbot.org"
PATH = "/rss.xml"

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect((HOST, 80))
sock.send("GET %s HTTP/1.0\r\nHost: %s\r\n\r\n" % (PATH, HOST))

while 1:
    text = sock.recv(2048)
    if not text:
        break
    print "read", len(text), "bytes"

sock.close()
```
The socket.socket call creates a socket for the INET (internet) network, and of the STREAM (reliable byte stream) type. This is more commonly known as a TCP connection.
The connect method is used to connect to a remote computer. The method takes a tuple containing two values; the computer name and the port number to use on that computer. In this example, we’re using port 80 which is the standard port for HTTP.
The send method is used to send the HTTP request to the server. Note that lines are separated by both a carriage return (\r) and a newline (\n), and that there’s an extra empty line at the end of the request.
The recv method, finally, is used to read data from the socket. Like the standard read method, it returns an empty string when there’s no more data to read.
Using an HTTP support library
In addition to the low-level socket module, Python’s standard library comes with modules that support common network protocols, including HTTP. The most obvious choice, httplib, is an intermediate-level library that provides only a thin layer on top of the socket library.
The urllib module provides a higher-level interface. It takes an URL, generates a full HTTP request, parses the response header, and returns a file-like object that can be used to read the rest of the response right off the server:
```python
import urllib

file = urllib.urlopen("http://" + HOST + PATH)
text = file.read()

print "read", len(text), "bytes"
```
A problem with both the low-level socket library and urllib is that you can only read data from one site at a time. If you use sockets, the connect and recv calls may block, waiting for the server to respond. If you use urllib, both the urlopen and the read methods may block for the same reason.
If the task here was to create some kind of batch RSS aggregator, the easiest solution would probably be to ignore this problem, and read one site at a time. Who cares if it takes one second or ten minutes to check all channels; it would take much longer to visit all the sites by hand anyway.
However, in an interactive application, it’s rather bad style to block for an unknown amount of time. The application must be able to download things in the background, without locking up the user interface.
There are a number of ways to address this (including background processes and threads), but in this project, we’ll use something called asynchronous sockets, as provided by Python’s asyncore module.
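For comparison, here’s a sketch of the thread-based alternative. It’s only an illustration: the fetch function is a hypothetical stand-in for the real download code, and time.sleep plays the role of a blocking connect/recv, so the example runs without touching the network.

```python
import threading, time

results = []

def fetch(site):
    # stand-in for a blocking download; the sleep plays the
    # role of a connect/recv call that waits on the network
    time.sleep(0.2)
    results.append(site)

sites = ["online.effbot.org", "www.scripting.com", "www.bbc.co.uk"]

start = time.time()
threads = [threading.Thread(target=fetch, args=(site,)) for site in sites]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
elapsed = time.time() - start

# the three "downloads" overlap, so this takes about 0.2
# seconds in total, rather than 0.6
```

Threads work, but each connection ties up a thread for its entire lifetime. The asynchronous approach multiplexes all connections in a single thread instead.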
The asyncore module provides “reactive” sockets, meaning that instead of creating socket objects, and calling methods on them to do things, your code is called by the socket framework when something can be done. This approach is known as event-driven programming.
The asyncore module contains a basic dispatcher class that represents a reactive socket. There’s also an extension to that class called dispatcher_with_send, which adds buffered output.
For the HTTP client, all you have to do is to subclass the dispatcher_with_send class, and implement the following methods:
handle_connect is called when a connection is successfully established.
handle_expt is called when a connection fails (Windows only; on most other platforms, connection failures are indicated by errors when writing to, or reading from, the socket).
handle_read is called when there is data waiting to be read from the socket. The callback should call the recv method to get the data.
handle_close is called when the socket is closed or reset.
Here’s a first version:
```python
import asyncore
import string, socket

class async_http(asyncore.dispatcher_with_send):
    # asynchronous http client

    def __init__(self, host, path):
        asyncore.dispatcher_with_send.__init__(self)
        self.host = host
        self.path = path
        self.header = None
        self.data = ""
        self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
        self.connect((host, 80))

    def handle_connect(self):
        # connection succeeded; send request
        self.send(
            "GET %s HTTP/1.0\r\nHost: %s\r\n\r\n" % (self.path, self.host)
            )

    def handle_expt(self):
        # connection failed
        self.close()

    def handle_read(self):
        # deal with incoming data
        data = self.recv(2048)
        if not self.header:
            # check if we have a full header
            self.data = self.data + data
            try:
                i = string.index(self.data, "\r\n\r\n")
            except ValueError:
                return # no empty line; continue
            self.header = self.data[:i+2]
            print self.host, "HEADER"
            print
            print self.header
            data = self.data[i+4:]
            self.data = ""
        if data:
            print self.host, "DATA", len(data)

    def handle_close(self):
        self.close()
```
The constructor creates a socket, and issues a connection request. Unlike ordinary sockets, the asynchronous connect method returns immediately; the framework calls the handle_connect method once it’s finished. When this method is called, our class immediately issues an HTTP request for the given RSS file. The framework makes sure that the request is sent as soon as the network is ready.
When the remote computer gets the request, it returns a response message. As data arrives, the handle_read method is called over and over again, until there’s no more data to read. Our handle_read method starts by looking for the header section (or rather, the empty line that identifies the end of the header). After that, it simply prints DATA messages to standard output.
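The header-detection logic can be tried out on its own. In this sketch, split_header is a hypothetical helper that mirrors the index arithmetic in handle_read (using the str.index method instead of the string module; both behave the same way):

```python
def split_header(text):
    # return (header, remaining data), or None if no
    # complete header has arrived yet
    try:
        i = text.index("\r\n\r\n") # end of header
    except ValueError:
        return None # no empty line yet; wait for more data
    # keep the final "\r\n" of the header, skip the empty line
    return text[:i+2], text[i+4:]

response = "HTTP/1.1 200 OK\r\nContent-Type: text/xml\r\n\r\n<rss/>"
header, data = split_header(response)
```

Note the offsets: the slice ending at i+2 keeps the header’s final line terminator, while the slice starting at i+4 skips past the empty line entirely.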
Let’s try this one out on a real site:
```
$ python
>>> from minimal_http_client import async_http
>>> async_http("online.effbot.org", "/rss.xml")
<async_http at 880294>
>>> import asyncore
>>> asyncore.loop()
online.effbot.org HEADER

HTTP/1.1 200 OK
Server: Apache/1.3.22 (Unix)
Last-Modified: Tue, 03 Sep 2002 11:04:09 GMT
ETag: "1e49dc-dfa-3d749729"
Content-Length: 3578
Content-Type: text/xml
Connection: close

online.effbot.org DATA 1139
online.effbot.org DATA 2048
online.effbot.org DATA 391
```
To issue a request, just create an instance of the async_http class. The instance registers itself with the asyncore framework, and all you have to do to run it is to call the asyncore.loop function.
The real advantage here is that you can issue multiple requests at once…
```
>>> async_http("www.scripting.com", "/rss.xml")
<async_http at 8da7a4>
>>> async_http("online.effbot.org", "/rss.xml")
<async_http at 8daf34>
>>> async_http("www.bbc.co.uk",
...     "/syndication/feeds/news/ukfs_news/front_page/rss091.xml")
<async_http at 8db364>
>>> asyncore.loop()
```
…and have the framework process all requests in parallel:
```
online.effbot.org HEADER
...
online.effbot.org DATA 1139
online.effbot.org DATA 2048
online.effbot.org DATA 391
www.scripting.com HEADER
...
www.scripting.com DATA 1189
www.scripting.com DATA 1460
www.bbc.co.uk HEADER
...
www.bbc.co.uk DATA 1766
www.bbc.co.uk DATA 712
www.scripting.com DATA 1460
www.scripting.com DATA 1460
www.scripting.com DATA 1158
```
(Actual headers omitted.)
The actual output may vary depending on your network connection, the servers, and the phase of the moon.
To get a bit more variation, put the above statements in a script and run the script a couple of times.
Storing the RSS Data
The code we’ve used this far simply prints information to the screen. Before moving on to parsing and display issues, let’s add some code to store the RSS data on disk.
The following version adds support for a consumer object, which is called when we’ve read the header, when data is arriving, and when there is no more data. A consumer should implement the following methods:
http_header(client) is called when we’ve read the HTTP header. It’s called with a reference to the client object, and can use attributes like status and header to inspect the response header.
http_failed(client) is similar to http_header, but is called if the framework fails to connect to the remote computer.
feed(data) is called when a number of bytes have been read from the remote computer, after the header has been read.
close() is called when there is no more data.
In addition to consumer support, the following code uses the mimetools module to parse the header into a dictionary-like structure, adds counters for incoming and outgoing data, and uses a factory function that knows how to pull a URL into pieces.
```python
import asyncore
import socket, time
import StringIO
import mimetools, urlparse

class async_http(asyncore.dispatcher_with_send):
    # asynchronous http client

    def __init__(self, host, port, path, consumer):
        asyncore.dispatcher_with_send.__init__(self)
        self.host = host
        self.port = port
        self.path = path
        self.consumer = consumer
        self.status = None
        self.header = None
        self.bytes_in = 0
        self.bytes_out = 0
        self.data = ""
        self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
        self.connect((host, port))

    def handle_connect(self):
        # connection succeeded
        text = "GET %s HTTP/1.0\r\nHost: %s\r\n\r\n" % (self.path, self.host)
        self.send(text)
        self.bytes_out = self.bytes_out + len(text)

    def handle_expt(self):
        # connection failed; notify consumer
        self.close()
        self.consumer.http_failed(self)

    def handle_read(self):
        data = self.recv(2048)
        self.bytes_in = self.bytes_in + len(data)
        if not self.header:
            # check if we've seen a full header
            self.data = self.data + data
            header = self.data.split("\r\n\r\n", 1)
            if len(header) <= 1:
                return # no empty line; continue
            header, data = header
            # parse header
            fp = StringIO.StringIO(header)
            self.status = fp.readline().split(" ", 2)
            self.header = mimetools.Message(fp)
            self.data = ""
            self.consumer.http_header(self)
            if not self.connected:
                return # channel was closed by consumer
        if data:
            self.consumer.feed(data)

    def handle_close(self):
        self.consumer.close()
        self.close()

def do_request(uri, consumer):
    # turn the uri into a valid request
    scheme, host, path, params, query, fragment = urlparse.urlparse(uri)
    assert scheme == "http", "only supports HTTP requests"
    try:
        host, port = host.split(":", 1)
        port = int(port)
    except (TypeError, ValueError):
        port = 80 # default port
    if not path:
        path = "/"
    if params:
        path = path + ";" + params
    if query:
        path = path + "?" + query
    return async_http(host, port, path, consumer)
```
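The do_request factory leans on urlparse to take the URL apart. Here’s what that looks like on its own; the port number in the example URL is made up, to exercise the host:port branch, and the import is guarded because later Python versions moved the function to urllib.parse:

```python
try:
    from urlparse import urlparse # Python 2, as used in this article
except ImportError:
    from urllib.parse import urlparse # later Python versions

scheme, host, path, params, query, fragment = urlparse(
    "http://online.effbot.org:8080/rss.xml?format=xml"
)

# split off an explicit port, falling back to 80, the same
# way do_request does it
try:
    host, port = host.split(":", 1)
    port = int(port)
except (TypeError, ValueError):
    port = 80 # default port
```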
Here’s a small test program that uses the enhanced client and a “dummy” consumer class:
```python
import http_client, asyncore

class dummy_consumer:
    def http_header(self, client):
        self.host = client.host
        print self.host, repr(client.status)
    def http_failed(self, client):
        print client.host, "failed"
    def feed(self, data):
        print self.host, len(data)
    def close(self):
        print self.host, "CLOSE"

URLS = (
    "http://online.effbot.org/rss.xml",
    "http://www.scripting.com/rss.xml",
    "http://www.bbc.co.uk/syndication/feeds" +
        "/news/ukfs_news/front_page/rss091.xml",
    "http://www.example.com/rss.xml",
    )

for url in URLS:
    http_client.do_request(url, dummy_consumer())

asyncore.loop()
```
Here’s some sample output from this test program. Note the 404 error code from the example.com site.
```
online.effbot.org ['HTTP/1.1', '200', 'OK\r\n']
online.effbot.org 1139
online.effbot.org 1460
online.effbot.org 979
online.effbot.org CLOSE
www.bbc.co.uk ['HTTP/1.1', '200', 'OK\r\n']
www.bbc.co.uk 1766
www.bbc.co.uk 711
www.scripting.com ['HTTP/1.1', '200', 'OK\r\n']
www.scripting.com 1189
www.bbc.co.uk CLOSE
www.scripting.com 1460
www.example.com ['HTTP/1.1', '404', 'Not Found\r\n']
www.example.com 269
www.example.com CLOSE
www.scripting.com 1460
www.scripting.com 1460
www.scripting.com 1158
www.scripting.com CLOSE
```
To store things on disk, replace the dummy with a version that writes data to a file:
```python
class file_consumer:

    def http_header(self, client):
        self.host = client.host
        self.file = None

    def http_failed(self, client):
        pass

    def feed(self, data):
        if self.file is None:
            self.file = open(self.host + ".rss", "w")
        self.file.write(data)

    def close(self):
        if self.file is not None:
            print self.host + ".rss ok"
            self.file.close()
            self.file = None
```
If you modify the test program to use this consumer instead of the dummy version, it’ll print something like this:
```
online.effbot.org.rss ok
www.example.com.rss ok
www.bbc.co.uk.rss ok
www.scripting.com.rss ok
```
Three of the four files contain current RSS data. The fourth (from example.com) contains an HTML error message. To avoid storing error messages, it’s probably a good idea to let the consumer check the status field as well as the Content-Type header field. You can do this in the http_header method:
```python
class file_consumer:

    def http_header(self, client):
        # status is a list like ['HTTP/1.1', '200', 'OK\r\n'],
        # so the status code is the second item
        if (client.status[1] != "200" or
            client.header["content-type"] != "text/xml"):
            print client.host, "failed"
            client.close() # bail out
            client.connected = 0
            return
        self.host = client.host
        self.file = None

    ...
```
Note that the consumer can simply call the client’s close method to shut down the connection. The client contains code that checks that it’s still connected after the http_header call, and avoids calling other consumer methods if it’s not.
Update 2002-09-08: not all versions of asyncore clear the connected attribute when the socket is closed. For example, the version shipped with Python 1.5.2 does, but the version shipped with 2.1 doesn’t. To be on the safe side, you have to clear the flag yourself in the consumer.
That’s all for today. In the next article, we’ll look at how to parse at least some variant of the RSS format into a more useful data format.
While waiting, feel free to play with the code we’ve produced this far. Also, don’t forget to take a look at the RSS data files we just downloaded. Mark Nottingham’s RSS tutorial contains links to more information on various RSS formats.