July 2003 | Fredrik Lundh
This is the fifth article covering the effnews project, a simple RSS newsreader written in Python. Other articles in this series are available via this page.
This article is being edited.
Improving the RSS Support #
Supporting Non-XML Character Entities #
Many RSS feeds embed non-XML character entities in the description and title fields. This is allowed by the original 0.9 and 0.91 standards, but it’s unclear whether later standards really support this. Not that the standards matter here; feeds of all kinds use the entities, so we have to deal with them anyway.
The xmllib parser uses an entitydefs dictionary to translate entities to character strings. If an entity is not defined by this dictionary, the parser calls the unknown_entityref method. The following addition to our rss_parser class adds all standard HTML entities to the entitydefs dictionary the first time the method is called, and replaces any entity that is still unknown with an empty string.
    class rss_parser(xmllib.XMLParser):
        ...
        htmlentitydefs = None

        def unknown_entityref(self, entity):
            if not self.htmlentitydefs:
                # lazy loading of entitydefs table
                import htmlentitydefs
                # make sure we don't overwrite entities already present in
                # the entitydefs dictionary (doing so will confuse xmllib)
                entitydefs = htmlentitydefs.entitydefs.copy()
                entitydefs.update(self.entitydefs)
                self.entitydefs = self.htmlentitydefs = entitydefs
            self.handle_data(self.entitydefs.get(entity, ""))
        ...
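To see the lookup strategy on its own, here's a minimal sketch for current Python 3, where html.entities replaces the htmlentitydefs module and xmllib is no longer available. The resolve_entity and expand_entities helpers are hypothetical names used only for illustration; they are not part of effnews. As in the class above, entities the parser already knows take precedence, and anything unknown becomes an empty string.

```python
import re
from html.entities import name2codepoint

# matches named entity references such as &eacute; (numeric
# references like &#233; are deliberately left alone here)
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def resolve_entity(name, known=None):
    """Return the replacement text for an entity name.

    'known' is a hypothetical stand-in for the parser's own
    entitydefs dictionary; its entries win over the HTML table.
    """
    if known and name in known:
        return known[name]
    codepoint = name2codepoint.get(name)
    if codepoint is None:
        return ""  # unknown entity: drop it
    return chr(codepoint)

def expand_entities(text, known=None):
    """Replace every named entity reference in a string."""
    return ENTITY_RE.sub(lambda m: resolve_entity(m.group(1), known), text)
```

With this sketch, expand_entities("caf&eacute; &bogus;x") gives "café x": the HTML entity is translated, and the unknown one disappears.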
Handling Non-ASCII Character Sets #
Handling Windows CP1252 Gremlins #
Improving the HTTP Support #
Dealing With Different Content Types #
Using a list of feeds from Syndic8.com, I’ve tried the current RSS parser (including the entity support) on just over 2000 RSS feeds. The result isn’t very encouraging:
2010 feeds checked
137 feeds (6.8%) successfully read:
rss 0.9: 17 feeds
rss 0.91: 84 feeds
rss 0.91fn: 2 feeds
rss 0.92: 20 feeds
rss 1.0: 10 feeds
rss 2.0: 4 feeds
As it turns out, the problem isn’t so much the parser as the protocol layer; the current code only accepts responses if they’re using the text/xml content type. Here’s a breakdown of the feeds that returned a valid HTTP response. The following list shows the HTTP status code (200=OK) and the specified content type:
    200 'text/plain; charset=utf-8': 1 feed
    301 'text/html; charset=iso-8859-1': 1 feed
    200 'text/html;charset=iso-8859-1': 1 feed
    200 'text/xml; charset=utf-8': 1 feed
    403 'text/html; charset=iso-8859-1': 1 feed
    200 'text/XML': 1 feed
    302 'text/html; charset=ISO-8859-1': 1 feed
    200 'application/x-cdf': 1 feed
    200 'application/unknown': 1 feed
    200 'httpd/unix-directory': 2 feeds
    200 'text/rdf': 2 feeds
    200 'application/rss+xml': 2 feeds
    200 'text/xml; charset=ISO-8859-1': 2 feeds
    404 'text/html; charset=iso-8859-1': 3 feeds
    200 'application/sgml': 3 feeds
    302 'text/html; charset=iso-8859-1': 4 feeds
    200 'text/html; charset=iso-8859-1': 4 feeds
    200 'application/x-netcdf': 5 feeds
    200 'text/plain; charset=ISO-8859-1': 7 feeds
    200 'text/plain; charset=iso-8859-1': 8 feeds
    200 'application/octet-stream': 10 feeds
    200 'application/xml': 18 feeds
    200 'text/html': 42 feeds
    200 'text/xml': 191 feeds
    200 'text/plain': 1660 feeds
Most feeds are returned as text/plain, and many use little-known (or unregistered) content types. The charset parameter is also somewhat common.
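The list also shows the same content type spelled in several ways (text/xml and text/XML, with and without a space before charset). As an illustration only, here's a small Python 3 sketch that splits a Content-Type value into a lower-cased media type and a parameter dictionary. The split_content_type name is hypothetical, and the function isn't part of the effnews code.

```python
def split_content_type(value):
    """Split a Content-Type header value into (media type, parameters)."""
    media_type, _, rest = value.partition(";")
    params = {}
    for item in rest.split(";"):
        key, sep, val = item.partition("=")
        if sep:
            # parameter names are case-insensitive; values are kept as-is
            params[key.strip().lower()] = val.strip().strip('"')
    return media_type.strip().lower(), params
```

For example, both "text/xml; charset=ISO-8859-1" and "text/XML" come out with the media type "text/xml", and the first one also yields a charset parameter.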
If we remove the check for content type from the http_rss_parser class, we get the following result:
    class http_rss_parser(rss_parser.rss_parser):
        ...
        def http_header(self, client):
            if client.status[1] != "200":
                raise http_client.CloseConnection
1746 feeds (86.9%) successfully read:
rss unknown: 1 feed
rss 0.9: 55 feeds
rss 0.91: 1623 feeds
rss 0.91fn: 2 feeds
rss 0.92: 22 feeds
rss 1.0: 39 feeds
rss 2.0: 4 feeds
Handling Redirection #
    class http_rss_parser(rss_parser.rss_parser):
        ...
        def http_header(self, client):
            if client.status[1].startswith("3"):
                ... redirect ...
                location = client.header["location"]
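One detail worth keeping in mind is that the Location header may hold a relative URL, and that a chain of redirects shouldn't be followed forever. Here's a rough Python 3 sketch of that part only, not the redirect code used by effnews. The redirect_target helper, its arguments, and the MAX_REDIRECTS limit are all assumptions made for this example.

```python
from urllib.parse import urljoin

MAX_REDIRECTS = 5  # hypothetical limit, chosen for this example

def redirect_target(current_url, status, headers, hops=0):
    """Return the absolute URL to fetch next, or None.

    'status' is the status code as a string, and 'headers' is a
    dictionary with lower-case header names (both are assumptions).
    """
    if not status.startswith("3"):
        return None
    if hops >= MAX_REDIRECTS:
        raise ValueError("too many redirects")
    location = headers.get("location")
    if not location:
        # e.g. 304 Not Modified, which has no new location
        return None
    # resolve relative locations against the URL we just requested
    return urljoin(current_url, location)
```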
Handling Other Status Codes #
    class http_rss_parser(rss_parser.rss_parser):
        ...
        def http_header(self, client):
            status = client.status[1]
            status_category = status[:1]
            if status_category == "3":
                ... redirect ...
                location = client.header["location"]
            elif status_category == "2":
                ... accept ...
            else:
                ...
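The first digit of the status code identifies the response class defined by HTTP, which is why the code above only looks at status[:1]. As a small standalone illustration in Python 3, the following sketch maps that digit to the class name. The status_class function and the "unknown" fallback are hypothetical and not part of effnews; what the parser should do for each class is left open here, as it is above.

```python
# response classes as defined by the HTTP specification
STATUS_CLASSES = {
    "1": "informational",
    "2": "success",
    "3": "redirection",
    "4": "client error",
    "5": "server error",
}

def status_class(status):
    """Map a status code string such as "404" to its response class."""
    return STATUS_CLASSES.get(status[:1], "unknown")
```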
