The HTMLParser Module
(New in 2.2) This module provides an improved HTML parser. In many cases, it can be used as a replacement for sgmllib.
Like the other parsers in the standard library, this parser implements the standard feed/close consumer protocol, and calls methods on itself to handle the various parts of the HTML document. To use the parser, create a subclass where you override the methods you’re interested in.
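As a rough illustration of the feed/close protocol described above, the following sketch feeds a tiny document to the parser in arbitrary pieces and records every callback it receives. The EventLogger class and its events list are hypothetical names made up for this illustration. The try/except import is an assumption added so the sketch also runs on Python 3, where the module is named html.parser.

```python
try:
    from html.parser import HTMLParser  # Python 3 name of the module
except ImportError:
    from HTMLParser import HTMLParser   # Python 2 name, as used in this section

class EventLogger(HTMLParser):
    """Hypothetical subclass that records each callback it receives."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # text may be delivered in more than one call
        self.events.append(("data", data))

p = EventLogger()

# feed the document in arbitrary pieces; incomplete markup is
# buffered until the rest of it arrives
for chunk in ['<p cla', 'ss="intro">Hel', 'lo</p>']:
    p.feed(chunk)

p.close()  # process anything still buffered

for event in p.events:
    print(event)
```

Note that the start tag is reported only once, even though it was split across two feed calls. The text, on the other hand, may show up as several data events, so a handler that needs the complete text should collect the pieces itself.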
This example extracts anchor links from an HTML document:
    import HTMLParser

    class AnchorParser(HTMLParser.HTMLParser):

        def __init__(self):
            self.anchors = []
            self.reset() # initialize the base parser state

        def handle_starttag(self, tag, attrs):
            # attrs is a list of (name, value) pairs
            if tag == "a":
                for k, v in attrs:
                    if k == "href":
                        self.anchors.append(v)
                        break

    f = open("sample.html")

    p = AnchorParser()
    p.feed(f.read())
    p.close()

    print p.anchors
Here’s an alternate driver that lets you iterate over the anchors, as they are found by the parser:
    class AnchorParser:
        ...

    def getanchors(file):
        p = AnchorParser()
        while 1:
            # get some data from the source
            s = file.read(16384)
            if s:
                p.feed(s)
            else:
                p.close()
            # return anchors to caller
            for anchor in p.anchors:
                yield anchor
            if not s:
                break
            p.anchors[:] = [] # reset the list

    # read from a file
    for anchor in getanchors(open("index.html")):
        print anchor

    # read from a remote site
    from urllib import urlopen
    for anchor in getanchors(urlopen("http://www.python.org")):
        print anchor
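To try the iterating driver without a file or a network connection, it can be pointed at an in-memory stream. The sketch below is a self-contained variant of the code above, not a replacement for it. The blocksize parameter, the sample markup, and the use of io.StringIO are assumptions added for illustration, and the try/except import is there only so the sketch also runs on Python 3.

```python
import io

try:
    from html.parser import HTMLParser  # Python 3 name of the module
except ImportError:
    from HTMLParser import HTMLParser   # Python 2 name, as used in this section

class AnchorParser(HTMLParser):

    def __init__(self):
        HTMLParser.__init__(self)
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for k, v in attrs:
                if k == "href":
                    self.anchors.append(v)
                    break

def getanchors(file, blocksize=16384):
    # blocksize is a hypothetical parameter, added so that a small
    # sample can be split into several reads
    p = AnchorParser()
    while True:
        s = file.read(blocksize)
        if s:
            p.feed(s)
        else:
            p.close()
        # hand over whatever has been found so far
        for anchor in p.anchors:
            yield anchor
        if not s:
            break
        del p.anchors[:]  # reset the list

# made-up sample data; the second anchor has no href attribute
sample = io.StringIO(
    u'<a href="one.html">1</a> <a name="x">no href</a> '
    u'<a href="two.html">2</a>'
)

# a tiny block size forces tags to be split across reads
found = list(getanchors(sample, blocksize=8))
print(found)
```

With a block size of eight characters, every tag in the sample is split across at least two reads, so this also exercises the parser's buffering of incomplete tags. The anchor without an href attribute is skipped, and the two remaining links are returned in document order.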