Term extraction

Fredrik Lundh | November 2005 | Originally posted to online.effbot.org

Erik Stattin linked to this page, which led me to this page, which reminded me of this, which inspired me to whip up this little script:

# File: YahooTermExtraction.py
#
# An interface to Yahoo's Term Extraction service:
#
# http://developer.yahoo.net/search/content/V1/termExtraction.html
#
# "The Term Extraction Web Service provides a list of significant
# words or phrases extracted from a larger content."
#

import urllib
try:
    from xml.etree import ElementTree # 2.5 and later
except ImportError:
    from elementtree import ElementTree

URI = "http://api.search.yahoo.com"
URI = URI + "/ContentAnalysisService/V1/termExtraction"

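def termExtraction(appid, context, query=None):
    # context and query should be Unicode strings (or plain ASCII);
    # they are sent to the service as UTF-8
    d = dict(
        appid=appid,
        context=context.encode("utf-8")
        )
    if query:
        d["query"] = query.encode("utf-8")
    result = []
    # passing a data argument makes urlopen issue a POST request
    f = urllib.urlopen(URI, urllib.urlencode(d))
    # collect the text of each Result element as the response is parsed
    for event, elem in ElementTree.iterparse(f):
        if elem.tag == "{urn:yahoo:cate}Result":
            result.append(elem.text)
    return result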

Usage:

>>> import urllib
>>> from YahooTermExtraction import termExtraction
>>> appid = "/your app id/"
>>> uri = "/some uri/"
>>> text = urllib.urlopen(uri).read()
>>> termExtraction(appid, text)[-5:]
['horrible picture', 'logo', 'spammer', 'moron', 'cat mouse']
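
The script above targets Python 2. For readers on Python 3, here is a rough sketch of the same idea with the parsing step split out so it can be tried without network access. The names `parse_terms` and `term_extraction` and the `SAMPLE` document are illustrative assumptions rather than part of the original script. The sample XML is hand-written around the `urn:yahoo:cate` namespace the script checks for and is not a captured response, and the endpoint itself may no longer respond.

```python
import io
import urllib.parse
import urllib.request
from xml.etree import ElementTree

URI = "http://api.search.yahoo.com/ContentAnalysisService/V1/termExtraction"


def parse_terms(stream):
    """Collect the text of every {urn:yahoo:cate}Result element in stream."""
    result = []
    for event, elem in ElementTree.iterparse(stream):
        if elem.tag == "{urn:yahoo:cate}Result":
            result.append(elem.text)
    return result


def term_extraction(appid, context, query=None):
    """POST the text to the service and return the extracted terms."""
    d = {"appid": appid, "context": context}
    if query:
        d["query"] = query
    # urlencode() encodes str values as UTF-8 by default in Python 3;
    # urlopen() wants the POST body as bytes.
    data = urllib.parse.urlencode(d).encode("ascii")
    with urllib.request.urlopen(URI, data) as f:
        return parse_terms(f)


# Hand-written stand-in for a response (an assumption, not real output).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<ResultSet xmlns="urn:yahoo:cate">
<Result>horrible picture</Result>
<Result>logo</Result>
</ResultSet>"""

terms = parse_terms(io.BytesIO(SAMPLE))
print(terms)
```

Keeping the parser separate from the HTTP call means the namespace handling can be checked against a canned document before pointing the code at a live service.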

(For best results, you should probably run the text through an HTML-to-text conversion before you send it to Yahoo. Some variation of this script might be useful.)
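
As a rough illustration of such a conversion, and not the script referred to above, here is a minimal Python 3 sketch built on the standard library's `html.parser`. The names `TextExtractor` and `html_to_text` are made up for this example. It keeps text nodes, drops the contents of `script` and `style` elements, and collapses whitespace.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Gather text nodes, ignoring anything inside script/style."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join fragments with a space, then collapse runs of whitespace.
    return " ".join(" ".join(parser.parts).split())


text = html_to_text(
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>A <b>horrible</b> picture &amp; a logo.</p></body></html>"
)
print(text)
```

Joining fragments with a space keeps words in adjacent elements from running together, at the cost of splitting a word that is only partly wrapped in inline markup. For term extraction that trade-off seems acceptable.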