['term', 'extraction']

Fredrik Lundh | November 2005 | Originally posted to online.effbot.org

 

Erik Stattin linked to this page (dead link), which led me to this page (dead link), which reminded me of this, which inspired me to whip up this little script:

# File: YahooTermExtraction.py
#
# An interface to Yahoo's Term Extraction service:
#
# http://developer.yahoo.net/search/content/V1/termExtraction.html
#
# "The Term Extraction Web Service provides a list of significant
# words or phrases extracted from a larger content."
#

import urllib
try:
    from xml.etree import ElementTree # 2.5 and later
except ImportError:
    from elementtree import ElementTree

URI = "http://api.search.yahoo.com"
URI = URI + "/ContentAnalysisService/V1/termExtraction"

def termExtraction(appid, context, query=None):
    d = dict(
        appid=appid,
        context=context.encode("utf-8")
        )
    if query:
        d["query"] = query.encode("utf-8")
    result = []
    f = urllib.urlopen(URI, urllib.urlencode(d))
    for event, elem in ElementTree.iterparse(f):
        if elem.tag == "{urn:yahoo:cate}Result":
            result.append(elem.text)
    return result

Usage:

 
>>> import urllib
>>> from YahooTermExtraction import termExtraction
>>> appid = "/your app id/"
>>> uri = "/some uri/"
>>> text = urllib.urlopen(uri).read()
>>> termExtraction(appid, text)[-5:]
['horrible picture', 'logo', 'spammer', 'moron', 'cat mouse']
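
The script above is written for Python 2. As a rough sketch only, the same approach might look like this on Python 3, where `urlopen` and `urlencode` live in `urllib.request` and `urllib.parse`. The helper names (`build_post_data`, `parse_terms`, `term_extraction`) and the sample response are made up for illustration; the sample merely follows the `urn:yahoo:cate` namespace the script checks for, and the service itself may no longer be reachable.

```python
import io
import urllib.parse
import urllib.request
from xml.etree import ElementTree

URI = "http://api.search.yahoo.com"
URI = URI + "/ContentAnalysisService/V1/termExtraction"

def build_post_data(appid, context, query=None):
    # urlencode percent-encodes str values as UTF-8 by default;
    # urlopen wants the POST body as bytes
    d = {"appid": appid, "context": context}
    if query:
        d["query"] = query
    return urllib.parse.urlencode(d).encode("ascii")

def parse_terms(f):
    # f is any binary file-like object holding the XML response
    result = []
    for event, elem in ElementTree.iterparse(f):
        if elem.tag == "{urn:yahoo:cate}Result":
            result.append(elem.text)
    return result

def term_extraction(appid, context, query=None):
    # does the actual HTTP POST (not exercised in this sketch)
    data = build_post_data(appid, context, query)
    with urllib.request.urlopen(URI, data) as f:
        return parse_terms(f)

# Hypothetical response, invented here to exercise the parser offline
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<ResultSet xmlns="urn:yahoo:cate">
  <Result>spammer</Result>
  <Result>cat mouse</Result>
</ResultSet>"""

print(parse_terms(io.BytesIO(SAMPLE)))  # ['spammer', 'cat mouse']
```

Splitting the parsing from the network call means the parser can be tried against a canned response without an app id.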

(For best results, you should probably run the text through an HTML-to-text conversion before you send it to Yahoo. Some variation of this script might be useful.)
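
For illustration, here is a minimal HTML-to-text sketch for Python 3, built on the standard library's `html.parser`. It is not the script referred to above, just one plausible way to do the conversion; the names `TextExtractor` and `html_to_text` are hypothetical.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text content, skipping <script> and <style> bodies."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # join with spaces so text from adjacent elements does not run
    # together, then collapse all whitespace runs to single spaces
    return " ".join(" ".join(parser.parts).split())
```

Usage might look like this:

```python
>>> html_to_text("<p>Hello &amp; <b>world</b></p><script>var x = 1;</script>")
'Hello & world'
```

Joining on spaces also splits words that have markup in the middle (`wor<b>ld</b>` becomes `wor ld`), which is probably an acceptable trade-off for term extraction.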

 
