Sandbox: SourceForge Tools (In Progress)
Fredrik Lundh | April 2006
Note: The sourceforge page layout was changed slightly after the first version of these tools was released. We’re working on a new version, but if you want to use the tools to experiment with older tracker snapshots, you need version 200604. See below.
The sourceforge sandbox (dead link) contains a set of simple tools to download and process sourceforge tracker data.
To run the download tools, you also need the tidy utility.
Current Version (200608, Work in Progress)
To download the current version of the tools, use Subversion:
$ svn co http://svn.effbot.python-hosting.com/stuff/sandbox/sourceforge
Previous version (200604)
This version is compatible with the sourceforge tracker layout used in April 2006.
$ svn co http://svn.effbot.python-hosting.com/tags/sourceforge-200604/
A snapshot of the Python tracker data from April 2006 can be downloaded here:
tracker-20060403.zip (dead link) [~10000 items, 80 MB]
Tracker Datasets
Tracker data is represented as a set of files in a tracker directory. For each tracker item, there are at least two files:
tracker-TTT/item-NNN.xml (index information, created by getindex.py)
tracker-TTT/item-NNN-page.xml (xhtml pages, created by getpages.py)
where TTT is the tracker identifier, and NNN is the item identifier.
For items that have attached files, there’s also one or more
tracker-TTT/item-NNN-data-MMM.dat (data files, created by getfiles.py)
files, where MMM is a file identifier (referred to by the page files). Each data file consists of a copy of the HTTP header (which includes content-type and content-disposition headers), followed by an empty line, and the actual data.
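The data file layout described above (header lines, an empty line, then the payload) can be split with a few lines of Python. This is a minimal sketch and not part of the sandbox tools: `split_data_file` is a hypothetical name, and the code assumes the header block may use either CRLF or LF line endings and may or may not start with an HTTP status line.

```python
def split_data_file(raw):
    """Split the bytes of a data file into (headers, body).

    Assumption: the header block ends at the first empty line, and
    line endings may be CRLF or LF. Header names are lowercased.
    """
    # Locate the earliest empty line, whichever line ending is used.
    candidates = []
    for sep in (b"\r\n\r\n", b"\n\n"):
        pos = raw.find(sep)
        if pos >= 0:
            candidates.append((pos, len(sep)))
    if not candidates:
        raise ValueError("no empty line between header and data")
    pos, sep_len = min(candidates)
    head, body = raw[:pos], raw[pos + sep_len:]

    headers = {}
    for line in head.splitlines():
        text = line.decode("latin-1")
        if ":" not in text:
            continue  # skips a status line such as "HTTP/1.1 200 OK"
        name, value = text.split(":", 1)
        headers[name.strip().lower()] = value.strip()
    return headers, body
```

To use it on a real file, read the file in binary mode and pass the bytes in; the content-type and content-disposition values are then available as `headers.get("content-type")` and `headers.get("content-disposition")`.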
Note that the datasets contain complete HTML pages. This lets you fix bugs in the extraction tools without having to reload everything again (or download large existing datasets).
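The file naming scheme above is enough to take a quick inventory of a tracker directory. The following sketch is not one of the sandbox scripts; `inventory` and `summarize` are hypothetical helpers, and the pattern assumes that item identifiers are numeric and that file identifiers contain no dots.

```python
import re
from collections import defaultdict
from pathlib import Path

# Matches item-NNN.xml, item-NNN-page.xml and item-NNN-data-MMM.dat.
# Assumption: NNN is numeric and MMM contains no dots.
ITEM_FILE = re.compile(
    r"^item-(\d+)(?:-(page)\.xml|-data-([^.]+)\.dat|\.xml)$"
)

def inventory(tracker_dir):
    """Group the files in a tracker directory by item identifier."""
    items = defaultdict(lambda: {"index": False, "page": False, "data": []})
    for path in Path(tracker_dir).iterdir():
        match = ITEM_FILE.match(path.name)
        if not match:
            continue  # not a tracker item file
        item_id, page, file_id = match.groups()
        entry = items[item_id]
        if page:
            entry["page"] = True
        elif file_id:
            entry["data"].append(file_id)
        else:
            entry["index"] = True
    return dict(items)

def summarize(items):
    """Return (index file count, page file count, data file count)."""
    n_items = sum(1 for entry in items.values() if entry["index"])
    n_pages = sum(1 for entry in items.values() if entry["page"])
    n_files = sum(len(entry["data"]) for entry in items.values())
    return n_items, n_pages, n_files
```

Calling `summarize(inventory("tracker-TTT"))` on a downloaded tracker directory gives the three counts, which is also a simple way to spot items whose page file is still missing.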
Processing Tracker Datasets
To process tracker datasets, use the extract module to extract relevant information from item-NNN-page.xml files. See the export scripts for examples.
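The interface of the extract module is not shown here, so the sketch below does not use it. It only illustrates the general idea of pulling one piece of information (the page title) out of a page file with the standard library, assuming the page files are well-formed XHTML that parses as XML. `page_title` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

def page_title(xml_data):
    """Return the whitespace-normalized <title> text, or None.

    Assumption: xml_data is well-formed XML (str or bytes). The
    title element is matched with or without the XHTML namespace.
    """
    root = ET.fromstring(xml_data)
    for elem in root.iter():
        if elem.tag == "title" or elem.tag.endswith("}title"):
            text = "".join(elem.itertext())
            return " ".join(text.split())
    return None
```

For a file on disk, read it in binary mode and pass the bytes to `page_title`, so that any encoding declaration in the file is honored by the parser.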
More export scripts, bug fixes, and other contributions are welcome.
Downloading and Updating Tracker Datasets
To download tracker datasets, run 'init' to set things up, and use the getindex/getpages/getfiles scripts to download items.

* init

The 'init' script is used to select which tracker to download. It asks for a tracker "group id". To get the group id for your project, check the URL for the tracker homepage. If you press return, the group id defaults to 5470, which is the group id for the Python tracker.

The 'init' script downloads the tracker homepage, and creates tracker directories for the individual trackers used by the given project.

    $ python init.py
    enter sourceforge tracker group id : 1234
    --- create tracker-123456

You only have to run the 'init' script once for each project.

* getindex

The 'getindex' script parses the tracker index, and creates item files that contain overview information from the index pages. Usage:

    $ python getindex.py tracker-123456 [offset]

If the offset is omitted, the parser starts at offset 0, and keeps going until it gets an index page for which all items have already been downloaded. If an offset is given, the parser keeps going until it cannot find any more items.

You can use the output from 'getindex' to generate tracker statistics. To get more information about the items, use the 'getpages' and 'getfiles' scripts.

* getpages

The 'getpages' script looks for item files, and downloads missing page files.

    $ python getpages.py tracker-123456

To refresh the page files, remove them from the tracker directory, and run the 'getpages' script again.

    $ rm tracker-123456/*-page.xml
    $ python getpages.py tracker-123456

* getfiles

The 'getfiles' script, finally, looks for download links in the page files, and downloads missing data files.

    $ python getfiles.py tracker-123456

* status

The 'status' script can be used to get a download status summary:

    $ python status.py tracker-123456
    6682 items
    6682 pages (100%)
    1912 files
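The two stopping rules described for 'getindex' (with and without an offset) can be sketched as a small loop. This is an illustration only, not the code in getindex.py: `crawl_index`, `fetch_page` and `have_item` are hypothetical names, and the sketch assumes that the offset counts items rather than pages and that only missing items need to be recorded.

```python
def crawl_index(fetch_page, have_item, offset=None):
    """Walk the tracker index and return the ids of missing items.

    fetch_page(offset) -> list of item ids on that index page
                          (hypothetical; an empty list means no more items)
    have_item(item_id) -> True if the item file already exists
                          (hypothetical)
    """
    # Without an explicit offset, stop at the first page that holds
    # nothing new. With an offset, walk until the index runs out.
    stop_when_known = offset is None
    pos = 0 if offset is None else offset
    new_items = []
    while True:
        page_items = fetch_page(pos)
        if not page_items:
            break  # no more items in the index
        missing = [item for item in page_items if not have_item(item)]
        if stop_when_known and not missing:
            break  # everything on this page was downloaded earlier
        new_items.extend(missing)
        pos += len(page_items)  # assumption: offsets count items
    return new_items
```

The first rule makes routine updates cheap as long as new items appear at the start of the index. Passing an explicit offset forces a full walk, which helps when an earlier run was interrupted partway through.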