The urlparse module
This module contains functions to process uniform resource locators (URLs): splitting them into components, putting the components back together, and resolving relative URLs against a base URL.
Example: Using the urlparse module
# File: urlparse-example-1.py

import urlparse

print urlparse.urlparse("http://host/path;params?query#fragment")

('http', 'host', '/path', 'params', 'query', 'fragment')
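The six components can also be read by name rather than by position. The following is a minimal sketch, assuming Python 2.5 or later (where the result gained named attributes); the fallback import from urllib.parse is an assumption added here so the snippet also runs on Python 3.

```python
try:
    from urlparse import urlparse          # Python 2, as in this article
except ImportError:
    from urllib.parse import urlparse      # Python 3 (assumed equivalent)

result = urlparse("http://host/path;params?query#fragment")

# The result still unpacks like a plain 6-tuple ...
scheme, netloc, path, params, query, fragment = result

# ... and the same values are reachable by name.
print(result.scheme)    # http
print(result.netloc)    # host
print(result.path)      # /path
print(result.fragment)  # fragment
```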
A common use is to split an HTTP URL into host and path components (an HTTP request involves asking the host to return data identified by the path):
Example: Using the urlparse module to parse HTTP locators
# File: urlparse-example-2.py

import urlparse

scheme, host, path, params, query, fragment =\
        urlparse.urlparse("http://host/path;params?query#fragment")

if scheme == "http":
    print "host", "=>", host
    if params:
        path = path + ";" + params
    if query:
        path = path + "?" + query
    print "path", "=>", path

host => host
path => /path;params?query
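The same host/path split can be wrapped in a small reusable function. This is only a sketch: split_http_url is a hypothetical helper name, not part of the module, and the fallback import is an assumption so the code also runs on Python 3.

```python
try:
    from urlparse import urlparse          # Python 2, as in this article
except ImportError:
    from urllib.parse import urlparse      # Python 3 (assumed equivalent)


def split_http_url(url):
    """Return (host, request_path) for an HTTP URL (hypothetical helper)."""
    scheme, host, path, params, query, fragment = urlparse(url)
    if scheme != "http":
        raise ValueError("not an HTTP URL: %r" % (url,))
    # Re-attach the parameters and the query; the fragment is dropped,
    # just as in the example above.
    if params:
        path = path + ";" + params
    if query:
        path = path + "?" + query
    return host, path


host, request_path = split_http_url("http://host/path;params?query#fragment")
print(host)          # host
print(request_path)  # /path;params?query
```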
Alternatively, you can use the urlunparse function to put the URL back together again:
Example: Using the urlparse module to parse HTTP locators
# File: urlparse-example-3.py

import urlparse

scheme, host, path, params, query, fragment =\
        urlparse.urlparse("http://host/path;params?query#fragment")

if scheme == "http":
    print "host", "=>", host
    print "path", "=>", urlparse.urlunparse((None, None, path, params, query, None))

host => host
path => /path;params?query
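As a rough check that urlunparse is the inverse of urlparse, the sketch below round-trips the sample URL and then rebuilds only the request path. It passes empty strings rather than None for the unused components, which is a choice made here for portability, and the fallback import is an assumption so the code also runs on Python 3.

```python
try:
    from urlparse import urlparse, urlunparse      # Python 2, as in this article
except ImportError:
    from urllib.parse import urlparse, urlunparse  # Python 3 (assumed equivalent)

url = "http://host/path;params?query#fragment"
parts = urlparse(url)

# Putting all six components back together reproduces this URL.
rebuilt = urlunparse(parts)
print(rebuilt == url)  # True

# Blanking the scheme, host and fragment leaves only the request path.
scheme, host, path, params, query, fragment = parts
request_path = urlunparse(("", "", path, params, query, ""))
print(request_path)    # /path;params?query
```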
The urljoin function is used to combine an absolute URL with a second, possibly relative URL:
Example: Using the urlparse module to combine relative locators
# File: urlparse-example-4.py

import urlparse

base = "http://spam.egg/my/little/pony"

for path in "/index", "goldfish", "../black/cat":
    print path, "=>", urlparse.urljoin(base, path)

/index => http://spam.egg/index
goldfish => http://spam.egg/my/little/goldfish
../black/cat => http://spam.egg/my/black/cat
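Two further cases may help when reading the output above: whether the base ends in a slash changes the result, and an absolute second URL replaces the base entirely. This is a small sketch; the host other.example is made up for illustration, and the fallback import is an assumption so the code also runs on Python 3.

```python
try:
    from urlparse import urljoin          # Python 2, as in this article
except ImportError:
    from urllib.parse import urljoin      # Python 3 (assumed equivalent)

base = "http://spam.egg/my/little/pony"

# Without a trailing slash, the last segment ("pony") is replaced.
no_slash = urljoin(base, "goldfish")
print(no_slash)    # http://spam.egg/my/little/goldfish

# With a trailing slash, the base is treated as a directory.
with_slash = urljoin(base + "/", "goldfish")
print(with_slash)  # http://spam.egg/my/little/pony/goldfish

# An absolute second URL is returned unchanged.
absolute = urljoin(base, "http://other.example/x")
print(absolute)    # http://other.example/x
```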

Comment:
Really nifty module, thanks. I'm rolling my own URIs in addition to the standard http, ftp and file ones, and one feature that would be really useful is the ability to add or modify how specific schemes are parsed. To that end, insert the following code somewhere in urlparse.py:
def edit_protocol(scheme, relative=1, netloc=1, hierarchical=1,
                  params=1, query=1, fragment=1):
    """Add, remove or update a specific scheme, and how it's handled"""
    clear_cache()
    global uses_relative, uses_netloc, non_hierarchical
    global uses_params, uses_query, uses_fragment
    if relative == 1:
        uses_relative.append(scheme)
    elif relative == 0:
        uses_relative = [x for x in uses_relative if x != scheme]
    if netloc == 1:
        uses_netloc.append(scheme)
    elif netloc == 0:
        uses_netloc = [x for x in uses_netloc if x != scheme]
    if hierarchical == 0:
        non_hierarchical.append(scheme)
    elif hierarchical == 1:
        non_hierarchical = [x for x in non_hierarchical if x != scheme]
    if params == 1:
        uses_params.append(scheme)
    elif params == 0:
        uses_params = [x for x in uses_params if x != scheme]
    if query == 1:
        uses_query.append(scheme)
    elif query == 0:
        uses_query = [x for x in uses_query if x != scheme]
    if fragment == 1:
        uses_fragment.append(scheme)
    elif fragment == 0:
        uses_fragment = [x for x in uses_fragment if x != scheme]

Use it like so:
from urlparse import urlparse, edit_protocol

uris = [
    'data://tabledata/data/XML',
    'http://127.0.0.1/hello.html',
    'file:///C:/windows/system_file.txt',
]

for uri in uris:
    print urlparse(uri)

edit_protocol("data", 1, 1, 1, 1, 1, 1)
# (or edit_protocol("data", relative=1, netloc=1, hierarchical=1,
#                   params=1, query=1, fragment=1))

for uri in uris:
    print urlparse(uri)

Output:
('data', '', '//tabledata/data/XML', '', '', '')
...
('data', 'tabledata', '/data/XML', '', '', '')

Posted by Robin Macharg (2007-03-01)