We're back after a server migration that caused effbot.org to fall over a bit harder than expected. Expect some glitches.

The stringlib Library

Updated May 2006 | Posted August 2004 | Fredrik Lundh

The stringlib library is an experimental collection of alternative string operations for Python.

The fast search algorithm described below was added to Python 2.5 during the Need For Speed sprint in Reykjavik. It’s currently used by the 8-bit string and Unicode implementations.

Other components and concepts may appear in future Python releases.

The Fast Search Algorithm (aka “BMHBNFS” ;-)

The new find implementation is typically 2-10 times faster than the one used in Python 2.3 (using a reasonably representative set of test strings, that is). For the find portions of the stringbench test suite, the new algorithm is up to 26 times faster.

The same algorithm can be used for index, split, replace, __contains__, and the SRE prefix scanner.

When designing the new algorithm, I used the following constraints:

  • should be faster than the current brute-force algorithm for all test cases (based on real-life code), including Jim Hugunin’s worst-case test
  • small setup overhead; no dynamic allocation in the fast path (O(m) for speed, O(1) for storage)
  • sublinear search behaviour in good cases (O(n/m))
  • no worse than the current algorithm in worst case (O(nm))
  • should work well for both 8-bit strings and 16-bit or 32-bit Unicode strings (no O(σ) dependencies)
  • many real-life searches should be good, very few should be worst case
  • reasonably simple implementation

This rules out most standard algorithms (Knuth-Morris-Pratt is not sublinear, Boyer-Moore needs tables that depend on both the alphabet size and the pattern size, most Boyer-Moore variants need tables that depend on the pattern size, etc.).

After some tweaking, I came up with a simplication of Boyer-Moore, incorporating ideas from Horspool and Sunday. Here’s an outline:

def find(s, p):
    # find first occurrence of p in s
    n = len(s)
    m = len(p)
    skip = delta1(p)[p[m-1]]
    i = 0
    while i <= n-m:
        if s[i+m-1] == p[m-1]: # (boyer-moore)
            # potential match
            if s[i:i+m-1] == p[:m-1]:
                return i
            if s[i+m] not in p:
                i = i + m + 1 # (sunday)
            else:
                i = i + skip # (horspool)
        else:
            # skip
            if s[i+m] not in p:
                i = i + m + 1 # (sunday)
            else:
                i = i + 1
    return -1 # not found

The delta1(p)[p[m-1]] value is simply the Boyer-Moore delta1 (or bad-character skip) value for the last character in the pattern.

For the s[i+m] not in p test, I use a 32-bit bitmask, using the 5 least significant bits of the character as the key. This could be described as a simple Bloom filter.

Note that the above Python code may access s[n], which would result in an IndexError exception. For the CPython implementation, this is not really a problem, since CPython adds trailing NULL entries to both 8-bit and Unicode strings.

Benchmarks

As reported from the NeedForSpeed sprint (May 2006):

These are highly preliminary figures, but after two days of full attention, the Unicode string type is beginning to get pretty fast. The following is total run times for the stringbench benchmark:

str(ms) uni(ms) %       build
-----------------------------
2271.31 3608.32 62.9    2.5a2
2261.85 1187.84 190.4   trunk, wednesday morning
2227.48 1002.61 222.2   trunk, wednesday 15:11 UTC
2247.84  875.13 256.9   trunk, wednesday 16.35 UTC

The above is simply the sum of all benchmarks subtests, so the 4x factor is a bit arbitrary. However, you can tell two things from this: the unicode string type has gotten a lot faster, and on this set of tests, it’s actually twice as fast as the 8-bit string type. Not bad.

Looking at certain subsets are also quite interesting. Here’s the result for the count-related tests (which includes counting 4-letter combinations in a 175k DNA sequence):

str(ms) uni(ms) %       build
-----------------------------
332.44  294.07  113.0   2.5a2
329.90  120.63  273.5   trunk

and here’s the result for the find/index related tests:

str(ms) uni(ms) %       build
-----------------------------
761.40  1857.61 41.0    2.5a2
758.84  70.93   1069.8  trunk

(yes, it’s 26 times faster!)

Time to turn our attention to the 8-bit string type. More later.