Understanding Python's "for" statement
Fredrik Lundh | November 2006 | Originally posted to online.effbot.org
One of the things I noticed when skimming through the various reactions to my recent “with”-article is that some people seem to have a somewhat fuzzy understanding of Python’s other block statement, the good old for-in loop statement. The with statement didn’t introduce code blocks in Python; they’ve always been there. To rectify this, for-in probably deserves its own article, so here we go (but be warned that the following is a bit rough; I reserve the right to tweak it a little over the next few days).
On the surface, Python’s for-in statement is taken right away from Python’s predecessor ABC, where it’s described as:
    FOR name,... IN train: commands
        Take each element of train in turn
In ABC, what’s called statements in Python are known as
commands, and sequences are known as
trains. (The whole
language is like that, by the way; lots of common mechanisms described
using less-common names. Maybe they thought that renaming everything
would make it easier for people to pick up the subtle details of the
language, instead of assuming that everything worked exactly as other
seemingly similar languages, or maybe it only makes sense if you’re
Anyway, to take each element (item) from a train (sequence) in turn, we can simply do (using a pseudo-Python syntax):

    name = train[0]
    do something with name
    name = train[1]
    do something with name
    name = train[2]
    do something with name
    ... etc ...

and keep doing that until we run out of items. When we do, we’ll get an IndexError exception, which tells us that it’s time to stop.
And in its simplest and original form, this is exactly what the for-in statement does; when you write

    for name in train:
        do something with name

the interpreter will simply fetch train[0] and assign it to name, and then execute the code block. It’ll then fetch train[1], train[2], and so on, until it gets an IndexError.
The code inside the for-in loop is executed in the same scope as the surrounding code; in the following example:

    train = 1, 2, 3
    for name in train:
        value = name * 10
        print value

the variables train, name, and value all live in the same namespace.
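A quick way to check the scoping claim is to look at the loop variables after the loop has finished. The following is a minimal sketch, written with the function-call form of print so that it runs on a current Python 3 as well; the two print lines at the end are my addition:

```python
train = 1, 2, 3

for name in train:
    value = name * 10

# The loop body did not create a new scope, so both names are still
# bound here, holding whatever the last pass through the loop left.
print(name)   # 3
print(value)  # 30
```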
This is pretty straightforward, of course, but it immediately gets
a bit more interesting once you realize that you can use custom
trains. Just implement the __getitem__
method, and you can control how the loop behaves. The following code:

    class MyTrain:
        def __getitem__(self, index):
            if not condition:
                raise IndexError("that's enough!")
            value = fetch item identified by index
            return value # hand control back to the block

    for name in MyTrain():
        do something with name

will run the loop as long as the given condition is true, with values
provided by the custom train. In other words, the “do something with name”
part is turned into a block of code that’s being executed under the
control of the custom sequence object. The above is equivalent to:

    index = 0
    while True: # run forever
        if not condition:
            break
        name = fetch item identified by index
        do something with name
        index = index + 1

except that index is a hidden variable, and the controlling code is placed in a separate object.
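To make this concrete, here is a small runnable sketch of a custom train together with a hand-written version of the loop. The names Squares and loop_by_hand are hypothetical, chosen only for illustration, and the helper is my own spelling of the equivalence described above, not the interpreter's actual code. The __getitem__ approach still works in current Python versions:

```python
class Squares:
    """A 'train' that computes its items on demand (hypothetical example)."""

    def __init__(self, limit):
        self.limit = limit

    def __getitem__(self, index):
        # Called with 0, 1, 2, ... by the for-in statement.
        if index >= self.limit:
            raise IndexError("that's enough!")
        return index * index


def loop_by_hand(train):
    """Spell out what for-in does with the __getitem__ protocol."""
    result = []
    index = 0  # the "hidden" variable
    while True:
        try:
            name = train[index]
        except IndexError:
            break  # the train says it's time to stop
        result.append(name)  # stands in for the loop body
        index = index + 1
    return result


collected = []
for name in Squares(5):
    collected.append(name)

print(collected)                 # [0, 1, 4, 9, 16]
print(loop_by_hand(Squares(5)))  # [0, 1, 4, 9, 16]
```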
You can use this mechanism for everything from generating sequence elements on the fly (like xrange):

    class MySequence:
        def __getitem__(self, index):
            if index > 10:
                raise IndexError("that's enough!")
            return index * 10 # returns 0, 10, 20, ..., 100

and fetching data from an external source:

    class MyTable:
        def __getitem__(self, index):
            value = fetch item index from database table
            if value not found:
                raise IndexError("not found")
            return value

or from a stream:

    class MyFileIterator:
        def __getitem__(self, index):
            text = get next line from file
            if end of file:
                raise IndexError("end of file")
            return text

to fetching data from some other source:

    class MyEventSource:
        def __getitem__(self, index):
            event = get next event
            if event == terminate:
                raise IndexError
            return event

    for event in MyEventSource():
        process event

It’s more explicit in the latter examples, but in all these examples, the code in __getitem__ is basically treating the block of code inside the for-in loop as an in-lined callback.
Also note how the last two examples don’t even bother to look at
the index; they just keep
calling the for-in block until
they run out of data. Or, less obvious, until they run out of bits in
the internal index variable.
To deal with this, and also to avoid the problem of objects that look a lot like sequences but don’t support random access, the for-in statement was redesigned in Python 2.2. Instead of using the __getitem__ interface, for-in now starts by looking for an __iter__ hook. If present, this method is called, and the resulting object is then used to fetch items, one by one. This new protocol behaves like this:

    obj = train.__iter__()
    name = obj.next()
    do something with name
    name = obj.next()
    do something with name
    ...

where obj is an internal variable, and the next method indicates end of data by raising the StopIteration exception, instead of IndexError. Using a custom object can look something like:

    class MyTrain:
        def __iter__(self):
            return self
        def next(self):
            if not condition:
                raise StopIteration
            value = calculate next value
            return value # hand control over to the block

    for name in MyTrain():
        do something with name

(Here, the MyTrain object returns itself, which means that the for-in statement will call MyTrain’s own next method to do the actual work. In some cases, it makes more sense to use an independent object for the iteration).
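As a rough sketch of the “independent object” variant, the container below hands out a fresh iterator object from __iter__, so the same container can be looped over more than once. Countdown and CountdownIterator are hypothetical names. Note that Python 3 spells the method __next__ rather than next; the alias at the end of the class is only there so the sketch runs under both spellings:

```python
class Countdown:
    """Iterable container; each for-in loop gets a fresh iterator."""

    def __init__(self, start):
        self.start = start

    def __iter__(self):
        return CountdownIterator(self.start)


class CountdownIterator:
    """Independent object that holds the state of one loop."""

    def __init__(self, current):
        self.current = current

    def __iter__(self):
        return self  # iterators conventionally return themselves

    def __next__(self):
        if self.current <= 0:
            raise StopIteration  # end of data
        value = self.current
        self.current -= 1
        return value  # hand control over to the block

    next = __next__  # Python 2 spelling of the same method


countdown = Countdown(3)
first = [n for n in countdown]
second = [n for n in countdown]  # works again: state lives in the iterator

print(first)   # [3, 2, 1]
print(second)  # [3, 2, 1]
```

Had Countdown returned itself and kept the counter on the container, the second loop would have found it already exhausted.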
Using this mechanism, we can now rewrite the file iterator from above as:

    class MyFileIterator:
        def __iter__(self):
            return self # use myself
        def next(self):
            text = get next line from file
            if end of file:
                raise StopIteration()
            return text

and, with very little work, get an object that doesn’t support normal indexing, and doesn’t break down if used on a file with more than 2 billion lines.
But what about ordinary sequences, you ask? That’s of course easily handled by a wrapper object, that keeps an internal counter, and maps next calls to __getitem__ calls, in exactly the same way as the original for-in statement did. Python provides a standard implementation of such an object, iter, which is used automatically if __iter__ doesn’t exist.
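The wrapper idea can be sketched in a few lines. This is an illustration of the behaviour described above, not the interpreter's actual implementation; SequenceIterator and OldStyleTrain are hypothetical names, and the next alias is there only for the Python 2 spelling of the method:

```python
class SequenceIterator:
    """Sketch of a wrapper that maps next calls to __getitem__ calls."""

    def __init__(self, sequence):
        self.sequence = sequence
        self.index = 0  # the internal counter

    def __iter__(self):
        return self

    def __next__(self):
        try:
            value = self.sequence[self.index]
        except IndexError:
            raise StopIteration  # translate end-of-sequence
        self.index += 1
        return value

    next = __next__  # Python 2 spelling


class OldStyleTrain:
    """Implements only __getitem__ (hypothetical example)."""

    def __getitem__(self, index):
        if index > 3:
            raise IndexError
        return index * 10


by_hand = list(SequenceIterator(OldStyleTrain()))
built_in = list(iter(OldStyleTrain()))  # the built-in wrapper

print(by_hand)   # [0, 10, 20, 30]
print(built_in)  # [0, 10, 20, 30]
```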
This wasn’t very difficult, was it?
Footnote: In Python 2.2 and later, several non-sequence objects have been extended to support the new protocol. For example, you can loop over both text files and dictionaries; the former return lines of text, the latter dictionary keys.

    for line in open("file.txt"):
        do something with line

    for key in my_dict:
        do something with key
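
For a version of the two loops that can be run as-is, here is a minimal sketch for Python 3. It assumes an in-memory io.StringIO object as a stand-in for open("file.txt"), so no file is needed, and the dictionary contents are made up for the example:

```python
import io

# io.StringIO stands in for open("file.txt") so the example needs no file.
text_file = io.StringIO("first line\nsecond line\n")

lines = []
for line in text_file:
    lines.append(line.rstrip("\n"))  # each item is one line of text

my_dict = {"spam": 1, "egg": 2}

keys = []
for key in my_dict:
    keys.append(key)  # each item is a key, not a value

print(lines)         # ['first line', 'second line']
print(sorted(keys))  # ['egg', 'spam']
```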