Question up front:
Is there a pythonic way in the standard library for parsing raw binary files using for ... in ... syntax (i.e., __iter__/__next__) that yields blocks that respect the buffersize parameter, without having to subclass IOBase or its child classes?
Detailed explanation
I'd like to open a raw file for parsing, making use of the for ... in ... syntax, and I'd like that syntax to yield predictably shaped objects. This wasn't happening as expected for a problem I was working on, so I tried the following test (import numpy as np required):
In [271]: with open('tinytest.dat', 'wb') as f:
     ...:     f.write(np.random.randint(0, 256, 16384, dtype=np.uint8).tobytes())
     ...:
In [272]: np.array([len(b) for b in open('tinytest.dat', 'rb', 16)])
Out[272]:
array([ 13, 138, 196, 263, 719, 98, 476, 3, 266, 63, 51,
241, 472, 75, 120, 137, 14, 342, 148, 399, 366, 360,
41, 9, 141, 282, 7, 159, 341, 355, 470, 427, 214,
42, 1095, 84, 284, 366, 117, 187, 188, 54, 611, 246,
743, 194, 11, 38, 196, 1368, 4, 21, 442, 169, 22,
207, 226, 227, 193, 677, 174, 110, 273, 52, 357])
I could not understand why this random behavior was arising, and why it was not respecting the buffersize argument. Using read1 gave the expected number of bytes:
In [273]: with open('tinytest.dat', 'rb', 16) as f:
     ...:     b = f.read1()
     ...:     print(len(b))
     ...:     print(b)
     ...:
16
b'M\xfb\xea\xc0X\xd4U%3\xad\xc9u\n\x0f8}'
And there it is: A newline near the end of the first block.
In [274]: with open('tinytest.dat', 'rb', 2048) as f:
     ...:     print(f.readline())
     ...:
b'M\xfb\xea\xc0X\xd4U%3\xad\xc9u\n'
Sure enough, readline was being called to produce each block of the file, and it was tripping on the newline byte (value 10). I verified this by reading the source; these are the relevant lines in the definition of IOBase:
    def __next__(self):
        line = self.readline()
        if not line:
            raise StopIteration
        return line
So my question is this: is there some more pythonic way to achieve buffersize-respecting raw file behavior that allows for ... in ... syntax, without having to subclass IOBase or its child classes (and thus, not being part of the standard library)? If not, does this unexpected behavior warrant a PEP? (Or does it warrant learning to expect the behavior? :)
This behavior isn't unexpected: it is documented that all objects derived from IOBase iterate over lines. The only thing that changes between binary and text mode is how a line terminator is defined; in binary mode it is always b"\n". This is stated in the io module documentation for IOBase and its readline method.
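As a quick check of the line-based iteration described above, here is a small sketch. The file name demo.dat and the byte layout are made up for illustration; the point is only that the split points follow the b"\n" bytes and do not move when the buffering argument changes.

```python
import os
import tempfile

# 32 bytes with newline bytes (value 10) at offsets 4 and 20; the rest is filler.
data = bytes([1, 2, 3, 4, 10]) + bytes(15) + bytes([10]) + bytes(11)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.dat")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(data)

    # Iterate the same file with two very different buffering values.
    with open(path, "rb", 8) as f:
        lengths_small = [len(line) for line in f]
    with open(path, "rb", 4096) as f:
        lengths_large = [len(line) for line in f]

print(lengths_small)  # [5, 16, 11]
print(lengths_large)  # [5, 16, 11]
```

Both runs split after each newline byte, so the chunk lengths depend on the data rather than on the buffer size.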
The underlying issue is historical: the type system used to be ambiguous about text versus binary data, which was a major motivation for the backwards-incompatible Python 2 -> 3 transition.
I think it would certainly be reasonable to have the iterator protocol respect the buffer size for file objects opened in binary mode in Python 3. Why it was decided to keep the old behavior is something I can only speculate about.
In any case, you should just define your own iterator, which is common practice in Python. Iterators are a basic building block, like built-in types.
You can use the two-argument iter(callable, sentinel) form to construct a very basic wrapper. Of course, you could also just use a generator.
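A minimal sketch of the two-argument iter form could look like the following. The helper name iter_blocks and the block size of 100 are assumptions for illustration, and io.BytesIO stands in for a file opened with open(path, 'rb'). On a buffered regular file, read(n) returns n bytes until the end of the file; on raw or non-blocking streams it may return fewer.

```python
import io
from functools import partial


def iter_blocks(fileobj, blocksize):
    """Hypothetical helper: iterate over reads of up to `blocksize` bytes."""
    # iter(callable, sentinel) keeps calling fileobj.read(blocksize)
    # and stops as soon as a call returns the sentinel b"" (end of file).
    return iter(partial(fileobj.read, blocksize), b"")


# 1024 bytes that include newline bytes, to show they are not special here.
data = bytes(range(256)) * 4

blocks = list(iter_blocks(io.BytesIO(data), 100))
sizes = [len(block) for block in blocks]
print(sizes)  # ten blocks of 100 bytes, then a final block of 24
```

Only the last block is shorter than the requested size, and joining the blocks gives back the original bytes.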
There are tons of ways of approaching this. Again, iterators are a core type for writing idiomatic Python.
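One of those other ways is a generator function. This is a sketch rather than a definitive implementation: the name read_in_blocks and the default block size of 16 are assumptions, and any object with a read(n) method works, which is why io.BytesIO can stand in for a real file here.

```python
import io


def read_in_blocks(fileobj, blocksize=16):
    """Hypothetical helper: yield blocks of up to `blocksize` bytes."""
    while True:
        block = fileobj.read(blocksize)
        if not block:  # b"" signals end of file
            return
        yield block


# 40 bytes, all of them newline bytes, to show that newlines do not matter.
data = b"\n" * 40

sizes = [len(block) for block in read_in_blocks(io.BytesIO(data))]
print(sizes)  # [16, 16, 8]
```

With a real file this would be used as `for block in read_in_blocks(open('tinytest.dat', 'rb'), 16): ...`, which keeps the for ... in ... syntax without subclassing anything.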
Python is a pretty dynamic language, and "duck typing" is the name of the game. Generally, your first instinct shouldn't be "how do I subclass some built-in type to extend its functionality?". That is often possible, but a lot of language features are geared towards not having to do it, and the result is often better expressed that way to begin with, at least to my eyes.