Is there any way to find the buffer size of a file object


I'm trying to "map" a very large ASCII file. Basically, I read lines until I find a certain tag, and then I want to know the position of that tag so that I can seek to it again later to pull out the associated data.

from itertools import dropwhile
with open(datafile) as fin:
    ifin = dropwhile(lambda x:not x.startswith('Foo'), fin)
    header = next(ifin)
    position = fin.tell()
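
For context, the idea is that the saved position will be reused later to jump straight back to the tag and read whatever follows it, roughly like this (how much data follows the tag is application-specific):

with open(datafile) as fin:
    fin.seek(position)           # jump back to just after the header line
    associated_data = next(fin)  # read the data that follows the tag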

Now this tell doesn't give me the right position. This question has been asked in various forms before. The reason is presumably that Python is buffering the file object, so Python is telling me where its file pointer is, not where my file pointer is. I don't want to turn off this buffering; the performance here is important. However, it would be nice to know how many bytes Python chooses to buffer. In my actual application it doesn't matter, as long as I'm close to the lines which start with Foo; I can drop a few lines here and there. So what I'm actually planning on doing is something like:

position = fin.tell() - buffer_size(fin)

Is there any way to go about finding the buffer size?

1 Answer
To me, it looks like the buffer size is hard-coded in CPython to be 8192. As far as I can tell, there is no way to get this number from the Python interface other than to read a single line when you open the file, call f.tell() to figure out how much data Python actually read, and then seek back to the start of the file before continuing.

from itertools import dropwhile

with open(datafile) as fin:
    next(fin)             # read one line so the readahead buffer gets filled
    bufsize = fin.tell()  # how far the underlying file pointer actually moved
    fin.seek(0)           # rewind before doing the real work

    ifin = dropwhile(lambda x: not x.startswith('Foo'), fin)
    header = next(ifin)
    position = fin.tell()
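
Putting this together with the plan from the question, the later lookup can seek back to the approximate position and rescan for the tag; a rough sketch of that step:

with open(datafile) as fin:
    # the recorded position can overshoot by up to bufsize bytes,
    # so back up and scan forward until the tag reappears
    fin.seek(max(position - bufsize, 0))
    for line in fin:
        if line.startswith('Foo'):
            data = next(fin)  # the line after the tag holds the data
            break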

Of course, this fails in the event that the first line is longer than 8192 bytes, but that's of no real consequence for my application.
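
If that approximation is ever not good enough, one alternative (at the cost of the readahead optimization) is to avoid the line iterator entirely and call readline() in a loop, which keeps tell() accurate; a minimal sketch of that idea, using a hypothetical find_tag helper:

def find_tag(path, tag='Foo'):
    """Return (header_line, byte_offset) of the first line starting with tag."""
    with open(path) as fin:
        while True:
            offset = fin.tell()  # exact position before reading this line
            line = fin.readline()
            if not line:         # reached EOF without finding the tag
                return None, None
            if line.startswith(tag):
                return line, offset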