It seems that I can memory-map the underlying data for a pandas Series by creating a memory-mapped ndarray and using it to initialize the Series.
import numpy as np
import pandas as pd

def assert_readonly(iloc):
    try:
        iloc[0] = 999 # Should be non-editable
        raise Exception("MUST BE READ ONLY (1)")
    except ValueError as e:
        # e.message only exists on Python 2; str(e) works on both
        assert "read-only" in str(e)
# Original ndarray
n = 1000
_arr = np.arange(0, n, dtype=float)
# Convert it to a memmap
mm = np.memmap(filename, mode='w+', shape=_arr.shape, dtype=_arr.dtype)
mm[:] = _arr[:]
del _arr
mm.flush()
mm.flags['WRITEABLE'] = False # Make immutable!
# Wrap as a series
s = pd.Series(mm, name="a")
assert_readonly(s.iloc)
Success! It seems that s is backed by a read-only memory-mapped ndarray.
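One way to double-check this claim, as a minimal sketch: np.shares_memory reports whether two arrays overlap in memory. The temporary file name and variable names below are illustrative assumptions, and copy=False is passed explicitly because newer pandas versions may copy NumPy input by default.

```python
import os
import tempfile

import numpy as np
import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "col_a.dat")  # hypothetical file name
    mm = np.memmap(path, mode="w+", shape=(1000,), dtype=float)
    mm[:] = np.arange(1000, dtype=float)
    mm.flush()
    mm.flags["WRITEABLE"] = False  # make the mapping read-only

    # Ask pandas explicitly not to copy the input array.
    s = pd.Series(mm, name="a", copy=False)

    # True if the Series' values overlap the memmap's buffer.
    series_shares_memory = bool(np.shares_memory(s.to_numpy(), mm))
    print("Series shares memory with the memmap:", series_shares_memory)

    # Drop the references so the mapping is released before cleanup.
    del s, mm
```

If the printed value is False on a given installation, the Series holds a copy and the read-only check above is not testing the mapped file.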
Can I do the same for a DataFrame? The following fails:
df = pd.DataFrame(s, copy=False, columns=['a'])
assert_readonly(df["a"]) # Fails
The following succeeds, but only for one column:
df = pd.DataFrame(mm.reshape(len(mm), 1), columns=['a'], copy=False)
assert_readonly(df["a"]) # Succeeds
... so I can make a DataFrame without copying. However, this only works for one column, and I want many. The methods I've found for combining 1-column DataFrames, such as pd.concat(..., copy=False) and pd.merge(..., copy=False), all result in copies.
I have some thousands of large columns as datafiles, of which I only ever need a few at a time. I was hoping I'd be able to place their mmap'd representations in a DataFrame as above. Is it possible?
The pandas documentation makes it a little difficult to guess what's going on under the hood here, although it does say a DataFrame "Can be thought of as a dict-like container for Series objects.". I'm beginning to think this is no longer the case.
I'd prefer not to need HDF5 to solve this.
OK... after a lot of digging here's what's going on.
pandas keeps a reference to the supplied array for a Series when the copy=False parameter is passed to the constructor, but it does not do so for a DataFrame.
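A minimal sketch of how this difference could be observed, assuming two small in-memory arrays (the names floats, ints, "f" and "i" are made up for illustration). The result for the DataFrame depends on the installed pandas version: newer releases may avoid the copy for dict input when copy=False is given, so the script only reports what it finds rather than asserting a particular outcome.

```python
import numpy as np
import pandas as pd

floats = np.arange(5, dtype=np.float64)
ints = np.arange(5, dtype=np.int64)

# A Series built with copy=False is expected to wrap the input array.
s = pd.Series(floats, copy=False)
series_shares = bool(np.shares_memory(s.to_numpy(), floats))

# A DataFrame built from several arrays of different dtypes.
df = pd.DataFrame({"f": floats, "i": ints}, copy=False)
frame_shares = {
    "f": bool(np.shares_memory(df["f"].to_numpy(), floats)),
    "i": bool(np.shares_memory(df["i"].to_numpy(), ints)),
}

print("pandas", pd.__version__)
print("Series shares memory:", series_shares)
print("DataFrame columns share memory:", frame_shares)
```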
pandas' DataFrame uses the BlockManager class to organize the data internally. Contrary to the docs, DataFrame is NOT a collection of Series but a collection of similarly dtyped matrices. BlockManager groups all the float columns together, all the int columns together, etc., and their memory (from what I can tell) is kept together.

It can do that without copying the memory ONLY if a single ndarray matrix (a single type) is provided. Note that BlockManager (in theory) also supports not copying mixed-type data in its construction, as it may not be necessary to copy this input into same-typed chunks. However, the DataFrame constructor avoids a copy ONLY if a single matrix is the data parameter.

In short, if you have mixed types or multiple arrays as input to the constructor, or provide a dict with a single array, you are out of luck in pandas, and DataFrame's default BlockManager will copy your data.

In any case, one way to work around this is to force BlockManager to not consolidate by type, but to keep each column as a separate 'block'. So, with monkey-patching magic...

It would be better if DataFrame or BlockManager had a consolidate=False option (or assumed this behavior) if copy=False was specified.

To test:
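The single-matrix case can be checked with the stock, unpatched constructor. The following is a minimal sketch under stated assumptions: all columns share one dtype and live in one 2-D memmap, and the file name and column names are hypothetical. It does not reproduce the monkey patch; it only checks whether each column of the resulting DataFrame still overlaps the mapped buffer.

```python
import os
import tempfile

import numpy as np
import pandas as pd

names = ["a", "b", "c"]  # hypothetical column names
n_rows = 1000

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "matrix.dat")  # hypothetical file name
    mm = np.memmap(path, mode="w+", dtype=np.float64,
                   shape=(n_rows, len(names)))
    mm[:] = np.arange(mm.size, dtype=np.float64).reshape(mm.shape)
    mm.flush()
    mm.flags["WRITEABLE"] = False  # make the mapping read-only

    # One matrix, one dtype: the case where no copy should be needed.
    df = pd.DataFrame(mm, columns=names, copy=False)

    # Record, per column, whether its values overlap the memmap.
    columns_share_memory = {}
    for name in names:
        col = df[name].to_numpy()
        columns_share_memory[name] = bool(np.shares_memory(col, mm))
    print(columns_share_memory)

    # Drop the references so the mapping is released before cleanup.
    del col, df, mm
```

This only helps when every column has the same dtype and is stored in the same file, which is not the thousands-of-separate-files layout described in the question.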
It seems a little questionable to me whether there are really practical benefits to BlockManager requiring similarly typed data to be kept together: most of the operations in pandas are label-row-wise or per column. This follows from a DataFrame being a structure of heterogeneous columns that are usually only associated by their index. Though feasibly they're keeping one index per 'block', gaining benefit if the index keeps offsets into the block (if this were the case, then they should group by sizeof(dtype), which I don't think is the case). Ho hum...
There was some discussion about a PR to provide a non-copying constructor, which was abandoned.
It looks like there are sensible plans to phase out BlockManager, so your mileage may vary.

Also see Pandas under the hood, which helped me a lot.