Polars: Is there a procedure that ensures zero copy when calling df.to_numpy()?


For some time now I have been failing to get df.to_numpy(allow_copy=False) to succeed. Is there a procedure that transforms any given dataset into a "zero-copy suitable" one?

For list-like values I tried

import numpy as np
import polars as pl

N = 100  # number of rows

data = pl.DataFrame(
    dict(
        points=np.random.sample((N, 3)),
        color=np.random.sample((N, 4)),
    ),
    schema=dict(
        points=pl.Array(pl.Float64, 3),
        color=pl.Array(pl.Float64, 4),
    ),
)

or simply expr.cast(pl.Array(pl.Float32, 4)) as suggested here. It works for one of my datasets, but fails for a different one with a slightly different build.

Calling rechunk(), ensuring there are no null values, and/or specifying order="c" or order="fortran" also seems to have no effect.
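For reference, here is a minimal way to probe whether a given frame converts without copying (a sketch; it assumes a Polars version where to_numpy accepts allow_copy, and the exact exception type may vary between releases):

import polars as pl

df = pl.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

try:
    arr = df.to_numpy(allow_copy=False)  # raises when a copy would be required
    print("zero-copy conversion succeeded")
except RuntimeError as exc:
    print("a copy would be required:", exc)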

This is a generalization of my previous question that was perhaps too specific to get a real answer.


There are 2 answers below.

Answer from ritchie46:

No; that operation will itself copy. numpy matrices are contiguously allocated in a single allocation, whereas Polars allocates (mostly) contiguously per column. That means that if the memory was allocated by Polars, it cannot be transferred zero-copy into a 2D numpy matrix.

Polars columns and Polars Array types can be moved zero-copy to numpy.
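For instance (a minimal sketch; it assumes a recent Polars where Series.to_numpy accepts allow_copy, and a numeric, null-free, single-chunk column):

import polars as pl

s = pl.Series("x", [1.0, 2.0, 3.0])
out = s.to_numpy(allow_copy=False)  # no exception means no copy was made
print(out)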

Moving data from numpy to Polars and back.

A 2D matrix from numpy in Fortran order can be moved zero-copy to Polars and then zero-copy back to numpy. This works because the original numpy allocation is never changed: all Polars columns point into the numpy array's contiguous memory.
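A sketch of that roundtrip (assuming pl.from_numpy borrows F-contiguous input as described; np.shares_memory reports whether the result still points into the original buffer):

import numpy as np
import polars as pl

src = np.asfortranarray(np.random.rand(5, 3))
df = pl.from_numpy(src, schema=["a", "b", "c"])  # columns can view the numpy buffer
out = df.to_numpy()                              # order="fortran" is the default
print(np.shares_memory(src, out))                # True when the roundtrip was zero-copy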

Why only F-order?

Because Polars is a columnar query engine, we only store F-contiguous data. Otherwise we would need to skip rows on every column traversal, which would either mean duplicating all code or adding an indirection to traversal that would be much slower.

Aside from that, you would now have row values in your cache lines, making all columnar algorithms much, much slower.
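You can see that cache effect with numpy alone: summing one column is much faster when the column is contiguous (F order) than when its values are strided across rows (C order). Timings are machine-dependent, so treat this as illustrative:

import numpy as np
from timeit import timeit

c = np.random.rand(10_000, 100)   # C order: row values back to back
f = np.asfortranarray(c)          # F order: column values back to back

print(timeit(lambda: c[:, 0].sum(), number=1_000))  # strided reads
print(timeit(lambda: f[:, 0].sum(), number=1_000))  # contiguous reads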

C-contiguous

In C-contiguous layout, row values sit back to back. If we need a column_value, we must skip n row_x slots, where n is proportional to the number of columns:

[column_value_1, row_x, row_x, ...,  row_x, column_value_2]

F-contiguous

In F-contiguous layout, column slots sit back to back, leading to fast, cache-efficient reads:

[column_value_1, column_value_2, ..., column_value_n]
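Both layouts can be reproduced with numpy's ravel order flag on a small matrix:

import numpy as np

m = np.array([[1, 2], [3, 4], [5, 6]])
print(m.ravel(order="C"))  # [1 2 3 4 5 6] -> row values back to back
print(m.ravel(order="F"))  # [1 3 5 2 4 6] -> column values back to back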

So zero-copy DataFrames over C-contiguous data come with a performance cost for most operations (which are columnar), and might even trigger a copy if the underlying columnar algorithms expect slices with data back to back (which isn't far-fetched).

TL;DR

DataFrame implementations that support zero-copy of C-order data pay a price in cache thrashing, or make implicit copies when the algorithms expect a contiguous slice.

Answer from Dean MacGregor:

Let me expand a bit on ritchie46's answer, in the way that I understand it.

Let's start with this table

1   2   3
4   5   6
7   8   9
a   b   c

(Let's say a, b, c are 10, 11, 12 respectively, so that we're not really mixing numbers and letters.)

Computer memory doesn't have dimensions, so when we say this is columnar storage, it means the computer sees it as data=147a258b369c and offsets=0,4,8, where data holds each column back to back and the offsets tell it where each new column starts for us humans to look at.

In row-oriented storage it would have data=123456789abc and offsets=0,3,6,9.
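As a sketch, both flattenings (with a, b, c as 10, 11, 12) can be built from the table above with numpy:

import numpy as np

table = np.array([[ 1,  2,  3],
                  [ 4,  5,  6],
                  [ 7,  8,  9],
                  [10, 11, 12]])

col_data = table.ravel(order="F")  # 1 4 7 10 2 5 8 11 3 6 9 12
col_offsets = [0, 4, 8]            # where each column starts

row_data = table.ravel(order="C")  # 1 2 3 4 5 6 7 8 9 10 11 12
row_offsets = [0, 3, 6, 9]         # where each row starts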

Why would one framework/library/software use columnar while another uses row based?

Well, think about something like a database of customers, where you learn the customer's first name, last name, and address at the same time, so you store them at once. A transactional database is row-oriented so that when you put that new record in, it goes in together.

Of course, when you want to analyze data it's usually the opposite. For instance, if you want to know how many customers live in each state, the only thing you need is the address column. To analyze that column in row-oriented storage, the engine has to refer to the offsets and pluck out each of those locations from the data. If the data were column-oriented, it could just read one contiguous range rather than plucking at what amounts to random locations.
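A sketch of the difference in access patterns (the flat lists here are purely illustrative):

row_data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]  # row-oriented, 3 columns
col_data = [1, 4, 7, 10, 2, 5, 8, 11, 3, 6, 9, 12]  # column-oriented, 4 rows

# third column from row storage: pluck every 3rd element (strided)
print(row_data[2::3])   # [3, 6, 9, 12]

# third column from column storage: one contiguous range
print(col_data[8:12])   # [3, 6, 9, 12]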

The reason it must be copied is that each library is optimized around the assumption that data is organized contiguously in row or column order (Polars/Arrow can have chunks that are discontiguous, but within each chunk the data is still contiguous). You can't just hand numpy some column-oriented data and tell it to figure it out.