How to transform pyarrow Table in order to use it with pyarrow.compute methods


I have a pyarrow Table, which I create using json.read_json(). The table schema is the following:

pyarrow.Table
_id: string
user_id: string
local_date_str: timestamp[s]
datetime: timestamp[s]
data: struct<aggregatable_quantity: double>
  child 0, aggregatable_quantity: double
----

What I want to do is to sum up all the aggregatable_quantity values. To do this, I want to apply the pyarrow.compute.sum function, which takes an array-like argument. My first try was to produce an array-like object from the data column of the table I have, so I did the following:

import pyarrow.compute as pc
from pyarrow import json

table = json.read_json("my_table.json")
dat = table.column('data')
my_sum = pc.sum(dat)

because the column() function returns a pyarrow.ChunkedArray object, which is array-like. However, when I ran the code, I got the following:

pyarrow.lib.ArrowNotImplementedError: Function 'sum' has no kernel matching input types (struct<aggregatable_quantity: double>)

So, I guess, although dat is array-like, at runtime it amounts to a struct and not an array, hence the exception. Then I looked into it some more and found the following work-around:

dat2 = dat.chunk(0)
datTable = pa.Table.from_struct_array(dat2)
pandArray = datTable.to_pandas()
datArray = pa.Array.from_pandas(pandArray.aggregatable_quantity)
my_sum = pc.sum(datArray)

This works and I get the sum I want. However, it feels like overkill to convert to and from pandas just to get a DoubleArray object (the datArray). Is there a better way to get from a pyarrow.Table to an array-like object that I can use with pyarrow.compute functions?


There are 2 best solutions below

A. Coady On BEST ANSWER

sum doesn't support structs, regardless of whether the array is chunked. The struct_field compute function can be used to extract the aggregatable_quantity field:

pc.sum(pc.struct_field(table['data'], 'aggregatable_quantity'))
amoeba On

I think you need to use combine_chunks() to convert to a StructArray followed by field() to get the underlying Array before passing off to pc.sum. Something like this should work:

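import pyarrow.compute as pc
from pyarrow import json

table = json.read_json("my_table.json")
dat = table.column('data')
# combine_chunks() gives a single StructArray; field(0) is its first child array
arr = dat.combine_chunks().field(0)
pc.sum(arr)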

In case it helps, here's a reproducible example:

import pyarrow as pa
import pyarrow.compute as pc

tbl = pa.table({'a': pa.array([{'a': 1}, {'a': 2}, {'a': 3}], pa.struct([('a', pa.int64())]))})
col = tbl.column("a")
pc.sum(col.combine_chunks().field(0))

This produces <pyarrow.Int64Scalar: 6>.