I have a pyarrow Table, which I create using json.read_json(). The table schema is the following:
pyarrow.Table
_id: string
user_id: string
local_date_str: timestamp[s]
datetime: timestamp[s]
data: struct<aggregatable_quantity: double>
child 0, aggregatable_quantity: double
----
What I want to do is to sum up all the aggregatable_quantity values. To do this, I want to apply the pyarrow.compute.sum function, which takes an array-like argument. My first try was to produce an array-like object from the data column of the table I have, so I did the following:
table = json.read_json("my_table.json")
dat = table.column('data')
my_sum = pc.sum(dat)
because the column() function returns a pyarrow.ChunkedArray object, which is array-like. However, when I run the code, I got the following:
pyarrow.lib.ArrowNotImplementedError: Function 'sum' has no kernel matching input types (struct<aggregatable_quantity: double>)
So, I guess, although dat is array-like, at runtime it amounts to a struct and not an array, hence the exception. Then I looked into it some more and found the following work-around:
dat2 = dat.chunk(0)
datTable = pa.Table.from_struct_array(dat2)
pandArray = datTable.to_pandas()
datArray = pa.Array.from_pandas(pandArray.aggregatable_quantity)
my_sum = pc.sum(datArray)
This works and I get the sum I want. However, it feels like an overkill, having to convert to and from pandas in order to get to a DoubleArray object (the datArray). Is there a better way to do this and reach an array-like object that I can use with pyarrow.compute functions, starting from a pyarrow.Table?
sumdoesn't supports structs, regardless of whether the array is chunked. Thestruct_fieldcompute function can be used to extract theaggregatable_quantityfield.