How to transform pyarrow Table in order to use it with pyarrow.compute methods

Question

How to transform pyarrow Table in order to use it with pyarrow.compute methods

56 Views Asked by zerzevul At 14 March 2024 at 17:00

I have a pyarrow Table, which I create using json.read_json(). The table schema is the following:

pyarrow.Table
_id: string
user_id: string
local_date_str: timestamp[s]
datetime: timestamp[s]
data: struct<aggregatable_quantity: double>
  child 0, aggregatable_quantity: double
----

What I want to do is to sum up all the aggregatable_quantity values. To do this, I want to apply the pyarrow.compute.sum function, which takes an array-like argument. My first try was to produce an array-like object from the data column of the table I have, so I did the following:

table = json.read_json("my_table.json")
dat = table.column('data')
my_sum = pc.sum(dat)

because the column() function returns a pyarrow.ChunkedArray object, which is array-like. However, when I run the code, I got the following:

pyarrow.lib.ArrowNotImplementedError: Function 'sum' has no kernel matching input types (struct<aggregatable_quantity: double>)

So, I guess, although dat is array-like, at runtime it amounts to a struct and not an array, hence the exception. Then I looked into it some more and found the following work-around:

dat2 = dat.chunk(0)
datTable = pa.Table.from_struct_array(dat2)
pandArray = datTable.to_pandas()
datArray = pa.Array.from_pandas(pandArray.aggregatable_quantity)
my_sum = pc.sum(datArray)

This works and I get the sum I want. However, it feels like an overkill, having to convert to and from pandas in order to get to a DoubleArray object (the datArray). Is there a better way to do this and reach an array-like object that I can use with pyarrow.compute functions, starting from a pyarrow.Table?

Original Q&A

There are 2 best solutions below

amoeba On 14 March 2024 at 17:29

I think you need to use combine_chunks() to convert to a StructArray followed by field() to get the underlying Array before passing off to pc.sum. Something like this should work:

table = json.read_json("my_table.json")
dat = table.column('data')
arr = dat.combine_chunks().field(0)
pc.sum(arr)

In case it helps, here's a reproducible example:

import pyarrow as pa
import pyarrow.compute as pc

tbl = pa.table({'a':pa.array([{'a': 1}, {'a': 2}, {'a': 3}], pa.struct(fields))})
col = tbl.column("a")
pc.sum(col.combine_chunks().field(0))

This produces <pyarrow.Int64Scalar: 6>.

**A. Coady** · Accepted Answer · 2024-03-14T20:39:24.113000

A. Coady On 14 March 2024 at 20:39 BEST ANSWER

sum doesn't supports structs, regardless of whether the array is chunked. The struct_field compute function can be used to extract the aggregatable_quantity field.

pc.sum(pc.struct_field(table['data'], 'aggregatable_quantity'))

How to transform pyarrow Table in order to use it with pyarrow.compute methods

There are 2 best solutions below

Related Questions in PYARROW

Trending Questions

Popular # Hahtags

Popular Questions