How to lazily read from parquet with awkward2

I'm thinking of migrating from Awkward Array 1.x to 2.x.

In Awkward 1.x, I read Parquet files lazily like this:

import awkward as ak  # Awkward 1.x
import numpy as np

tree = ak.from_parquet(filename, lazy=True)  # no data is read yet
print(np.max(tree["x"]), end=" ")  # column "x" is read only at this point

How can I do the same with Awkward 2.x?

Answered by Jim Pivarski

The one major interface change from Awkward 1.x to Awkward 2.x is that lazy arrays (PartitionedArray of VirtualArray) have been replaced by dask-awkward. The motivation was to give users more control over when the array-reading happens (when you call .compute(), and not before), as well as where it happens (distributed on any cluster that runs Dask jobs).

Here's an example:

>>> import awkward as ak
>>> import dask_awkward as dak
>>> tree = dak.from_parquet("https://pivarski-princeton.s3.amazonaws.com/chicago-taxi.parquet")
>>> tree
dask.awkward<from-parquet, npartitions=1>

This object represents data that have not been read yet. It's a kind of lazy array. It has all of the metadata, such as field names and data types.

>>> ak.fields(tree)
['trip', 'payment', 'company']
>>> ak.fields(tree.trip)
['sec', 'km', 'begin', 'end', 'path']
>>> tree.trip.type.show()
?? * var * {
    sec: ?float32,
    km: ?float32,
    begin: {
        lon: ?float64,
        lat: ?float64,
        time: ?datetime64[ms]
    },
    end: {
        lon: ?float64,
        lat: ?float64,
        time: ?datetime64[ms]
    },
    path: var * {
        londiff: float32,
        latdiff: float32
    }
}

You can perform a lazy computation (all ak.* functions work on dak.Arrays), and it remains lazy, unlike an Awkward 1.x array:

>>> result = ak.max(tree.trip.sec)
>>> result
dask.awkward<max, type=Scalar, dtype=float32>

But when you call .compute(), you get fully evaluated data:

>>> result.compute()
86400.0

See dask.org for distributing and scaling up processes, either on one computer or on a cluster of computers.