Storing ndarrays into Parquet via uber/petastorm?


Is it possible to store N-dimensional arrays in Parquet via uber/petastorm?


Yes. Petastorm provides a custom layer of codecs and a schema extension on top of the standard Apache Parquet format. N-dimensional arrays/tensors are serialized into binary blob fields. From the user's perspective they look like native types, depending on the environment you work with: numpy arrays in pure Python/PySpark, tf.Tensor in TensorFlow, or torch.Tensor in PyTorch.

There are some easy-to-follow examples here: https://github.com/uber/petastorm/tree/master/examples/hello_world/petastorm_dataset