Iterate pyarrow._flight.FlightStreamReader

704 Views Asked by At

How do I iterate through a reader, assuming its a pyarrow._flight.FlightStreamReader object. Which can be obtained from

reader = client.do_get(flight_info.endpoints[0].ticket, options)

Entire example.py script came from https://github.com/dremio-hub/arrow-flight-client-examples/blob/main/python/example.py

Currently I try reader.read_pandas() so that it will generate a dataframe for the entire Dremio results. Unfortunately, if the query has over 50 million rows or so, it may not fit to the dataframe/or may not have enough memory for it, and my process just gets killed. How do I iterate through the reader object and just get chunks so I can maybe generate dataframe per chunk.

When I use

for chunk in reader.read_chunk():
    print(chunk.to_pandas())

for the first chunk, it will convert/extract only 3968 rows from the results and put it in a dataframe, but for the second chunk its a None object. My example really has a millions rows.

In short how can I iterate through the reader per specified chunk size? And is it possible to print these chunks per rows without converting it to a dataframe?

1

There are 1 best solutions below

0
On

I wrote the following and it works great

while True:
 try:
  batch, buf = reader.read_chunk()
  yield batch
 except StopIteration:
  break

The cunsumer function doing the batch.to_pandas()

The missing part is how to configure the chunk size