I am using PyHive with the SQLAlchemy DB-API (asynchronous). My Hive table has millions of records. If I execute:
SELECT * FROM table
It loads millions of records into memory. Is there a way to limit the size of the query result to a certain size, let's say 5 MB? I would discard any results beyond 5 MB anyway. Millions of records can be 100 MB, but I do not want to load that much data in my Python code.
For example, if:
SELECT * FROM table
returns 1000 rows that equal 5 MB for table, then SQLAlchemy should return me only those 1000 rows.
For another query, for example:
SELECT * FROM table2
here, 100,000 rows equal 5 MB for table2, so SQLAlchemy should return 100,000 rows.
This way I do not run out of memory on my client. These examples are simple, but there may be complex join SQL as well, and I want to limit the query result size based on the number of MB of data.
Is there a way to achieve this with SQLAlchemy, or do I need to modify the Hive & Presto servers to get such behavior?
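To make the intent concrete, here is a minimal client-side sketch of the behavior I am after, using a plain PyHive DB-API cursor. The 5 MB budget, the chunk size, the host name, and the sys.getsizeof-based size estimate are rough placeholders of mine, not anything PyHive or SQLAlchemy provides out of the box:

import sys
from pyhive import hive

MAX_BYTES = 5 * 1024 * 1024  # rough 5 MB budget (my own placeholder)
CHUNK_SIZE = 1000             # rows fetched per round trip (arbitrary)

conn = hive.connect(host="my-hive-host")  # hypothetical host
cursor = conn.cursor()
cursor.execute("SELECT * FROM table")

rows, total_bytes = [], 0
while True:
    chunk = cursor.fetchmany(CHUNK_SIZE)
    if not chunk:
        break
    for row in chunk:
        # very rough per-row size estimate; real row size on the wire differs
        total_bytes += sum(sys.getsizeof(col) for col in row)
        rows.append(row)
        if total_bytes >= MAX_BYTES:
            break
    if total_bytes >= MAX_BYTES:
        break

cursor.close()
conn.close()

This stops fetching once the estimated budget is reached, but it still pulls data chunk by chunk from the client side; I am asking whether SQLAlchemy (or the servers) can enforce such a size limit more directly.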