I am migrating my ETL code to Python. I was using pyhs2, but I am going to switch to pyhive because it is actively supported and maintained, whereas no one has taken ownership of pyhs2. My question is how to structure the fetchmany loop to iterate over the dataset with pyhive.
Here is how I did it using pyhs2:
    while hive_cur.hasMoreRows:
        hive_stg_result = hive_cur.fetchmany(size=200000)
        hive_stg_df = pd.DataFrame(hive_stg_result)
        hive_stg_df[27] = etl_load_key
        if len(hive_stg_df) == 0:
            call("rm -f /tmp/{0} ".format(filename), shell=True)
            print ("No data delta")
        else:
            print (str(len(hive_stg_df)) + " delta records identified")
            for i, row in hive_stg_df.iterrows():
With pyhive I tried fetchmany(size=100000), but it fails when it returns an empty set:
hive_stg_result = pyhive_cur.fetchmany(size=100000)
hive_stg_df = pd.DataFrame(hive_stg_result)
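
For reference, here is a minimal sketch of the loop shape I am after. It assumes the pyhive cursor follows the Python DB-API (PEP 249) convention that fetchmany returns an empty sequence once no rows remain, so the loop can stop on an empty batch instead of checking hasMoreRows. The helper name iter_batches is hypothetical, and an in-memory sqlite3 cursor stands in for pyhive_cur so the snippet runs on its own:

```python
import sqlite3


def iter_batches(cursor, size):
    """Yield non-empty lists of rows until the cursor is exhausted.

    Hypothetical helper: relies only on the DB-API fetchmany() contract,
    which returns an empty sequence when no more rows are available.
    """
    while True:
        rows = cursor.fetchmany(size)
        if not rows:  # empty batch: nothing left to fetch
            break
        yield rows


# Stand-in for a pyhive cursor (assumption: same DB-API fetchmany behavior).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE stg (id INTEGER)")
cur.executemany("INSERT INTO stg VALUES (?)", [(i,) for i in range(5)])
cur.execute("SELECT id FROM stg ORDER BY id")

# Each batch is a plain list of row tuples and is never empty.
batch_sizes = [len(batch) for batch in iter_batches(cur, 2)]
print(batch_sizes)  # [2, 2, 1]
```

Inside the loop each batch could be passed to pd.DataFrame(batch) as before. Because empty batches are never yielded, the "No data delta" case would have to be detected separately, for example by checking whether the loop body ever ran.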