Slow performance reading large Hive table with pyhive in comparison with RJDBC


I'm trying to read a large table (about 16 million rows) from Hive in Python using pyhive, but it takes about 33 minutes. Reading the same table in R with RJDBC takes about 13 minutes. Here is my R code:

library(RJDBC)

driver <- try(JDBC("org.apache.hive.jdbc.HiveDriver", paste0(jar_dir, '/hive-jdbc-3.1.2-standalone.jar')))
con_hive <- RJDBC::dbConnect(driver, "jdbc:hive2://hive_ip:10000/dev_perm")
query <- "SELECT * FROM mi_table WHERE periodo='2020-02-01'"
replica_data <- dbGetQuery(con_hive, query)

And in Python my code is:

from pyhive import hive
import pandas as pd

conn = hive.Connection(host=ip_hive)
cursor = conn.cursor()
cursor.execute("SELECT * FROM mi_table WHERE periodo='2020-02-01'")
results = pd.DataFrame(cursor.fetchall(), columns=[desc[0] for desc in cursor.description])

I already tried several values of cursor.arraysize in Python, but it doesn't improve performance. I also noticed that when I set an arraysize greater than 10000, Hive ignores it and falls back to 10000. The default value is 1000.
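For context, `cursor.arraysize` is the standard DB-API 2.0 attribute that `fetchmany()` uses as its default batch size, so one way to experiment with batch sizes is to drain the cursor with repeated `fetchmany()` calls instead of a single `fetchall()`, building the DataFrame once at the end. Below is a minimal sketch of that pattern; it assumes only the generic DB-API interface, so it works with any conforming cursor (pyhive's included). The helper name and the 10000 default are my own choices, the latter mirroring the cap observed above:

```python
import pandas as pd

def fetch_all_batched(cursor, batch_size=10000):
    """Drain an already-executed DB-API cursor in fixed-size batches
    instead of one giant fetchall(), then build the DataFrame once."""
    cursor.arraysize = batch_size  # fetchmany() uses this as its default size
    rows = []
    while True:
        batch = cursor.fetchmany()
        if not batch:  # empty list means the result set is exhausted
            break
        rows.extend(batch)
    # Column names come from the cursor's result-set metadata
    return pd.DataFrame(rows, columns=[desc[0] for desc in cursor.description])
```

Batching like this mainly controls client memory and round-trip granularity; it won't beat the server-side cap, but it makes the batch size explicit and easy to vary when profiling.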

What can I do to improve my performance when reading Hive tables in Python?
