Using Hive to distribute over Reducers?

602 Views Asked by At

The most frustrating part about this problem, is that the obvious answer is "fix the source table!" - which unfortunately I cannot do (this is managed and maintained by another team at work who refuses to help).

So I'm looking for a technical solution to doing this without changing the source table.

The situation is this: I have a source table, and I'm trying to write a hive query to create a new table. The query ends up taking many hours to complete, and the reason is that the work gets bottle-necked into a single reducer.

When I follow the source table to its location on hdfs, I notice there are 1009 part files. 1008 of them are 0 bytes, and 1 of them is 400 GB.

This explains why 1 reducer takes so long, because all of the data is contained in a single file.

I have tried to add the following settings in an attempt to split the work over many reducers.

set hive.merge.mapfiles=true; 
set hive.merge.mapredfiles=true;
set hive.merge.smallfiles.avgsize=134217728;
set hive.merge.size.per.task=134217728;
set mapred.max.split.size=134217728;
set mapred.min.split.size=134217728;
set hive.exec.reducers.bytes.per.reducer=134217728;

All attempts end with my new table looking exactly like the source table, with tons of 0 byte files, and a single file with all of the data. I'm able to control the reducers, which controls the total number of files... but I cannot control the data to have the result evenly distributed.

Any ideas on how I can "fix" my resulting table to have evenly distributed files? Bonus on if I can fix this during the query process which would even the load on my reducers and make the query faster.

The source table looks like this:

CREATE TABLE `source_tbl`(
 `col1` varchar(16)
, `col2` smallint
, `col3` varchar(5),
... many more cols ...
`col20000` int) 
ROW FORMAT SERDE 
'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 
STORED AS INPUTFORMAT 
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
OUTPUTFORMAT 
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' 
LOCATION 
'hdfs://cluster/user/hive/warehouse/schema.db/source_tbl' 
TBLPROPERTIES ( 
'COLUMN_STATS_ACCURATE'='true', 
'numFiles'='1009', 
'numRows'='19187489', 
'rawDataSize'='2972053294998', 
'totalSize'='50796390931', 
'transient_lastDdlTime'='1501859524') 

My query is this:

create table schema.dest_tbl as select * from schema.source_tbl;
0

There are 0 best solutions below