The most frustrating part about this problem, is that the obvious answer is "fix the source table!" - which unfortunately I cannot do (this is managed and maintained by another team at work who refuses to help).
So I'm looking for a technical solution to doing this without changing the source table.
The situation is this: I have a source table, and I'm trying to write a hive query to create a new table. The query ends up taking many hours to complete, and the reason is that the work gets bottle-necked into a single reducer.
When I follow the source table to its location on hdfs, I notice there are 1009 part files. 1008 of them are 0 bytes, and 1 of them is 400 GB.
This explains why 1 reducer takes so long, because all of the data is contained in a single file.
I have tried to add the following settings in an attempt to split the work over many reducers.
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.smallfiles.avgsize=134217728;
set hive.merge.size.per.task=134217728;
set mapred.max.split.size=134217728;
set mapred.min.split.size=134217728;
set hive.exec.reducers.bytes.per.reducer=134217728;
All attempts end with my new table looking exactly like the source table, with tons of 0 byte files, and a single file with all of the data. I'm able to control the reducers, which controls the total number of files... but I cannot control the data to have the result evenly distributed.
Any ideas on how I can "fix" my resulting table to have evenly distributed files? Bonus on if I can fix this during the query process which would even the load on my reducers and make the query faster.
The source table looks like this:
CREATE TABLE `source_tbl`(
`col1` varchar(16)
, `col2` smallint
, `col3` varchar(5),
... many more cols ...
`col20000` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://cluster/user/hive/warehouse/schema.db/source_tbl'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='1009',
'numRows'='19187489',
'rawDataSize'='2972053294998',
'totalSize'='50796390931',
'transient_lastDdlTime'='1501859524')
My query is this:
create table schema.dest_tbl as select * from schema.source_tbl;