Out of memory when loading into Hive partitioned Parquet table


To state upfront, this is not a production Hadoop environment. It is a single-node environment where we are testing our workflow.

Problem: The Hive query below fails when attempting to load a Parquet table with a single partition of data. The source table/partition is only a 142MB file. The insert statement generates a single mapper job which eventually fails with a Java out-of-memory error. It doesn't seem like such a small test case should be creating this much overhead?
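For reference, the partition size can be confirmed from the Hive CLI with something like the following (the warehouse path is an assumption based on the default layout, not taken from our environment):

dfs -du -h /user/hive/warehouse/web.db/stg_traffic_pageviews/ds=2016-12-28;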

We only have this issue when inserting into Parquet. Inserting into Avro, ORC, or text tables does not have issues, nor are there any issues querying the source data.

I've tried using the following commands, but they only adjust the mappers used in the initial select. The insert stage still uses 1 mapper.

set mapreduce.input.fileinputformat.split.minsize
set mapreduce.input.fileinputformat.split.maxsize
set mapreduce.job.maps
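For context, these were applied as session settings before running the insert; the values below are illustrative only, not the ones from the actual test:

set mapreduce.input.fileinputformat.split.minsize=16777216;  -- 16 MB, illustrative value
set mapreduce.input.fileinputformat.split.maxsize=67108864;  -- 64 MB, illustrative value
set mapreduce.job.maps=4;                                    -- illustrative value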

I am on CDH 5.8 / Hadoop 2.6. The VM instance has 4 cores / 24GB RAM allocated to it.

DROP TABLE IF EXISTS web.traffic_pageviews;

CREATE TABLE web.traffic_pageviews(
    SESSION_ID STRING,
    COOKIE_ID STRING,
    TS TIMESTAMP,
    PAGE STRING,
    PAGE_URL_BASE STRING,
    PAGE_URL_QUERY STRING,
    PAGE_REFERRAL_URL_BASE STRING,
    PAGE_REFERRAL_URL_QUERY STRING)
    PARTITIONED BY (DS STRING)
    STORED AS PARQUET;

INSERT OVERWRITE TABLE web.traffic_pageviews PARTITION(ds='2016-12-28')
select
    session_id,
    cookie_id,
    ts,
    page,
    SPLIT(PAGE_URL,'\\?')[0] PAGE_URL_BASE,
    SPLIT(PAGE_URL,'\\?')[1] PAGE_URL_QUERY,
    SPLIT(PAGE_REFERRAL_URL,'\\?')[0] PAGE_REFERRAL_URL_BASE,
    SPLIT(PAGE_REFERRAL_URL,'\\?')[1] PAGE_REFERRAL_URL_QUERY
from    
    web.stg_traffic_pageviews
where
    ds='2016-12-28';

The error output looks as follows. I feel like we are doing something basic wrong and shouldn't need to adjust Java memory allotments? (For completeness, a sketch of what that adjustment would look like follows the log.)

2017-01-03 07:11:02,053 INFO [main] org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper: real writer: parquet.hadoop.ParquetRecordWriter@755cce4b
2017-01-03 07:11:02,057 INFO [main] org.apache.hadoop.hive.ql.exec.FileSinkOperator: FS[1]: records written - 1
2017-01-03 07:11:02,062 INFO [main] org.apache.hadoop.hive.ql.exec.MapOperator: MAP[2]: records read - 1
2017-01-03 07:11:02,064 INFO [main] org.apache.hadoop.hive.ql.exec.FileSinkOperator: FS[1]: records written - 10
2017-01-03 07:11:02,064 INFO [main] org.apache.hadoop.hive.ql.exec.MapOperator: MAP[2]: records read - 10
2017-01-03 07:11:02,082 INFO [main] org.apache.hadoop.hive.ql.exec.FileSinkOperator: FS[1]: records written - 100
2017-01-03 07:11:02,082 INFO [main] org.apache.hadoop.hive.ql.exec.MapOperator: MAP[2]: records read - 100
2017-01-03 07:11:02,356 INFO [main] org.apache.hadoop.hive.ql.exec.FileSinkOperator: FS[1]: records written - 1000
2017-01-03 07:11:02,356 INFO [main] org.apache.hadoop.hive.ql.exec.MapOperator: MAP[2]: records read - 1000
2017-01-03 07:11:03,775 INFO [main] org.apache.hadoop.hive.ql.exec.FileSinkOperator: FS[1]: records written - 10000
2017-01-03 07:11:03,775 INFO [main] org.apache.hadoop.hive.ql.exec.MapOperator: MAP[2]: records read - 10000
2017-01-03 07:12:03,679 FATAL [LeaseRenewer:[email protected]:8020] org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[LeaseRenewer:[email protected]:8020,5,main] threw an Error.  Shutting down now...
java.lang.OutOfMemoryError: Java heap space
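The adjustment we are trying to avoid would look something like the following; the values are assumptions sized for a 24GB single-node VM, not settings we have verified:

set mapreduce.map.memory.mb=4096;       -- assumed container size, not verified
set mapreduce.map.java.opts=-Xmx3276m;  -- heap at roughly 80% of the container, assumed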

1 Answer

The issue resolved itself after specifying compression on the table. Specifically:

CREATE TABLE web.traffic_pageviews(
    ...
    )
    PARTITIONED BY (DS STRING)
    STORED AS PARQUET
    TBLPROPERTIES ("parquet.compression"="SNAPPY");
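For what it's worth, the same property can presumably also be set at the session level before the insert rather than in TBLPROPERTIES; this is an assumption based on how the table property is named, not something verified here:

set parquet.compression=SNAPPY;  -- assumed session-level equivalent of the TBLPROPERTIES setting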

Though it's the answer, I don't understand why it worked. If anyone has insight, it would be appreciated.