Multiple compression on same hive table


I have a Hive table partitioned by year/month, containing at least 7 years of data. What I want to do is compress the latest data (say, up to 1 year old) with Snappy, but the older data with a better compression technique like gzip. How can I do this in Hive?

You can overwrite different partitions with different compression settings: configure the compression codec, then INSERT OVERWRITE the partitions you want compressed with that codec.

For Snappy:

--enable output compression, then pick the codec
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

Overwrite partitions using the Snappy codec:

--enable dynamic partitions
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

--use a WHERE condition to select the new data only
insert overwrite table table_name partition(partition_key)
select * from table_name where partition_key <your condition here> ;
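For instance, if the table were partitioned by hypothetical year and month columns, a Snappy pass over the recent partitions might look like this (table name, column names, and the cutoff are illustrative, not from the question):

```sql
-- Illustrative sketch: assumes partition columns year/month; adjust to your schema
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- select * returns the partition columns last, which is what
-- dynamic partitioning expects when reading from the same table
insert overwrite table my_table partition(year, month)
select * from my_table
where year >= 2023;  -- the "recent" partitions only
```

Only the partitions matched by the WHERE clause are rewritten; the rest of the table is left untouched.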

For gzip, use GzipCodec:

--enable output compression, then pick the codec
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.GzipCodec;

Overwrite partitions using the Gzip codec:

--enable dynamic partitions
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

--use a WHERE condition to select the OLD data only
insert overwrite table table_name partition(partition_key)
select * from table_name where partition_key <your condition here> ;
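The gzip pass is symmetric: same settings pattern, inverted condition. Again assuming hypothetical year/month partition columns and an illustrative cutoff:

```sql
-- Illustrative sketch: rewrite the older partitions with gzip
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- complement of the condition used for the recent (Snappy) partitions,
-- so every partition ends up compressed exactly once
insert overwrite table my_table partition(year, month)
select * from my_table
where year < 2023;  -- the older partitions only
```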

In this way you can use different compression codecs for different partitions, and you do not need to specify the codec when querying the table: Hive automatically recognizes which codec to use when decompressing each file.

Note that this does not apply to self-contained file formats like ORC or Parquet, which manage their own compression through table properties, for example orc.compress=SNAPPY.
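As a hypothetical sketch of the ORC case (table and column names are made up), compression there is declared as a table property rather than through job-level codec settings:

```sql
-- Hypothetical ORC table: compression is a table property, not a job setting
CREATE TABLE sales_orc (
  id BIGINT,
  amount DOUBLE
)
PARTITIONED BY (year INT, month INT)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');

-- the property can be changed later; it affects newly written files only,
-- existing partitions keep the codec they were written with
ALTER TABLE sales_orc SET TBLPROPERTIES ('orc.compress'='ZLIB');
```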