Reading HDFS small size partitions?

Our data is loaded into HDFS daily, partitioned by date. The issue is that each partition consists of small files, each under 50 MB, so reading the data from all of these partitions to load it into the next table takes hours. How can we address this issue?
I'd suggest running an end-of-day job that coalesces/combines the small files in each partition into significantly larger files before Spark reads them.
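A minimal sketch of such a compaction job in Spark (Scala); the paths, partition layout, target file count, and the assumption that the data is Parquet are all mine, so adapt them to your cluster:

```scala
import org.apache.spark.sql.SparkSession

object CompactDailyPartition {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("compact-daily-partition")
      .getOrCreate()

    // Hypothetical layout: one directory per date partition.
    val date       = args(0) // e.g. "2023-06-01"
    val inputPath  = s"hdfs:///data/events/date=$date"
    val outputPath = s"hdfs:///data/events_compacted/date=$date"

    val df = spark.read.parquet(inputPath)

    // coalesce(4) merges the many small files into ~4 larger ones
    // without a full shuffle; pick the count so each output file
    // lands roughly in the 128 MB - 1 GB range.
    df.coalesce(4)
      .write
      .mode("overwrite")
      .parquet(outputPath)

    spark.stop()
  }
}
```

`coalesce` avoids a full shuffle; if the input files are skewed and you need evenly sized outputs, use `repartition(n)` instead and accept the shuffle cost.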
For further reading, the Cloudera blog post "Partition Management in Hadoop" discusses several techniques for addressing this small-files problem. Select whichever technique discussed there best matches your requirements; a read-side tuning sketch follows below as well. Hope this helps!
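Independent of compaction, Spark can also pack many small files into fewer input splits at read time. This tuning is not one of the blog post's techniques, and the values below are assumptions to adjust, but the configuration keys themselves are standard Spark SQL options:

```scala
// Spark combines small files into one input partition up to
// maxPartitionBytes; openCostInBytes models the per-file open
// overhead. Raising both cuts the task count when scanning
// thousands of sub-50 MB files.
spark.conf.set("spark.sql.files.maxPartitionBytes", 256L * 1024 * 1024) // 256 MB splits
spark.conf.set("spark.sql.files.openCostInBytes", 8L * 1024 * 1024)     // ~8 MB per file open

val df = spark.read.parquet("hdfs:///data/events") // reads all date partitions
```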
Another good option for this use case is open-source Delta Lake; if you are on Databricks, use their Delta Lake for a richer feature set. Example Maven coordinates are shown below.
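As an illustration, these are the coordinates for the Delta Lake release line compatible with Spark 2.4.x (`io.delta:delta-core_2.11:0.6.1`); check the Delta Lake release notes for the version matching your Spark and Scala build:

```scala
// build.sbt -- Delta Lake for Spark 2.4.x
libraryDependencies += "io.delta" %% "delta-core" % "0.6.1"
```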
Using Delta Lake you can insert/update/delete the data as you want, which reduces maintenance steps.
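For instance, a sketch of Delta's DML API (the table path, column names, and predicates here are hypothetical):

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions.{expr, lit}

val table = DeltaTable.forPath(spark, "hdfs:///data/events_delta")

// Delete and update rows in place; Delta rewrites only the
// affected files and records the change in its transaction log.
table.delete("event_date < '2019-01-01'")
table.update(
  expr("status = 'PENDING'"),
  Map("status" -> lit("PROCESSED"))
)
```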
See also: Compacting Small Files in Delta Lakes.
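A minimal sketch of that compaction pattern on a Delta table, reusing the hypothetical path from above; `dataChange = false` marks the rewrite as a file reorganization rather than new data:

```scala
val path = "hdfs:///data/events_delta"

spark.read
  .format("delta")
  .load(path)
  .repartition(4) // merge many small files into ~4 larger ones
  .write
  .format("delta")
  .mode("overwrite")
  .option("dataChange", "false") // compaction-only rewrite
  .save(path)
```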