Reading HDFS small size partitions?

Our data is loaded into HDFS daily, partitioned by date. The issue is that each partition consists of small files, each under 50 MB, so reading the data from all of these partitions to load it into the next table takes hours. How can we address this issue?
I'd suggest running an end-of-day job that coalesces/combines the small files in each partition into significantly larger files before Spark reads them.
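A minimal sketch of such a compaction job in Spark (Scala); the paths, partition layout, target file count, and the assumption that the data is Parquet are all mine, so adapt them to your cluster:

```scala
import org.apache.spark.sql.SparkSession

object CompactDailyPartition {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("compact-daily-partition")
      .getOrCreate()

    // Hypothetical layout: one directory per date partition.
    val date       = args(0) // e.g. "2023-06-01"
    val inputPath  = s"hdfs:///data/events/date=$date"
    val outputPath = s"hdfs:///data/events_compacted/date=$date"

    val df = spark.read.parquet(inputPath)

    // coalesce(4) merges the many small files into ~4 larger ones
    // without a full shuffle; pick the count so each output file
    // lands roughly in the 128 MB - 1 GB range.
    df.coalesce(4)
      .write
      .mode("overwrite")
      .parquet(outputPath)

    spark.stop()
  }
}
```

`coalesce` avoids a full shuffle; if the input files are skewed and you need evenly sized outputs, use `repartition(n)` instead and accept the shuffle cost.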
For further reading, the Cloudera blog post "Partition Management in Hadoop" discusses several techniques for addressing this small-files problem. Select whichever technique discussed there best matches your requirements; a read-side tuning sketch follows below as well. Hope this helps!
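Independent of compaction, Spark can also pack many small files into fewer input splits at read time. This tuning is not one of the blog post's techniques, and the values below are assumptions to adjust, but the configuration keys themselves are standard Spark SQL options:

```scala
// Spark combines small files into one input partition up to
// maxPartitionBytes; openCostInBytes models the per-file open
// overhead. Raising both cuts the task count when scanning
// thousands of sub-50 MB files.
spark.conf.set("spark.sql.files.maxPartitionBytes", 256L * 1024 * 1024) // 256 MB splits
spark.conf.set("spark.sql.files.openCostInBytes", 8L * 1024 * 1024)     // ~8 MB per file open

val df = spark.read.parquet("hdfs:///data/events") // reads all date partitions
```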
Another good option for this use case is open-source Delta Lake; if you are on Databricks, use their Delta Lake for a richer feature set. Example Maven coordinates are shown below.
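As an illustration, these are the coordinates for the Delta Lake release line compatible with Spark 2.4.x (`io.delta:delta-core_2.11:0.6.1`); check the Delta Lake release notes for the version matching your Spark and Scala build:

```scala
// build.sbt -- Delta Lake for Spark 2.4.x
libraryDependencies += "io.delta" %% "delta-core" % "0.6.1"
```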
Using Delta Lake you can insert/update/delete the data as you want, which reduces maintenance steps.
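For instance, a sketch of Delta's DML API (the table path, column names, and predicates here are hypothetical):

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions.{expr, lit}

val table = DeltaTable.forPath(spark, "hdfs:///data/events_delta")

// Delete and update rows in place; Delta rewrites only the
// affected files and records the change in its transaction log.
table.delete("event_date < '2019-01-01'")
table.update(
  expr("status = 'PENDING'"),
  Map("status" -> lit("PROCESSED"))
)
```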
See also: Compacting Small Files in Delta Lakes.
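A minimal sketch of that compaction pattern on a Delta table, reusing the hypothetical path from above; `dataChange = false` marks the rewrite as a file reorganization rather than new data:

```scala
val path = "hdfs:///data/events_delta"

spark.read
  .format("delta")
  .load(path)
  .repartition(4) // merge many small files into ~4 larger ones
  .write
  .format("delta")
  .mode("overwrite")
  .option("dataChange", "false") // compaction-only rewrite
  .save(path)
```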