I'm not sure about the concept of memory footprint. When loading a Parquet file of, e.g., 1 GB and creating RDDs out of it in Spark, what would be the memory footprint for each RDD?
RDD Memory footprint in spark
Asked by spark_dream
When you create an RDD out of a parquet file, nothing will be loaded/executed until you run an action (e.g., first, collect) on the RDD.
Your memory footprint will most likely vary over time. Say you have 100 equally sized partitions (10 MB each) and you are running on a cluster with 20 cores. Then at any point in time you only need to have 10 MB x 20 = 200 MB of data in memory.
On top of this, Java objects tend to take more space than the data does on disk, so it's not easy to say exactly how much space your 1 GB file will take in the JVM heap (assuming you load the entire file). It could be 2x, or it could be more.
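The back-of-the-envelope arithmetic in this answer can be written down as a small sketch. The function names are made up for illustration, and the assumptions are the same as above: equally sized partitions, one task per core, and an overhead factor of 2 that is only a guess at how much larger the data becomes as JVM objects.

```python
def in_flight_mb(num_partitions, partition_mb, cores):
    """Rough amount of raw data being processed at one moment.

    Assumes equally sized partitions and one running task per core.
    """
    concurrent_tasks = min(num_partitions, cores)
    return concurrent_tasks * partition_mb


def heap_estimate_mb(data_mb, overhead_factor=2.0):
    """Very rough JVM heap estimate for fully loaded data.

    overhead_factor is an assumption (the "2x or more" guess),
    not a value that Spark reports.
    """
    return data_mb * overhead_factor


# 100 partitions of 10 MB on 20 cores
print(in_flight_mb(100, 10, 20))
# The whole ~1 GB file (100 x 10 MB) held as Java objects, at a guessed 2x
print(heap_estimate_mb(100 * 10))
```

The real factor depends on the schema and the data types, so treat the second number as a lower-bound guess rather than a measurement.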
One trick to test this is to force your RDD to be cached (call cache and then run an action, so that the data is actually materialized). You can then check the Storage tab in the Spark UI to see how much space the cached RDD takes.
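A minimal PySpark sketch of the caching trick might look like the following. The helper name `cache_and_count`, the RDD label, and the file path in the usage comment are placeholders, and the function expects an already created SparkSession to be passed in.

```python
def cache_and_count(spark, parquet_path):
    """Cache the RDD behind a Parquet file and force it to materialize.

    spark: an existing SparkSession.
    parquet_path: path to the Parquet data (placeholder in the usage below).
    """
    rdd = spark.read.parquet(parquet_path).rdd  # lazy: no rows are loaded yet
    rdd.setName("parquet-footprint-test")       # label to look for in the Storage tab
    rdd.cache()                                 # marks the RDD for caching, still lazy
    row_count = rdd.count()                     # action: reads the data and fills the cache
    return rdd, row_count


# Usage (assumes pyspark is installed and the path exists):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("footprint").getOrCreate()
#   rdd, n = cache_and_count(spark, "/path/to/file.parquet")
```

After the count finishes, the Storage tab of the Spark UI (served by the driver, on port 4040 by default) should list the RDD under the name set above, along with its size in memory. Note that PySpark stores cached RDDs in serialized form, so the reported size may differ from what the same data would occupy as deserialized objects in a Scala or Java job.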