I am using Spark for ETL. Is there a way to generate loading statistics in Apache Spark (or Spark SQL), e.g. the number of records loaded from a text file during the load operation, as ETL tools like DataStage usually provide? Because of Spark's lazy execution model, I know we can get such stats by calling an action on an RDD, which triggers execution. That means we can gather the stats only after loading has finished, whereas we want them while the data is being loaded. The logs generated by Spark are not informative in this context either. Calling such actions during the ETL steps would also be expensive for us, so we were wondering whether Spark offers DMV-like functionality for this purpose. Is there any workaround?
Data Loading statistics in Apache Spark
191 Views Asked by rh979 At
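One possible workaround, sketched here rather than offered as a definitive answer, is to count records with an accumulator inside a transformation, so the tally is produced by the same action that performs the load and no extra `count()` pass is needed. The helper names `count_records` and `run_load` and the paths passed to `run_load` are hypothetical, chosen for illustration; `sc.accumulator`, `add`, and `value` are the standard PySpark accumulator calls.

```python
def count_records(rdd, counter):
    """Return an RDD that bumps `counter` once per record streaming through.

    Nothing executes here: the increments happen only when a later action
    (for example the write step) pulls records through this map.
    """
    def tally(record):
        counter.add(1)
        return record

    return rdd.map(tally)


def run_load(input_path, output_path):
    """Hypothetical ETL step: read a text file, write it out, report the count."""
    # Imported here so count_records stays importable without pyspark installed.
    from pyspark import SparkContext

    sc = SparkContext(appName="load-stats-sketch")
    loaded = sc.accumulator(0)

    lines = count_records(sc.textFile(input_path), loaded)
    # The one action the job needs anyway; counting rides along with it.
    lines.saveAsTextFile(output_path)

    # The value is readable on the driver only. Retried or recomputed tasks
    # can apply their updates again, so treat the number as approximate.
    records = loaded.value
    sc.stop()
    return records
```

In a script submitted with `spark-submit`, `run_load("input.txt", "output_dir")` would return the tally after the write completes. Because accumulator updates made inside transformations can be applied more than once when a task or stage is re-executed, this is better suited to monitoring than to exact reconciliation.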