We have Spark DataFrames partitioned on multiple columns. For example, we have a partner column that can be Google, Facebook, or Bing, and a channel column that can be PLA or Text. We would like to run anomaly detection on Google-PLA, Google-Text, Facebook-Text, etc. separately, because they follow different patterns.

So far I've figured out that I can configure AnomalyCheckConfig with a different filter description for each combination and use the same filter when checking the result. But I first need to filter the data for each partition combination and then run the anomaly test with its associated filter, one by one in serial. Is there a way to run them in parallel? Can I call addAnomalyCheck() multiple times with different AnomalyCheckConfigs on the whole DataFrame and get the verification result in one run?
Is it possible to run Deequ anomaly detection on multiple partitions separately in parallel?
1.1k Views Asked by Sifang At
If you have the partitioning column in your Spark DataFrame, you can instantiate multiple anomaly checks in a single VerificationSuite by specifying where conditions for the quality metric you want to run anomaly detection on. Assuming you want to compute the Completeness of a column c1, you can control for the partition with where = Some("partition = 'GOOGL'"), for example.
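A minimal sketch of that idea, assuming the partner and channel columns from the question, the column c1 from above, and a metrics repository that already holds earlier runs to compare against. The object name, the filterFor helper, the list of combinations, and the 0.5 threshold are all hypothetical and only serve the illustration; the exact Deequ signatures may differ between versions.

```scala
import com.amazon.deequ.{AnomalyCheckConfig, VerificationSuite}
import com.amazon.deequ.analyzers.Completeness
import com.amazon.deequ.anomalydetection.RelativeRateOfChangeStrategy
import com.amazon.deequ.checks.CheckLevel
import com.amazon.deequ.repository.{MetricsRepository, ResultKey}
import org.apache.spark.sql.DataFrame

object PartitionedAnomalyChecks {

  // Hypothetical (partner, channel) combinations, taken from the question.
  val combos: Seq[(String, String)] = Seq(
    ("Google", "PLA"), ("Google", "Text"), ("Facebook", "Text"))

  // Builds the SQL filter that restricts a metric to one combination.
  def filterFor(partner: String, channel: String): String =
    s"partner = '$partner' AND channel = '$channel'"

  // Registers one anomaly check per combination on the unfiltered DataFrame
  // and evaluates all of them in a single verification run.
  def run(df: DataFrame, repository: MetricsRepository, dataSetDate: Long) = {
    val base = VerificationSuite()
      .onData(df)
      .useRepository(repository)                  // history to compare against
      .saveOrAppendResult(ResultKey(dataSetDate)) // store this run's metrics

    val withChecks = combos.foldLeft(base) { case (builder, (partner, channel)) =>
      builder.addAnomalyCheck(
        // Assumed strategy and threshold; pick whatever fits the metric.
        RelativeRateOfChangeStrategy(maxRateDecrease = Some(0.5)),
        // The where clause makes this a separate metric per combination.
        Completeness("c1", where = Some(filterFor(partner, channel))),
        Some(AnomalyCheckConfig(
          CheckLevel.Warning, s"Completeness of c1 for $partner-$channel")))
    }

    withChecks.run()
  }
}
```

Each check carries its own description, so the entries in the returned result's checkResults can be matched back to a partner-channel combination by that description. Because every Completeness analyzer has a different where clause, its metric is stored and compared against its own history in the repository.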