I have an audit table in Databricks Delta Lake with four fields: id, task_name, start_time, and end_time. The purpose of this table is to capture the start and end times of each job. However, I am facing concurrency issues when running five notebooks in parallel, which results in conflicts during inserts and updates. To address the update concurrency problem, I have partitioned the audit table by the task_name field, though I have yet to test this. I am now having difficulties with concurrent row insertion. I am looking for concurrency-safe logic for generating ID values without relying on the Delta table's identity property, which has caused problems in this table. I would greatly appreciate any suggestions.
Concurrency issue in Databricks Delta lake
Asked by SK ASIF ALI
Even if you do the partitioning, you still need a condition on the specific partition value, not only on source.partition = dest.partition. It should be source.partition = dest.partition AND dest.partition = 'job_name'. This is demonstrated in the Delta Lake documentation. But this will generate quite a lot of partitions with small files, which will harm performance when you access your data.

But you can avoid conflicts in the Delta table if you switch to an append-only solution, where you append starts and stops as individual rows and then have a view on top of that table to find the latest status. Something like this:
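A minimal sketch of that append-only pattern, assuming Python's built-in sqlite3 as a local stand-in for the Delta table so it can be run anywhere. The table name audit_events, the event_type column, the view name audit_status, and the helper log_event are all hypothetical names chosen for illustration. Generating the id with uuid4 is also an assumption on my side: it avoids any shared counter, so parallel writers never need to coordinate, but the ids are not sequential.

```python
import sqlite3
import uuid
from datetime import datetime, timezone

# sqlite3 is only a stand-in here; on Databricks the same shape would be
# a Delta table that notebooks append to, plus a view defined on top of it.
conn = sqlite3.connect(":memory:")

conn.execute("""
    CREATE TABLE audit_events (
        id         TEXT PRIMARY KEY,   -- uuid4 string, no identity column
        task_name  TEXT NOT NULL,
        event_type TEXT NOT NULL CHECK (event_type IN ('START', 'END')),
        event_time TEXT NOT NULL       -- ISO-8601 text, sorts chronologically
    )
""")

# The view collapses the event rows back into one row per task with the
# latest start and latest end, so readers still see start_time / end_time.
conn.execute("""
    CREATE VIEW audit_status AS
    SELECT task_name,
           MAX(CASE WHEN event_type = 'START' THEN event_time END) AS start_time,
           MAX(CASE WHEN event_type = 'END'   THEN event_time END) AS end_time
    FROM audit_events
    GROUP BY task_name
""")


def log_event(conn, task_name, event_type, event_time=None):
    """Append one START or END row; existing rows are never updated."""
    event_id = str(uuid.uuid4())
    if event_time is None:
        event_time = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "INSERT INTO audit_events (id, task_name, event_type, event_time) "
        "VALUES (?, ?, ?, ?)",
        (event_id, task_name, event_type, event_time),
    )
    return event_id


# Each notebook only ever appends; fixed timestamps keep the demo repeatable.
ids = [
    log_event(conn, "load_orders", "START", "2024-01-01T10:00:00"),
    log_event(conn, "load_customers", "START", "2024-01-01T10:01:00"),
    log_event(conn, "load_orders", "END", "2024-01-01T10:05:00"),
]

rows = conn.execute(
    "SELECT task_name, start_time, end_time FROM audit_status ORDER BY task_name"
).fetchall()
for row in rows:
    print(row)
```

A task that has started but not yet finished shows up in the view with an empty end_time. If the same task runs more than once, this view reports only the latest start and the latest end per task_name, so you would probably want an extra run identifier column if you need to pair each start with its own end.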