Context:
- I am an information architect (not a data engineer, was once a Unix and Oracle developer), so my technical knowledge in Azure is limited to browsing Microsoft documentation.
- The context of this problem is ingesting data from a constantly growing CSV file, in Azure ADLS into an Azure SQL MI database.
- I am designing an Azure data platform that includes a SQL data warehouse with the first source system being a Dynamics 365 application.
- The data warehouse is following Data Vault 2.0 patterns. This is well suited to the transaction log nature of the CSV files.
- This platform is in early development - not in production.
- The CSV files are created and updated (append mode) by an Azure Synapse Link that is exporting dataverse write operations on selected dataverse entities to our ADLS storage account. This service is configured in append mode, so all dataverse write operations (create, update and delate) produce an append action to the entities corresponding CSV file. Each CSV file is essentially a transaction log of the corresponding dataverse entity
- Synapse Link operates in an event based fashion - creating a records in dataverse triggers a CSV append action. Latency is typically a few seconds. There aren't any SLAs (promises), and latency can be several minutes if the API caps are breached.
- The CSV is partitioned Annually. This means the a new CSV file is created at the start of each year and continues to grow throughout the year.
- We are currently trialling ADF as the means of extracting records from the CSV for loading into the data warehouse. We are not wedded to ADF and can consider changing horses.
Request:
I'm searching for an event based solution for ingesting that monitors a source CSV file for new records (appended to the end of the file) and extracts only those new records from the CSV file and then processes each record in sequence which result in one or more SQL insert operations for each new CSV record. If I was back in my old Unix days, I would build a process around the "tail -f" command as the start of the pipeline with the next step an ETL process that processed each record served by the tail command. But I can't figure out how to do this in Azure.
This process will be the pattern for many more similar ingestion processes - there could be approximately one thousand CSV files that need to be processed in this event based - near real time process. I assume one process per CSV file.
Some nonfunctional requirements are speed and efficiency.
- My goal is for an event based solution (low latency = speed),
- that doesn't need to read the entire file every 5 minutes to see if there are changes. This is an inefficient (micro) batch process that will be horribly inefficient (read: expensive - 15,000x redundant processing). This is where the desire for a process like Unix "tail -f" comes to mind. It watches the file for changes, emitting new data as it is appended to the source file. I'd hate to do something like a 'diff' every 5 minutes as this is inefficient and when scaled to thousands of tables will be prohibitively expensive.
One possible solution to your problem is to store each new CSV record as a separate blob.
You will then be able to use Azure Event Grid to raise events when a new blob is created in Blob Storage i.e. use Azure Blob Storage as Event Grid source.
The basic idea is to store the changed CSV data as new blob and have Event Grid wired to Blob Created event. An Azure Function can listen to these events and then only process the new data. For auditing purposes, you can save this data in a separate Append Blob once the CSV processing has been completed.