I have partitioned parquet files in Azure Blob that I am copying to Azure SQL. How do I get the partition name into the SQL table?
I've figured out how to get the full file path into the SQL table by adding an Additional Column in the source data section of the Copy Activity (image 1 & 2), but I'm trying to figure out how to regex the full file path down to just the partition name (202105).
In the data preview for the source data in the Copy Activity, it shows the time_period column with just the partition name (image 3). But when it shows up in SQL it is NULL for all rows (or it's the full file path, depending on if I added Additional Columns in the source data section of the Copy Activity).
I've tried changing the data type for time_period to an INT in Azure SQL. I've tried parsing the $$FILEPATH
, but nothing I've tried has worked.
I'm basically starting from scratch as I'm sure there's a better. Extra background here and possibly here.
Similar to this
As explained here in MS doc, you can utilize
enablePartitionDiscovery
feature.Source: partitioned files:
Source Dataset:
Just mentioned the container name and leave the directory and file fields empty. We shall filter them using
WildCard paths
in Copy Activity.Configure source in
Copy Activity
with respect to your files path:Note: you can skip the step 4 i.e. additional column with
$$FILEPATH
, just shown for reference. You can drop this bit as you already get the ready column usingenablePartitionDiscovery
.For a single folder to be picked, you will set as below.
Wildcard paths:
sink / columnparts / time_period=202105 / *.parquet
For multiple folders
time_period=202105
,time_period=202106
..... as seen in previous sinp, set as below.**
will take the place of any folder in the parent foldercolumnparts
Wildcard paths:
sink / columnparts / ** / *.parquet
Partition root Path: This should point to the parent folder where all the partitioned folders rest.
In my example:
sink/columnparts
partition root path must be provided when you enable partition discovery.
Sink: Optional update existing table or just create a new one.
View from SQL DB:
time_period
column holds the value202105
If you see this error:
You have a mapping that is not updated! In the mapping section, you can
clear
orreset
schema andImport schema
again just to be sure.In my case it was additional column
file_path
--OR--
$$FILEPATH
is a reserved variable, you cannot use it in expression builder or in functions to manipulate.Instead if you can incorporate a step after you copy to SQL DB i.e. use a stored procedure as below.
Where column
path
holds the full file path received from$$FILEPATH
as you have managed already.StoreParquetTest
is the table created in SQLsink
Now you can use the stored procedure activity in Pipeline after Copy Activity.