Add Azure Blob Partitions to Azure SQL Table


I have partitioned parquet files in Azure Blob that I am copying to Azure SQL. How do I get the partition name into the SQL table?

I've figured out how to get the full file path into the SQL table by adding an Additional Column in the source data section of the Copy Activity (image 1 & 2), but I'm trying to figure out how to regex the full file path down to just the partition name (202105).

In the data preview for the source data in the Copy Activity, it shows the time_period column with just the partition name (image 3). But when it shows up in SQL, it is NULL for all rows (or it's the full file path, depending on whether I added Additional Columns in the source data section of the Copy Activity).

Image 1: using $$FILEPATH as an Additional Column

Image 2: setting the parameter for the file path

Image 3: the time_period column showing up in the source data preview

I've tried changing the data type for time_period to an INT in Azure SQL. I've tried parsing the $$FILEPATH, but nothing I've tried has worked.

I'm basically starting from scratch, as I'm sure there's a better way. Extra background here and possibly here.

Similar to this

1 Answer

As explained here in the MS docs, you can utilize the enablePartitionDiscovery feature.

Source: partitioned files:


Source Dataset:

Just mention the container name and leave the directory and file fields empty. We will filter the files using wildcard paths in the Copy Activity.

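For reference, the dataset JSON ends up looking roughly like this. The dataset and linked service names (ParquetPartitioned, AzureBlobStorageLS) are just placeholders from my setup; the point is that the only location property filled in is the container:

{
    "name": "ParquetPartitioned",
    "properties": {
        "linkedServiceName": {
            "referenceName": "AzureBlobStorageLS",
            "type": "LinkedServiceReference"
        },
        "type": "Parquet",
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "sink"
            }
        },
        "schema": []
    }
}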

Configure the source in the Copy Activity with respect to your file path:

Note: you can skip step 4, i.e. the additional column with $$FILEPATH; it is shown only for reference. You can drop it, since enablePartitionDiscovery already gives you the ready-made column.
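If you do keep it (or if you go with the stored-procedure alternative at the end), the additional column is just an entry under additionalColumns in the Copy Activity source; the column name (file_path here, matching my run) is whatever you typed in the UI, and $$FILEPATH is the reserved variable:

"source": {
    "type": "ParquetSource",
    "additionalColumns": [
        {
            "name": "file_path",
            "value": "$$FILEPATH"
        }
    ]
}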


To pick up a single folder, set the wildcard path as below.

Wildcard paths: sink/columnparts/time_period=202105/*.parquet

For multiple folders (time_period=202105, time_period=202106, ... as seen in the previous snip), set it as below; ** takes the place of any folder under the parent folder columnparts.

Wildcard paths: sink/columnparts/**/*.parquet

Partition root path: this should point to the parent folder under which all the partition folders sit.

In my example: sink/columnparts

The partition root path must be provided when you enable partition discovery.

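For reference, here is a rough sketch of how these settings end up in the Copy Activity source JSON. The container (sink) comes from the dataset, so the wildcard folder path is relative to it, and the partition root path is written the way the UI shows it (sink/columnparts); treat the exact strings as illustrative for my folder layout rather than something to copy verbatim:

"source": {
    "type": "ParquetSource",
    "storeSettings": {
        "type": "AzureBlobStorageReadSettings",
        "recursive": true,
        "wildcardFolderPath": "columnparts/**",
        "wildcardFileName": "*.parquet",
        "enablePartitionDiscovery": true,
        "partitionRootPath": "sink/columnparts"
    },
    "formatSettings": {
        "type": "ParquetReadSettings"
    }
}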

Sink: optionally update an existing table or just auto-create a new one.

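Nothing special is needed on the sink side; a minimal sketch of the sink JSON, assuming you let the copy auto-create the table, is:

"sink": {
    "type": "AzureSqlSink",
    "writeBehavior": "insert",
    "tableOption": "autoCreate"
}

If you write into an existing table instead, make sure it has a time_period column and that the mapping has been refreshed (see the error note further down).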

View from SQL DB: time_period column holds the value 202105

time_period=202105/part-00004-fcbe0bf5-2c93-45f5-9bb2-2f9089a3e83a-c000.snappy.parquet


If you see an error at this point, you most likely have a mapping that is not updated. In the mapping section, you can clear or reset the schema and import the schema again just to be sure.

In my case, the culprit was the additional column file_path.


--OR--

$$FILEPATH is a reserved variable; you cannot use it in the expression builder or manipulate it with functions.

Instead, you can incorporate a step after the copy to the SQL DB, i.e. a stored procedure as below.

Here the column path holds the full file path received from $$FILEPATH, as you have already managed, and StoreParquetTest is the table created in the SQL sink.

CREATE PROCEDURE trimpath
AS
    -- Keep only the text between '=' and the first '/',
    -- e.g. 'time_period=202105/part-00004-...snappy.parquet' -> '202105'
    UPDATE StoreParquetTest
    SET path = SUBSTRING(path, CHARINDEX('=', path) + 1, CHARINDEX('/', path) - CHARINDEX('=', path) - 1);
GO

Now you can use the Stored Procedure activity in the pipeline after the Copy Activity.
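A sketch of how that activity sits in the pipeline JSON, chained after the copy; the activity name CopyToSql and the linked service name AzureSqlDatabaseLS are placeholders for whatever yours are called:

{
    "name": "TrimPath",
    "type": "SqlServerStoredProcedure",
    "dependsOn": [
        {
            "activity": "CopyToSql",
            "dependencyConditions": [ "Succeeded" ]
        }
    ],
    "linkedServiceName": {
        "referenceName": "AzureSqlDatabaseLS",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "storedProcedureName": "[dbo].[trimpath]"
    }
}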
