I am trying to use Azure Data Factory to load data from Azure Blob Storage into an Azure SQL Data Warehouse. The relevant data is below:

source csv:

1,james,
2,john,usa

sink table:

CREATE TABLE test_null (
    id int NOT NULL,
    name nvarchar(128)  NULL,
    address nvarchar(128)  NULL
)

source dataset:

{
    "name": "test_null_input",
    "properties": {
        "linkedServiceName": {
            "referenceName": "StagingBlobStorage",
            "type": "LinkedServiceReference"
        },
        "annotations": [],
        "type": "DelimitedText",
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "fileName": "1.csv",
                "folderPath": "test_null",
                "container": "adf"
            },
            "columnDelimiter": ",",
            "escapeChar": "",
            "firstRowAsHeader": false,
            "quoteChar": ""
        },
        "schema": []
    }
}

sink dataset:

{
    "name": "test_null_output",
    "properties": {
        "linkedServiceName": {
            "referenceName": "StagingAzureSqlDW",
            "type": "LinkedServiceReference"
        },
        "annotations": [],
        "type": "AzureSqlDWTable",
        "schema": [
            {
                "name": "id",
                "type": "int",
                "precision": 10
            },
            {
                "name": "name",
                "type": "nvarchar"
            },
            {
                "name": "address",
                "type": "nvarchar"
            }
        ],
        "typeProperties": {
            "schema": "dbo",
            "table": "test_null"
        }
    }
}

pipeline

{
    "name": "test_input",
    "properties": {
        "activities": [
            {
                "name": "Copy data1",
                "type": "Copy",
                "dependsOn": [],
                "policy": {
                    "timeout": "0.12:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "userProperties": [],
                "typeProperties": {
                    "source": {
                        "type": "DelimitedTextSource",
                        "storeSettings": {
                            "type": "AzureBlobStorageReadSettings",
                            "recursive": true,
                            "enablePartitionDiscovery": false
                        },
                        "formatSettings": {
                            "type": "DelimitedTextReadSettings"
                        }
                    },
                    "sink": {
                        "type": "SqlDWSink",
                        "allowPolyBase": true,
                        "polyBaseSettings": {
                            "rejectValue": 0,
                            "rejectType": "value",
                            "useTypeDefault": false,
                            "treatEmptyAsNull": true
                        }
                    },
                    "enableStaging": false,
                    "translator": {
                        "type": "TabularTranslator",
                        "mappings": [
                            {
                                "source": {
                                    "ordinal": 1
                                },
                                "sink": {
                                    "name": "id"
                                }
                            },
                            {
                                "source": {
                                    "ordinal": 2
                                },
                                "sink": {
                                    "name": "name"
                                }
                            },
                            {
                                "source": {
                                    "ordinal": 3
                                },
                                "sink": {
                                    "name": "address"
                                }
                            }
                        ]
                    }
                },
                "inputs": [
                    {
                        "referenceName": "test_null_input",
                        "type": "DatasetReference"
                    }
                ],
                "outputs": [
                    {
                        "referenceName": "test_null_output",
                        "type": "DatasetReference"
                    }
                ]
            }
        ],
        "annotations": []
    }
}

The last column of the first row is null, so when the pipeline runs it fails with the error below:

ErrorCode=UserErrorInvalidColumnMappingColumnNotFound,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Invalid column mapping provided to copy activity: '{"Prop_0":"id","Prop_1":"name","Prop_2":"address"}', Detailed message: Column 'Prop_2' defined in column mapping cannot be found in Source structure.. Check column mapping in table definition.,Source=Microsoft.DataTransfer.Common,'

I tried setting treatEmptyAsNull to true; still the same error. Setting skipLineCount to 1 makes it work, so the null in the last column of the first row seems to affect loading of the entire file. Stranger still, enabling staging also works, even without setting treatEmptyAsNull or skipLineCount. In my scenario staging is unnecessary, since the copy is already from blob to data warehouse; copying blob to blob and then blob to data warehouse seems unreasonable and incurs additional data movement charges. I don't understand why setting treatEmptyAsNull doesn't work, and why enabling staging does. This seems to make no sense.

1 Answer

Answer by Rakesh Govindula:

I reproduced the issue with your pipeline JSON and got the same error.

This error occurs because, per your JSON, the copy activity maps the source columns Prop_0, Prop_1, and Prop_2 to id, name, and address.

As per that mapping, your source should have Prop_0, Prop_1, and Prop_2 as headers. Because you did not check First row as header on your source file, ADF generates default headers (Prop_0, Prop_1, ...) from the first data row. Since the last value in your first row is null, no Prop_2 column is generated, and that is why the error is reported for that column.
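As a side check (a sketch using Python's standard csv module, not part of ADF), parsing the raw text shows that the first row really does contain a third, empty field; it is the default-header inference that ends up without a Prop_2 column:

```python
import csv
import io

# The raw source rows from the question's CSV file.
raw = "1,james,\n2,john,usa\n"

rows = list(csv.reader(io.StringIO(raw)))
print(rows[0])  # ['1', 'james', ''] -- three fields; the last is an empty string
print(rows[1])  # ['2', 'john', 'usa']
```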

To resolve it, give your CSV file a proper header row, like below.

(Screenshot: the CSV file with a header row added.)
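Concretely, the corrected file (header names matching the dbo.test_null sink table) would be:

```
id,name,address
1,james,
2,john,usa
```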

Then check First row as header in the source dataset. When you import the schema, it will give a mapping like below.

(Screenshot: the imported mapping, with id, name, and address mapped by name.)
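For reference, after this change the source dataset would carry "firstRowAsHeader": true in its typeProperties, and the copy activity's translator would map by name instead of by ordinal. A sketch of the resulting mapping (assuming the header names above; only the modified part is shown):

```json
"translator": {
    "type": "TabularTranslator",
    "mappings": [
        { "source": { "name": "id" },      "sink": { "name": "id" } },
        { "source": { "name": "name" },    "sink": { "name": "name" } },
        { "source": { "name": "address" }, "sink": { "name": "address" } }
    ]
}
```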

Now it will execute successfully, as it did for me.

Result:

You can see that the empty value is loaded as NULL in the target table.

(Screenshot: the target table, with NULL in the address column for id 1.)
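To verify, a quick query against the sink (T-SQL, matching the table defined above):

```sql
SELECT id, name, address
FROM dbo.test_null
ORDER BY id;

-- id  name   address
-- 1   james  NULL
-- 2   john   usa
```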