I am processing approximately 19,710 directories of IIS log files in an Azure Synapse Spark notebook. Each directory contains 3 IIS log files. The notebook reads the 3 files in a directory and converts them from delimited text to Parquet, with no partitioning. But occasionally, for no apparent reason, I get the following two errors.
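For context, the conversion step looks roughly like the sketch below. This is a simplified reconstruction, not the exact notebook: the paths, storage account name, and reader options are hypothetical placeholders (the real values arrive as pipeline parameters), and it assumes the predefined `spark` SparkSession that Synapse exposes in %%csharp cells.

```csharp
%%csharp
using Microsoft.Spark.Sql;

// Hypothetical paths/options -- the real notebook takes these as parameters.
// IIS W3C logs are space-delimited text with "#"-prefixed header lines.
DataFrame dfparquetTemp = spark.Read()
    .Option("delimiter", " ")
    .Option("comment", "#")   // skip the #Fields / #Date header lines
    .Csv("abfss://raw@<storageaccount>.dfs.core.windows.net/iislogs/<directory>/*.log");

dfparquetTemp.Write()
    .Mode(SaveMode.Overwrite)
    .Parquet("abfss://raw@<storageaccount>.dfs.core.windows.net/parquet/<directory>/");
```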
{
"errorCode": "2011",
"message": "An error occurred while sending the request.",
"failureType": "UserError",
"target": "Call Convert IIS To Raw Data Parquet",
"details": []
}
When I get the error above, all of the data was successfully written to the appropriate folder in Azure Data Lake Storage Gen2.
{
"errorCode": "6002",
"message": "(3,17): error CS0234: The type or namespace name 'Spark' does not exist in the namespace 'Microsoft' (are you missing an assembly reference?)\n(4,17): error CS0234: The type or namespace name 'Spark' does not exist in the namespace 'Microsoft' (are you missing an assembly reference?)\n(12,13): error CS0103: The name 'spark' does not exist in the current context",
"failureType": "UserError",
"target": "Call Convert IIS To Raw Data Parquet",
"details": []
}
When I get the error above, none of the data was written to the appropriate folder in Azure Data Lake Storage Gen2.
In both cases you can see that the notebook did run for a period of time. I have enabled 1 retry on the Spark notebook. It is a PySpark notebook that uses Python for the parameters, with the remainder of the logic written in C# via %%csharp cells. The Spark pool is small (4 cores / 32 GB) with 5 nodes.
The only conversion going on in the notebook is converting a string column to a timestamp.
var dfConverted = dfparquetTemp.WithColumn("Timestamp", Col("Timestamp").Cast("timestamp"));
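For reference, here is roughly what the full cell looks like. The two `using` directives are the ones the CS0234 errors in the second failure complain about when `Microsoft.Spark` fails to resolve; `Col` only compiles because of the static import of `Functions`, and `dfparquetTemp` comes from an earlier cell.

```csharp
%%csharp
// `spark` is the SparkSession Synapse predefines in %%csharp cells;
// error 6002 (CS0234 / CS0103) is the compiler failing to find these.
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

// Cast the string column to a proper Spark timestamp type.
var dfConverted = dfparquetTemp.WithColumn("Timestamp", Col("Timestamp").Cast("timestamp"));
```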
To illustrate how random this is: the pipeline is currently running, and after processing 215 directories there have been 2 occurrences of the first failure and 1 of the second.
Any ideas or suggestions would be appreciated.
OK, after running for 113 hours (it's almost done) I am still getting the following errors, but it looks like all of the data was written out:
Count 1
Count 1
Count 17
I'm not sure what these errors are about, and of course I will rerun the specific data in the pipeline to see whether this is a one-off or keeps occurring on this specific data. But it seems as if these errors are occurring after the data has been written to Parquet format.