I am processing approximately 19,710 directories containing IIS log files in an Azure Synapse Spark notebook. There are 3 IIS log files in each directory. The notebook reads the 3 files in the directory and converts them from text-delimited format to Parquet, with no partitioning. But occasionally I get the following two errors for no apparent reason.
{
    "errorCode": "2011",
    "message": "An error occurred while sending the request.",
    "failureType": "UserError",
    "target": "Call Convert IIS To Raw Data Parquet",
    "details": []
}
When I get the error above, all of the data has been successfully written to the appropriate folder in Azure Data Lake Storage Gen2.
{
    "errorCode": "6002",
    "message": "(3,17): error CS0234: The type or namespace name 'Spark' does not exist in the namespace 'Microsoft' (are you missing an assembly reference?)\n(4,17): error CS0234: The type or namespace name 'Spark' does not exist in the namespace 'Microsoft' (are you missing an assembly reference?)\n(12,13): error CS0103: The name 'spark' does not exist in the current context",
    "failureType": "UserError",
    "target": "Call Convert IIS To Raw Data Parquet",
    "details": []
}
When I get the error above, none of the data has been written to the appropriate folder in Azure Data Lake Storage Gen2.
In both cases you can see that the notebook did run for a period of time. I have enabled 1 retry on the Spark notebook. It is a PySpark notebook that uses Python for the parameters, with the remainder of the logic written in C# via %%csharp cells. The Spark pool is small (4 cores / 32 GB) with 5 nodes.
The only conversion going on in the notebook is converting a string column to a timestamp.
var dfConverted = dfparquetTemp.WithColumn("Timestamp", Col("Timestamp").Cast("timestamp"));
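For context, a minimal sketch of how that cast sits in the read/write flow, assuming the Microsoft.Spark (.NET for Apache Spark) API; the output path and variable names here are illustrative, not the exact ones from my notebook:

```csharp
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

// dfparquetTemp is the data frame built from the parsed IIS log rows.
// Cast the string column to a Spark timestamp, then write Parquet.
var dfConverted = dfparquetTemp.WithColumn("Timestamp", Col("Timestamp").Cast("timestamp"));
dfConverted.Write()
    .Mode(SaveMode.Overwrite)
    .Parquet("abfss://raw@mylake.dfs.core.windows.net/iislogs/out"); // hypothetical path
```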
To illustrate how random this is: the pipeline is currently running, and after processing 215 directories there have been 2 occurrences of the first failure and one of the second.
Any ideas or suggestions would be appreciated.


Well, I think this is part of the issue. Keep in mind that I am writing the main part of the logic in C#, so your mileage may vary in another language. Also, these are space-delimited IIS log files, and they can be multiple megabytes in size; one file could be 30 MB.
My new code has been running for 17 hours without a single error. All of the changes I made were to ensure that I disposed of resources that consume memory. Examples follow.
When reading a text-delimited file as a binary file, the data in the byte[] eventually gets loaded into a List&lt;GenericRow&gt;, but I never set the variable rawData to null. So, after filling the List&lt;GenericRow&gt; from the byte[], I added code to set rawData to null so the buffer could be garbage collected.
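A sketch of that step, under the assumption that Spark's "binaryFile" source is used to read the raw bytes; rawData and filePath are my variable names, and the parsing itself is elided:

```csharp
using System.Linq;
using Microsoft.Spark.Sql;

// Read the log file as a single binary row; "content" holds the file bytes.
DataFrame dfBinary = spark.Read().Format("binaryFile").Load(filePath);
byte[] rawData = dfBinary.Collect().First().GetAs<byte[]>("content");

// ... parse rawData into List<GenericRow> rows ...

// Release the reference so the multi-megabyte buffer is eligible for collection.
rawData = null;
```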
After fully loading all of the data from the byte[] into List&lt;GenericRow&gt; rows and adding it to a data frame, I cleared out the rows variable. Finally, after changing the column type and writing out the data, I called Unpersist on the data frame.
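A sketch of those two steps; the schema fields here are illustrative, and outputPath is hypothetical — the point is clearing the row list once Spark owns the data, and unpersisting after the write:

```csharp
using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Types;
using static Microsoft.Spark.Sql.Functions;

// Illustrative schema; the real IIS log schema has more columns.
var schema = new StructType(new[]
{
    new StructField("Timestamp", new StringType()),
    new StructField("UriStem", new StringType()),
});

DataFrame dfparquetTemp = spark.CreateDataFrame(rows, schema);
rows.Clear(); // release the parsed rows once the data frame has them

var dfConverted = dfparquetTemp.WithColumn("Timestamp", Col("Timestamp").Cast("timestamp"));
dfConverted.Write().Mode(SaveMode.Append).Parquet(outputPath);
dfConverted.Unpersist(); // drop any cached copy of the frame
```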
Finally, I moved most of my logic into a C# method that gets called in a foreach loop, in the hope that the CLR will dispose of anything else I missed.
And last but not least, a lesson learned.
In order to process multiple text-delimited files from a folder, I had to pass in the names of the individual files and process the first file with SaveMode.Overwrite and the remaining files with SaveMode.Append. Every attempt to use any kind of wildcard, or to specify only the directory name, resulted in just one file being read into the data frame. (Trust me here: after hours of GoogleFu I tried every method I could find.)
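The pattern above can be sketched like this, assuming fileNames holds the explicit paths of the three log files and ConvertFile is a hypothetical method wrapping the read/convert/write logic:

```csharp
using Microsoft.Spark.Sql;

// First file replaces any existing output; the rest append to it.
bool first = true;
foreach (string fileName in fileNames)
{
    SaveMode mode = first ? SaveMode.Overwrite : SaveMode.Append;
    ConvertFile(fileName, outputFolder, mode);
    first = false;
}
```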
Again, 17 hours into processing without a single error, so one important lesson seems to be to keep your memory usage as low as possible.