Databricks spark-xml fails to read gzipped XML when ContentType is application/octet-stream


Reading a gzipped XML file with spark.read.format("xml").option("rowTag", "").load("s3:///<file.gz>").display() returned "OK" (i.e., no data).

After downloading the same file and re-uploading it to the exact same location, re-running the same command returned a table with data.

After some investigation, I saw that fetching the original file's metadata with aws s3api head-object --bucket <bucket> --key <key> returned:

{
    "AcceptRanges": "bytes",
    "Expiration": "expiry-date=\"Sun, 29 Oct 2023 00:00:00 GMT\", rule-id=\"delete_after_10_days\"",
    "LastModified": "2023-10-18T02:16:36+00:00",
    "ContentLength": 24663,
    "ETag": "\"9292bc4c2d7d4c9ed32389ea2de964ce\"",
    "ContentEncoding": "gzip",
    "ContentType": "application/octet-stream",
    "ServerSideEncryption": "AES256",
    "Metadata": {}
}
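One hypothesis (not confirmed here) is that because ContentEncoding is gzip, the HTTP client fetching the object transparently decompresses the body, so Spark sees a .gz path whose bytes are already plain XML and the gzip codec fails silently. A quick local check is to sniff the gzip magic bytes of whatever payload the client actually delivers; this is a minimal sketch that assumes you have the object's bytes in hand:

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)

def looks_gzipped(data: bytes) -> bool:
    """Return True if the payload still starts with the gzip magic bytes."""
    return data[:2] == GZIP_MAGIC

# A genuinely gzipped payload vs. XML that was already decompressed in transit
compressed = gzip.compress(b"<books><book/></books>")
print(looks_gzipped(compressed))          # True: still a gzip stream
print(looks_gzipped(b"<books></books>"))  # False: plain XML bytes
```

If the bytes Spark receives fail this check while the object in S3 passes it, the mismatch points at the transport layer honoring ContentEncoding rather than at spark-xml itself.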

and this is the metadata after the re-upload:

{
    "AcceptRanges": "bytes",
    "Expiration": "expiry-date=\"Tue, 31 Oct 2023 00:00:00 GMT\", rule-id=\"delete_after_10_days\"",
    "LastModified": "2023-10-20T14:36:30+00:00",
    "ContentLength": 24958,
    "ETag": "\"ca8f73c5f9dba53eda22913ecc94632a\"",
    "ContentType": "application/x-gzip",
    "ServerSideEncryption": "AES256",
    "Metadata": {}
}

Note the ContentEncoding and ContentType in the first case, and the ContentType in the second case.

Somehow, Spark fails to read the data when ContentEncoding is gzip and/or ContentType is application/octet-stream.
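If the metadata itself turns out to be the culprit, a possible workaround (untested here, but using the standard boto3 copy_object API) is to rewrite the object's metadata in place: an S3 copy onto the same key with MetadataDirective="REPLACE" lets you set ContentType to application/x-gzip and drop the ContentEncoding header entirely, without re-uploading the data. The bucket and key names below are placeholders:

```python
def fix_gzip_metadata(bucket: str, key: str) -> None:
    """Copy the object onto itself, replacing its metadata so that
    ContentEncoding is dropped and ContentType marks it as gzip.
    Requires AWS credentials; boto3 is imported lazily at call time."""
    import boto3

    s3 = boto3.client("s3")
    s3.copy_object(
        Bucket=bucket,
        Key=key,
        CopySource={"Bucket": bucket, "Key": key},
        MetadataDirective="REPLACE",       # discard the original headers
        ContentType="application/x-gzip",  # ContentEncoding intentionally not set
    )

def needs_fix(head: dict) -> bool:
    """Decide from a head-object response whether the object matches the
    problematic pattern observed above."""
    return (head.get("ContentEncoding") == "gzip"
            or head.get("ContentType") == "application/octet-stream")

# fix_gzip_metadata("my-bucket", "path/to/file.gz")  # hypothetical names
```

One caveat: a self-copy with REPLACE also discards any user-defined metadata, so anything in the Metadata map you want to keep must be passed again explicitly.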

Any ideas?
