I want to create an Apache Spark DataFrame from an S3 resource. I've tried both AWS S3 and IBM Cloud Object Storage, and both fail with
org.apache.spark.util.TaskCompletionListenerException: Premature end of Content-Length delimited message body (expected: 2,250,236; received: 16,360)
I'm running pyspark with
./pyspark --packages com.amazonaws:aws-java-sdk-pom:1.11.828,org.apache.hadoop:hadoop-aws:2.7.0
I'm setting the S3 configuration for IBM with
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "xx")
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "xx")
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.eu-de.cloud-object-storage.appdomain.cloud")
Or AWS with
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "xx")
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", " xx ")
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.us-west-2.amazonaws.com")
In both cases the following code:
df=spark.read.csv("s3a://drill-test/cases.csv")
fails with the exception
org.apache.spark.util.TaskCompletionListenerException: Premature end of Content-Length delimited message body (expected: 2,250,236; received: 16,360)
First of all, take a look at the exception class itself: it doesn't carry much information.
https://spark.apache.org/docs/1.2.2/api/java/org/apache/spark/util/TaskCompletionListenerException.html
One cause I can think of is a permission error on the object, which can happen on both S3 and IBM Cloud. Are you accessing a public link or a private one? If it's private, you should dig deeper into the object's permissions.
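One quick way to dig into those permissions is to request the object's HTTPS URL directly (for example with `curl -I`) outside of Spark: a 403 there points at a permission problem rather than a Spark problem. A minimal sketch that builds that URL from the s3a path and the endpoint you configured (the helper name is my own, and it assumes virtual-hosted-style addressing, i.e. bucket.endpoint/key):

```python
from urllib.parse import urlparse

def s3a_to_https(s3a_path, endpoint):
    """Build the HTTPS object URL for an s3a:// path (hypothetical helper).

    Assumes virtual-hosted-style addressing: https://<bucket>.<endpoint>/<key>
    """
    parsed = urlparse(s3a_path)
    bucket = parsed.netloc
    key = parsed.path.lstrip("/")
    return f"https://{bucket}.{endpoint}/{key}"

# Probe this URL with `curl -I`: a 403 suggests a permission problem,
# a 404 suggests the bucket or key name is wrong.
print(s3a_to_https("s3a://drill-test/cases.csv",
                   "s3.eu-de.cloud-object-storage.appdomain.cloud"))
```

If the HEAD request succeeds there but Spark still fails, the problem is more likely in the Spark/Hadoop S3A setup than in the object's permissions.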