Issue downloading/parsing ORC File from S3, or from Local Path


I have an application deployed that is supposed to download and parse an ORC file from an S3 bucket.

I have tried multiple things. One of them is downloading the file locally in the app and then trying to create a Reader via the ORC library's createReader method, passing an org.apache.hadoop.fs.Path that points to the local file. But every time I'm getting:

- Unknown error occurred
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.LocalFileSystem not found

My code is:

final GetObjectRequest objectRequest = GetObjectRequest.builder()
                                                       .bucket(s3Bucket)
                                                       .key(fullPath)
                                                       .build();
    try (final ResponseInputStream<GetObjectResponse> responseInputStream = s3Client.getObject(objectRequest);
        final FileOutputStream fileOutputStream = new FileOutputStream(downloadPath)) {

      IOUtils.copyLarge(responseInputStream, fileOutputStream);

      Configuration conf = new Configuration();
      conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
      // "fs.file.impl" is the key for local file:// paths; I had originally set
      // "fs.https.impl" twice by mistake, so LocalFileSystem was never registered
      conf.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());

      return createReader(new Path(downloadPath.toString()), readerOptions(conf));
    }

But I am still getting the error. This would have been much easier with a CSV and a BufferedReader, but unfortunately that is not the case. I also don't want to read the file line by line from S3 and copy its contents to a temporary file, as that would hurt the application's performance.
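I have also considered skipping the local download entirely and pointing the reader at an s3a:// path. This is only a sketch: it assumes the hadoop-aws dependency (which provides S3AFileSystem) is on the classpath and that AWS credentials come from the default provider chain; s3Bucket and fullPath are the same variables used above.

```java
// Sketch: requires the hadoop-aws module in the pom.
Configuration conf = new Configuration();
conf.set("fs.s3a.impl", org.apache.hadoop.fs.s3a.S3AFileSystem.class.getName());

Path s3Path = new Path("s3a://" + s3Bucket + "/" + fullPath);
try (Reader reader = OrcFile.createReader(s3Path, OrcFile.readerOptions(conf))) {
  // The ORC reader seeks within the object and fetches stripes on demand,
  // so no temporary file is created.
  System.out.println("Rows: " + reader.getNumberOfRows());
}
```

I haven't been able to verify whether this runs into the same FileSystem lookup problem, though.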

I do have the orc dependency in my pom, as well as the hadoop-common one.
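One thing I now suspect (this is an assumption on my part): if the app is packaged as a shaded/fat jar, shading can overwrite hadoop-common's META-INF/services files, so Hadoop's ServiceLoader-based FileSystem lookup no longer finds LocalFileSystem even though the class is on the classpath. A sketch of the fix I plan to try, assuming the Maven Shade Plugin is doing the packaging:

```xml
<!-- ServicesResourceTransformer merges META-INF/services entries from all
     jars instead of letting one jar's file overwrite another's, so Hadoop's
     FileSystem implementations stay registered in the shaded jar. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <transformers>
      <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
    </transformers>
  </configuration>
</plugin>
```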

Any kind of help would be greatly appreciated. Thanks!
