Why is the Zeppelin notebook not able to connect to S3?

I have installed Zeppelin on my AWS EC2 machine to connect to my Spark cluster.

Spark Version: Standalone: spark-1.2.1-bin-hadoop1.tgz

I am able to connect to the Spark cluster, but I get the following error when trying to access a file in S3.

Code:

    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_KEY_ID")
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey","YOUR_SEC_KEY")
    val file = "s3n://<bucket>/<key>"
    val data = sc.textFile(file)
    data.count


    file: String = s3n://<bucket>/<key>
    data: org.apache.spark.rdd.RDD[String] = s3n://<bucket>/<key> MappedRDD[1] at textFile at <console>:21
    java.lang.NoSuchMethodError: org.jets3t.service.impl.rest.httpclient.RestS3Service.<init>(Lorg/jets3t/service/security/AWSCredentials;)V
        at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:55)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)

I built Zeppelin with the following command:

    mvn clean package -Pspark-1.2.1 -Dhadoop.version=1.0.4 -DskipTests

When I try to build with the Hadoop profile "-Phadoop-1.0.4", Maven warns that the profile does not exist.

I have also tried -Phadoop-1, which the Spark website lists as the profile for Hadoop 1.x to 2.1.x, but got the same error.

Please let me know what I am missing here.

There are 2 best solutions below

0
On

The following installation worked for me (I also spent many days on this problem):

  1. Set up Spark 1.3.1, prebuilt for Hadoop 2.3, on an EC2 cluster

  2. git clone https://github.com/apache/incubator-zeppelin.git (date: 25.07.2015)

  3. Built Zeppelin with the following command (following the instructions at https://github.com/apache/incubator-zeppelin):

    mvn clean package -Pspark-1.3 -Dhadoop.version=2.3.0 -Phadoop-2.3 -DskipTests

  4. Changed the Zeppelin port to 8082 in "conf/zeppelin-site.xml" (Spark uses port 8080)

After these installation steps, my notebook worked with S3 files:

    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "xxx")
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "xxx")
    val file = "s3n://<<bucket>>/<<file>>"
    val data = sc.textFile(file)
    data.first

I think the S3 problem is not completely resolved in Zeppelin version 0.5.0, so cloning the current Git repository did it for me.

Important: the job only worked for me with the Zeppelin Spark interpreter setting master=local[*] (instead of spark://master:7777).
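
A NoSuchMethodError like the one in the question means that the RestS3Service class loaded at runtime lacks the constructor the Hadoop S3 code was compiled against, which points to mismatched jets3t and Hadoop jars on the interpreter classpath. As a rough sketch (not part of Zeppelin or Spark), the hypothetical helper `whereLoaded` below can be run in a notebook paragraph to see which jar each class actually comes from:

```scala
import scala.util.{Failure, Success, Try}

// Hypothetical helper: report where the JVM loaded a class from.
def whereLoaded(className: String): String =
  Try(Class.forName(className)) match {
    case Failure(_) =>
      s"$className: not on the classpath"
    case Success(cls) =>
      // getCodeSource is null for classes loaded by the bootstrap
      // class loader (for example java.lang.String).
      Option(cls.getProtectionDomain.getCodeSource)
        .flatMap(cs => Option(cs.getLocation))
        .map(loc => s"$className: $loc")
        .getOrElse(s"$className: loaded by the bootstrap class loader")
  }

// The two classes named in the stack trace from the question.
println(whereLoaded("org.jets3t.service.impl.rest.httpclient.RestS3Service"))
println(whereLoaded("org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore"))
```

If the two locations point at jars from different Hadoop builds, rebuilding Zeppelin against the Hadoop version of the cluster, as in the steps above, is the likely fix.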

0
On

For me it worked in two steps:

  1. Create the sqlContext:

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

  2. Read the S3 files like this:

    val performanceFactor = sqlContext.read.parquet("s3n://<accessKey>:<secretKey>@mybucket/myfile/")

You need to supply your own access key and secret key. In step 2, I use the s3n protocol and put the access and secret keys in the path itself.
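
Putting the keys in the path works, but the keys can then show up wherever the path is printed or shared. A minimal sketch of an alternative, assuming the keys are exported as the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY and are visible to the Zeppelin interpreter process (both are assumptions, and `s3nSettings` is a hypothetical helper); the property names are the ones used in the question:

```scala
// Hypothetical helper: build the s3n credential settings from a map of
// environment variables, or None if either variable is missing.
def s3nSettings(env: Map[String, String]): Option[Map[String, String]] =
  for {
    id  <- env.get("AWS_ACCESS_KEY_ID")      // assumed variable name
    key <- env.get("AWS_SECRET_ACCESS_KEY")  // assumed variable name
  } yield Map(
    "fs.s3n.awsAccessKeyId"     -> id,
    "fs.s3n.awsSecretAccessKey" -> key
  )

// In a Zeppelin paragraph, where `sc` is the SparkContext provided by the
// interpreter, the settings could be applied like this:
//
//   s3nSettings(sys.env) match {
//     case Some(settings) =>
//       settings.foreach { case (k, v) => sc.hadoopConfiguration.set(k, v) }
//     case None =>
//       println("AWS credentials are not set in the environment")
//   }
```

With the credentials set this way, the path in step 2 can be written without keys, for example "s3n://mybucket/myfile/". Note that `sqlContext.read` is available from Spark 1.4 onward; on older versions such as the 1.2.1 release in the question, `sqlContext.parquetFile(path)` was the equivalent call.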