Unable to read S3 files from within AWS EMR Studio notebooks or consoles


We have an EMR Studio with a default S3 location set, i.e. s3://OurBucketName/Subdirectory/work, within which we've created a Workspace that is attached to an EMR on EC2 cluster running emr-6.10.0 with the following applications installed:

  • Hadoop 3.3.3
  • Hive 3.1.3
  • Hue 4.10.0
  • JupyterEnterpriseGateway 2.6.0
  • JupyterHub 1.5.0
  • MXNet 1.9.1
  • Pig 0.17.0
  • Presto 0.278
  • Spark 3.3.1
  • TensorFlow 2.11.0
  • Zeppelin 0.10.1

We can view, read, and write files from within the (bash) Terminal in our Workspace, which appears to contain a copy of everything under the s3://OurBucketName/Subdirectory/work S3 prefix at the /home/notebook/work location. However, we cannot read or write files from within any of the consoles or notebooks.

We have tried a number of file paths, including:

  • relative: ~/data/filename.csv
  • absolute: /home/notebook/work/ProjectName/data/filename.csv
  • S3: s3://OurBucketName/Subdirectory/work/ProjectName/data/filename.csv
  • EMR Shareable Link: https://<NotebookID>.emrnotebooks-prod.us-east-1.amazonaws.com/<NotebookID>/doc/tree/ProjectName/data/filename.csv
  • EMR Download Link: https://<NotebookID>.emrnotebooks-prod.us-east-1.amazonaws.com/<NotebookID>/doc/tree/ProjectName/data/filename.csv?_xsrf=<base64>

The target file definitely exists, can be seen in the file browser on the left-hand side, and can be opened/read/modified from within the Terminal or by any scripts executed from it.

Running the following

offices <- read.csv("~/data/filename.csv", header = TRUE, sep = ",", quote = "\"",dec = ".")

from a SparkR notebook located at /home/notebook/work/ProjectName/NotebookInSparkR.ipynb returns

[1] "Error in file(file, \"rt\"): cannot open the connection\n----LIVY_END_OF_ERROR----" Warning message: In file(file, "rt") : cannot open file '~/data/filename.csv': No such file or directory,

Running the following

offices <- read.csv("/home/notebook/work/ProjectName/data/filename.csv", header = TRUE, sep = ",", quote = "\"",dec = ".")

from a SparkR notebook located at /home/notebook/work/ProjectName/NotebookInSparkR.ipynb returns

[1] "Error in file(file, \"rt\"): cannot open the connection\n----LIVY_END_OF_ERROR----" Warning message: In file(file, "rt") : cannot open file '/home/notebook/work/ProjectName/data/filename.csv': No such file or directory,

and running the following

offices <- read.csv("s3://OurBucketName/Subdirectory/work/ProjectName/data/filename.csv", header = TRUE, sep = ",", quote = "\"",dec = "."),

from a SparkR notebook located at /home/notebook/work/ProjectName/NotebookInSparkR.ipynb returns

[1] "Error in file(file, \"rt\"): cannot open the connection\n----LIVY_END_OF_ERROR----" Warning message: In file(file, "rt") : cannot open file 's3://OurBucketName/Subdirectory/work/ProjectName/data/filename.csv': No such file or directory;

In contrast, running either

offices <- read.csv("https://<NotebookID>.emrnotebooks-prod.us-east-1.amazonaws.com/<NotebookID>/doc/tree/ProjectName/data/filename.csv", header = TRUE, sep = ",", quote = "\"",dec = ".")

or

offices <- read.csv("https://<NotebookID>.emrnotebooks-prod.us-east-1.amazonaws.com/<NotebookID>/doc/tree/ProjectName/data/filename.csv?_xsrf=<base64>", header = TRUE, sep = ",", quote = "\"",dec = ".")

runs seemingly without error; however, it appears to read an HTML page rather than the file, because running

summary(offices)

from a SparkR notebook located at /home/notebook/work/ProjectName/NotebookInSparkR.ipynb returns

 X..DOCTYPE.html.
 Length:28
 Class :character
 Mode  :character
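Given that column name, this looks like the notebook server's HTML page rather than our CSV; peeking at the raw response seems to confirm it (URL is the same placeholder as above):

# If this link serves the Jupyter UI rather than the file, the first line
# of the response will be an HTML doctype, not CSV data.
first_lines <- readLines(
  "https://<NotebookID>.emrnotebooks-prod.us-east-1.amazonaws.com/<NotebookID>/doc/tree/ProjectName/data/filename.csv",
  n = 3
)
print(first_lines)  # expected to start with "<!DOCTYPE html>"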

Lastly, it appears that the associated (Python, PySpark, Spark, or SparkR) kernels are running in a container on one of the /mnt volumes, because running

getwd()

from a SparkR notebook located at /home/notebook/work/ProjectName/NotebookInSparkR.ipynb returns

/mnt1/yarn/usercache/livy/appcache/application_1678485106748_0005/container_1678485106748_0005_01_000001
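A couple of quick checks are consistent with the kernel executing inside a YARN container on a cluster node, where the Workspace's filesystem simply does not exist (a hedged sketch):

# The kernel's host is an EMR cluster node, not the Studio workspace,
# so the Workspace's /home/notebook tree is absent here.
Sys.info()[["nodename"]]            # likely an ip-10-x-x-x cluster node
file.exists("/home/notebook/work")  # expected FALSE on the cluster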

However, running

setwd("/home/notebook")

from a SparkR notebook located at /home/notebook/work/ProjectName/NotebookInSparkR.ipynb returns

[1] "Error in setwd(\"/home/notebook\"): cannot change working directory".

1 Answer

Jakub Kaplan:

We don't use EMR Studio, but rather SageMaker Studio, following this setup: https://docs.aws.amazon.com/sagemaker/latest/dg/studio-notebooks-emr-cluster.html

But I have seen your problem as well. In my case, I was trying to read some data from an S3 path, s3://bucket/path/to/file, and it kept telling me the path did not exist even though I was dead sure it did (no typo, etc.). I swapped s3 for s3a and got a more informative error: the EMR cluster's EC2 instance role did not, in fact, have permission to read the object.
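Translated to your SparkR setup, the scheme swap would look something like this (same path as in your question; read.df rather than read.csv, since base R can't read from S3 either way):

# Same read, but through the Hadoop S3A connector; in my case a permissions
# problem surfaced here as an explicit AccessDenied rather than a generic
# "does not exist".
df <- read.df(
  "s3a://OurBucketName/Subdirectory/work/ProjectName/data/filename.csv",
  source = "csv",
  header = "true"
)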

So I think the easiest way for you to verify whether the same thing is happening in your case would be to SSH onto the leader node (e.g. using Session Manager) and try to read s3://OurBucketName/Subdirectory/work/ProjectName/data/filename.csv from there. If you are sure that S3 path exists, then I bet your case is the same as mine. You could also first try what I did: use the "s3a://..." path, which goes through a different connector (the open-source Hadoop S3A filesystem rather than EMR's EMRFS) and should hopefully give you a more informative exception.
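For completeness, once you're on the leader node you can run that check from the shell or, to stay in R, shell out to the AWS CLI (which comes preinstalled on EMR nodes); a sketch:

# Try to copy the object under the node's instance profile; an AccessDenied
# in the output would confirm the role is missing s3:GetObject on this path.
out <- system2(
  "aws",
  c("s3", "cp",
    "s3://OurBucketName/Subdirectory/work/ProjectName/data/filename.csv",
    "/tmp/filename.csv"),
  stdout = TRUE, stderr = TRUE
)
cat(out, sep = "\n")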