I need to process some data of a certain flow that I have in a specific folder in a bucket S3
. I want to do this in Python
. After searching for a while I found the library PyAthena
which exactly what I was looking for!
I installed the version 1.8.0
of PyAthena
.
For your information, I have my S3
bucket in the region of Paris eu-west-3
and my Athena
database in the region of Francfort eu-central-1
.
I used the following code that I found in the documentation PyAthena Doc:
from pyathena import connect
cursor = connect(aws_access_key_id='YOUR_ACCESS_KEY_ID',
aws_secret_access_key='YOUR_SECRET_ACCESS_KEY',
s3_staging_dir='s3://YOUR_S3_BUCKET/path/to/',
region_name='us-west-2').cursor()
cursor.execute("SELECT * FROM one_row")
print(cursor.description)
print(cursor.fetchall())
I was at first not sure about which region_name
to use, if it should be the one of Paris
where the S3
bucket is, or the one of Francfort
where the Athena
database is!!
I tried both and following the error messages I got, I ended up using the one of my S3
bucket!
However I kept getting errors of permissions about Glue
, something like:
pyathena.error.OperationalError: Insufficient permissions to execute the query. Error retrieving table : master in database : default due to : User: arn:aws:iam::<my-account-client-ID>:user/s3-test is not authorized to perform: glue:GetTable on resource: arn:aws:glue:eu-west-3:<my-account-client-ID>:catalog
So I added the following strategy in the IAM
:
{
"Sid": "VisualEditor2",
"Effect": "Allow",
"Action": [
"athena:StartQueryExecution",
"athena:GetQueryResultsStream",
"athena:GetQueryResults",
"athena:DeleteNamedQuery",
"athena:GetNamedQuery",
"athena:*",
"athena:ListQueryExecutions",
"athena:ListNamedQueries",
"athena:CreateNamedQuery",
"athena:StopQueryExecution",
"athena:GetQueryExecution",
"athena:BatchGetNamedQuery",
"athena:BatchGetQueryExecution"
],
"Resource": "*"
},
{
"Sid": "VisualEditor3",
"Effect": "Allow",
"Action": [
"glue:GetTable",
"glue:GetTables",
"glue:GetDatabase"
],
"Resource": [
"arn:aws:glue:eu-west-3:<my-account-client-ID>:catalog",
"arn:aws:glue:eu-west-3:<my-account-client-ID>:database/*",
"arn:aws:glue:eu-west-3:<my-account-client-ID>:table/*/*"
]
}
Now I have this error message:
cursor.execute("select * from master")
File "/home/ubuntu/.local/lib/python3.6/site-packages/pyathena/util.py", line 28, in _wrapper
return wrapped(*args, **kwargs)
File "/home/ubuntu/.local/lib/python3.6/site-packages/pyathena/cursor.py", line 57, in execute
raise OperationalError(query_execution.state_change_reason)
pyathena.error.OperationalError: SYNTAX_ERROR: line 1:15: Schema default does not exist
The problem is the select statement: if you do not indicate it you will use the database default and if you do not have such database in your environment it will fail. You should indicate your database and table:
Or you could also specify the database name (or schema name) with a parameter in the cursor function:
If you do one of that you should not get the same error anymore.