I'm trying to set up a connection to AWS Redshift from the Great Expectations framework (GE) according to the tutorial, using Python, and I'm facing two issues:
- When I use `postgresql+psycopg2` as the driver in the connection string in step 5, adding the datasource (`context.add_datasource(**datasource_config)`, see the first sketch after this list) takes extremely long (up to 20 minutes!). Validating expectations afterwards works as expected and even runs quite fast. I'm assuming the huge amount of time needed is due to the size of the Redshift cluster I'm connecting to (more than 1000 schemas) and the `postgresql` driver not being optimized for Redshift.
- In search of alternatives to the `postgresql` driver I came across the `sqlalchemy-redshift` driver. Changing it in the connection string (`redshift+psycopg2`) adds the datasource instantly; however, validating some expectations (e.g. `expect_column_values_to_not_be_null`) fails! After some digging through the code I realized it might be due to GE creating a temporary table from the SQL query. So when I specify the query (the second sketch after this list shows how I run it):

  ```sql
  select * from my_redshift_schema.my_table;
  ```

  GE actually seems to run something like:

  ```sql
  CREATE TEMPORARY TABLE "ge_temp_bf3cbfa2" AS select * from my_redshift_schema.my_table;
  ```

  For certain expectations `sqlalchemy-redshift` then tries to find information about the columns of the table; however, it searches for the name of the temporary table and not the actual one I specified in the SQL query. It consequently fails, as it obviously can't find a table with that name in the Redshift cluster. More specifically, it results in a `KeyError` in the `dialect.py` file within `sqlalchemy-redshift`:

  ```
  File ".venv/lib/python3.8/site-packages/sqlalchemy_redshift/dialect.py", line 819, in _get_redshift_columns
      return all_schema_columns[key]
  KeyError: RelationKey(name='ge_temp_bf3cbfa2', schema='public')
  ```
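For reference, here is roughly how I add the datasource, following the tutorial. This is only a minimal sketch: the connection string, host, credentials, and all names (`my_redshift_datasource`, `my_db`, ...) are placeholders for my real values.

```python
import great_expectations as ge

context = ge.get_context()

# Placeholder connection string; swapping "postgresql+psycopg2" for
# "redshift+psycopg2" is the only change between the two options.
connection_string = (
    "postgresql+psycopg2://user:password"
    "@my-cluster.abc123.eu-west-1.redshift.amazonaws.com:5439/my_db"
)

datasource_config = {
    "name": "my_redshift_datasource",
    "class_name": "Datasource",
    "execution_engine": {
        "class_name": "SqlAlchemyExecutionEngine",
        "connection_string": connection_string,
    },
    "data_connectors": {
        "default_runtime_data_connector_name": {
            "class_name": "RuntimeDataConnector",
            "batch_identifiers": ["default_identifier_name"],
        },
    },
}

# This is the call that takes up to 20 minutes with postgresql+psycopg2,
# but returns instantly with redshift+psycopg2.
context.add_datasource(**datasource_config)
```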
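And this is roughly how I run the failing expectation, again as a minimal sketch with placeholder names (`my_suite`, `my_column`):

```python
from great_expectations.core.batch import RuntimeBatchRequest

# The query from above is passed in as a runtime parameter.
batch_request = RuntimeBatchRequest(
    datasource_name="my_redshift_datasource",
    data_connector_name="default_runtime_data_connector_name",
    data_asset_name="my_table",  # arbitrary label for this batch
    runtime_parameters={"query": "select * from my_redshift_schema.my_table"},
    batch_identifiers={"default_identifier_name": "default_identifier"},
)

context.create_expectation_suite("my_suite", overwrite_existing=True)
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="my_suite",
)

# Passes with postgresql+psycopg2 (after the slow setup); raises the
# KeyError shown above with redshift+psycopg2.
validator.expect_column_values_to_not_be_null(column="my_column")
```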
Has anyone succeeded in running GE on Redshift? How could I mitigate the issues I'm facing (make option 1 faster, or fix the error in option 2)?