Amazon Glue - load to Redshift failures with decimal fields

Question

Amazon Glue - load to Redshift failures with decimal fields

64 Views Asked by Michał Malus At 17 March 2024 at 20:27

I have a very simple Glue job loading data from S3 to Redshift, with a Transform in between to rename fields and change their types:

First execution runs (almost) without issues - data gets loaded into Redshift. All following executions fail. The reason is, Glue properly creates the Redshift table (first load) but incorrectly handles it when it already exists.

This happens for all fields converted to decimal (did not test all other types though).

CSV file:

Text value,Average whatever,Another string,Just a number
A1,2.2,test,5
A2,5,test2,7

Transform (change schema):

Generated code (I did not edit the code, it's still a "Visual" job):

...
# Script generated for node Amazon S3
AmazonS3_node1710618800725 = glueContext.create_dynamic_frame.from_options(format_options={"quoteChar": "\"", "withHeader": True, "separator": ","}, connection_type="s3", format="csv", connection_options={"paths": ["s3://<source-s3-bucket>/test/gonna_fail/data.csv"]}, transformation_ctx="AmazonS3_node1710618800725")

# Script generated for node Change Schema
ChangeSchema_node1710691042153 = ApplyMapping.apply(frame=AmazonS3_node1710618800725, mappings=[("Text value", "string", "text_value", "string"), ("Average whatever", "string", "average_whatever", "decimal"), ("Another string", "string", "another_string", "string"), ("Just a number", "string", "just_a_number", "decimal")], transformation_ctx="ChangeSchema_node1710691042153")

# Script generated for node Amazon Redshift
AmazonRedshift_node1710618808047 = glueContext.write_dynamic_frame.from_options(frame=ChangeSchema_node1710691042153, connection_type="redshift", connection_options={"redshiftTmpDir": "s3://aws-glue-assets-xxx-eu-central-1/temporary/", "useConnectionProperties": "true", "dbtable": "raw_data.gonna_fail", "connectionName": "serverless-redshift", "preactions": "DROP TABLE IF EXISTS raw_data.gonna_fail; CREATE TABLE IF NOT EXISTS raw_data.gonna_fail (text_value VARCHAR, average_whatever DECIMAL, another_string VARCHAR, just_a_number DECIMAL);"}, transformation_ctx="AmazonRedshift_node1710618808047")

First execution (all queries come from Redshift's log):

7:26:26 PM  CREATE TABLE IF NOT EXISTS "raw_data"."gonna_fail" ("text_value" VARCHAR(MAX), "average_whatever" DECIMAL(10,2), "another_string" VARCHAR(MAX), "just_a_number" DECIMAL(10,2)) DISTSTYLE EVEN

Table gets properly created. After that there is an error - "Expected command status BEGIN, got CREATE TABLE" - (how to avoid it?) but the job retries successfully after 30 seconds:

7:26:56 PM  CREATE TABLE IF NOT EXISTS "raw_data"."gonna_fail" ("text_value" VARCHAR(MAX), "average_whatever" DECIMAL(10,2), "another_string" VARCHAR(MAX), "just_a_number" DECIMAL(10,2)) DISTSTYLE EVEN 
7:26:56 PM  DROP TABLE IF EXISTS raw_data.gonna_fail 
7:26:56 PM   CREATE TABLE IF NOT EXISTS raw_data.gonna_fail (text_value VARCHAR, average_whatever DECIMAL, another_string VARCHAR, just_a_number DECIMAL) 
7:26:56 PM  COPY "raw_data"."gonna_fail" ("text_value","average_whatever","another_string","just_a_number") FROM 's3://aws-glue-assets-xxx-eu-central-1/temporary/63e30430-67f0-4ab2-b539-22180ae2920b/manifest.json' FORMAT AS CSV NULL AS '@NULL@' manifest CREDENTIALS ''

Second execution:

For each decimal field a new one is created, with a name being concatenated name and data type:

7:29:19 PM  ALTER TABLE raw_data.gonna_fail add "average_whatever_decimal(10,2)" DECIMAL(10,2) default NULL; 
7:29:19 PM   ALTER TABLE raw_data.gonna_fail add "just_a_number_decimal(10,2)" DECIMAL(10,2) default NULL;

This load also fails (did not check why) and retries after 30 seconds:

Create table gets executed (not sure why create statement is executed two times, "automatically", and using preactions):

7:29:54 PM  CREATE TABLE IF NOT EXISTS "raw_data"."gonna_fail" ("text_value" VARCHAR(MAX), "average_whatever" DECIMAL(10,2), "another_string" VARCHAR(MAX), "just_a_number" DECIMAL(10,2), "just_a_number_decimal(10,2)" DECIMAL(10,2), "average_whatever_decimal(10,2)" DECIMAL(10,2)) DISTSTYLE EVEN

Preactions:

7:29:54 PM  DROP TABLE IF EXISTS raw_data.gonna_fail 
7:29:54 PM   CREATE TABLE IF NOT EXISTS raw_data.gonna_fail (text_value VARCHAR, average_whatever DECIMAL, another_string VARCHAR, just_a_number DECIMAL)

Incorrect copy statement:

7:29:54 PM COPY "raw_data"."gonna_fail" ("text_value","average_whatever","another_string","just_a_number","just_a_number_decimal(10,2)","average_whatever_decimal(10,2)") FROM 's3://aws-glue-assets-xxx-eu-central-1/temporary/d46ca4ae-86cc-4444-addd-6c54c376a2a1/manifest.json' FORMAT AS CSV NULL AS '@NULL@' manifest CREDENTIALS ''

This operations fails, Spark retries 3 times and fails the load. Error visible in Glue:

Caused by: com.amazon.redshift.util.RedshiftException: ERROR: column "just_a_number_decimal(10,2)" of relation "gonna_fail" does not exist

I did not find those additional/incorrect fields in the frame's .schema().fields.

Original Q&A

There are 1 best solutions below

**Bogdan** · Answer 1 · 2024-03-22T13:28:35.503000

I dont think there is much you can do from the visual tool. Clone the job and update the script as following :

AmazonRedshift_node1710618808047 = glueContext.write_dynamic_frame.from_options(frame=ChangeSchema_node1710691042153, 
connection_type="redshift", 
connection_options={"redshiftTmpDir": "s3://aws-glue-assets-xxx-eu-central-1/temporary/", 
"dbtable": "raw_data.temp_gonna_fail", 
"connectionName": "serverless-redshift", 
"preactions": "DROP TABLE IF EXISTS raw_data.gonna_fail; CREATE TABLE IF NOT EXISTS raw_data.gonna_fail (text_value VARCHAR, average_whatever DECIMAL, another_string VARCHAR, just_a_number DECIMAL);",
"postactions": "BEGIN; INSERT INTO raw_data.gonna_fail SELECT  * from raw_data.temp_gonna_fail; drop table if exists raw_data.temp_gonna_fail; END;"
}, 
transformation_ctx="AmazonRedshift_node1710618808047"),

Added "postactions" to the connection_options and removed "useConnectionProperties": "true" (if the script fails because of this, add it again)

Amazon Glue - load to Redshift failures with decimal fields

There are 1 best solutions below

Related Questions in AMAZON-REDSHIFT

Related Questions in AWS-GLUE

Related Questions in AMAZON-REDSHIFT-SERVERLESS

Trending Questions

Popular # Hahtags

Popular Questions