AWS Glue: Rename_field() does not work after relationalize

2.9k Views Asked by At

I got a job that needs to perform the following task

  1. Relationalize the data
  2. Rename the field names that contains the '.'s so that it can be imported into PostgreSQL as normal looking field name.

Here is the code

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "gluecatalog", table_name = "fcorders", transformation_ctx = "datasource0")
rootTableName = 'orders' 

dfc = Relationalize.apply(frame = datasource0, staging_path = "s3://my-bucket/temp/", name = rootTableName, transformation_ctx = "dfc")
dfc.keys()
for df_name in dfc.keys():
        m_df = dfc.select(df_name)
        print "Writing to Postgre table: ", df_name
        if (df_name <> rootTableName):
            renamefields4 = m_df.rename_field("SalesDeliveryLines.val.shipped.unitDisplayCode", "shipped_unitDisplayCode")
        else:
            renamefields4 = RenameField.apply(frame = m_df, old_name = "vehicle.sourceReccordUID", new_name = "vehicle_sourceReccordUID", transformation_ctx = "renamefields4")
        renamefields4.printSchema()

The printSchema() displays the schema as unchanged. If I write to the database, the field names still contain '.'s.

If I uses ApplyMapping.apply() to change the field name before relationalize, it makes the child table disappear. If I use ApplyMapping.apply() after relationalize it simply deletes all the fields whose name contains '.'.

The bottom line is I cannot relationalize and rename field in the same job no matter what I try.

Did I miss something or is this a AWS Glue bug?

2

There are 2 best solutions below

1
On BEST ANSWER

It is confirmed the malfunction of rename_field() and RenameField.apply() is a Glue bug.

The work-around I have so far is to convert DynamicFrame to DataFrame -> rename the fields DataFrame -> Convert it back to DynamicFrame.

Here is the code

    new_df = m_df.toDF()
    print (type( new_df))
    for oldName in new_df.schema.names:
      new_df = new_df.withColumnRenamed(oldName, oldName.replace("SalesDeliveryLines.val.","").replace(".","_"))
    m_df = m_df.fromDF(new_df, glueContext, "m_df")
1
On

You need to place back-ticks around the field name:

m_df.rename_field("`SalesDeliveryLines.val.shipped.unitDisplayCode`", "shipped_unitDisplayCode")

You could find more info on the AWS Glue documentation