Glue PySpark - Schema validation of each record in a DynamicFrame


I need to validate the schema of each file in a DynamicFrame that I am reading from S3 into Glue.

How do I efficiently validate the schema of every record?

I tried converting the DynamicFrame to a DataFrame and validating each record against a schema with the jsonschema Python library:

# collect() pulls every record to the driver, which will not scale for large data
for record in dynamic_frame.toDF().collect():
    record = record.asDict()
    jsonschema.validate(record, schema)

The flaw here is that if one record has columns

[A:1,B:2,C:3] 

and another has

[A:11,B:22,C:33,D:44] 

then converting them to a DataFrame turns the first record into

[A:1,B:2,C:3,D:None] 

but the first record originally didn't have that "D" column, which makes schema validation difficult.
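One way to sidestep the schema unification is to validate the raw JSON records before any DataFrame conversion (in Glue you could obtain the raw lines with `spark.sparkContext.textFile` instead of `toDF()`). A minimal sketch with stdlib `json` only, where the two sample lines are assumed stand-ins for the S3 file contents:

```python
import json

# Hypothetical raw lines as they would come out of the S3 files.
raw_lines = [
    '{"A": 1, "B": 2, "C": 3}',
    '{"A": 11, "B": 22, "C": 33, "D": 44}',
]

# Parsing each line independently keeps exactly the keys present in
# that record, so no "D": None gets injected into the first one.
records = [json.loads(line) for line in raw_lines]

print("D" in records[0])  # False
print("D" in records[1])  # True
```

Because each record keeps only its own keys, a check like `additionalProperties` in jsonschema can now detect the extra "D" column on the second record.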

I want the validation to cover datatype changes, additional-column checks, and length checks.
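Those three checks can be sketched with a small hand-rolled validator (the jsonschema library expresses the same ideas with "type", "additionalProperties": false and "maxLength"); the schema dict below is an assumed example, not the asker's real schema:

```python
# Assumed example schema: expected Python type per column, plus an
# optional max_length rule for string columns.
schema = {
    "A": {"type": int},
    "B": {"type": int},
    "C": {"type": str, "max_length": 5},
}

def validate(record, schema):
    errors = []
    # additional-columns check: reject keys not declared in the schema
    extra = set(record) - set(schema)
    if extra:
        errors.append(f"unexpected columns: {sorted(extra)}")
    for key, rules in schema.items():
        if key not in record:
            errors.append(f"missing column: {key}")
            continue
        value = record[key]
        # datatype check
        if not isinstance(value, rules["type"]):
            errors.append(f"{key}: expected {rules['type'].__name__}")
        # length check (only meaningful for strings here)
        elif "max_length" in rules and len(value) > rules["max_length"]:
            errors.append(f"{key}: longer than {rules['max_length']}")
    return errors

print(validate({"A": 1, "B": 2, "C": "abc"}, schema))          # []
print(validate({"A": 1, "B": 2, "C": "abc", "D": 4}, schema))  # flags "D"
```

Running `validate` per record (for example inside a `mapPartitions` over the raw data) keeps the work distributed instead of collecting everything to the driver.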

Please help me out here; any suggestions are welcome. Thanks.

I have JSON files in S3.
