Glue PySpark - Schema validation of each record in a DynamicFrame


I need to validate the schema of each file in a DynamicFrame that I am reading from S3 into Glue.

How do I efficiently validate the schema of every record?

I tried converting the DynamicFrame to a DataFrame and validating each record against a schema with the jsonschema Python library:

# collect() pulls every record to the driver, which will not scale for large data
for record in dynamic_frame.toDF().collect():
    record = record.asDict()
    jsonschema.validate(record, schema)

The flaw here is that if one record has columns

[A:1,B:2,C:3] 

and another has

[A:11,B:22,C:33,D:44] 

then converting them to a DataFrame turns the first record into

[A:1,B:2,C:3,D:None] 

but the first record originally didn't have that "D" column, which makes schema validation difficult.
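One way to sidestep the schema unification is to validate the raw JSON records before any DataFrame conversion (in Glue you could obtain the raw lines with `spark.sparkContext.textFile` instead of `toDF()`). A minimal sketch with stdlib `json` only, where the two sample lines are assumed stand-ins for the S3 file contents:

```python
import json

# Hypothetical raw lines as they would come out of the S3 files.
raw_lines = [
    '{"A": 1, "B": 2, "C": 3}',
    '{"A": 11, "B": 22, "C": 33, "D": 44}',
]

# Parsing each line independently keeps exactly the keys present in
# that record, so no "D": None gets injected into the first one.
records = [json.loads(line) for line in raw_lines]

print("D" in records[0])  # False
print("D" in records[1])  # True
```

Because each record keeps only its own keys, a check like `additionalProperties` in jsonschema can now detect the extra "D" column on the second record.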

I want the validation to cover datatype changes, additional-column checks, and length checks.
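Those three checks can be sketched with a small hand-rolled validator (the jsonschema library expresses the same ideas with "type", "additionalProperties": false and "maxLength"); the schema dict below is an assumed example, not the asker's real schema:

```python
# Assumed example schema: expected Python type per column, plus an
# optional max_length rule for string columns.
schema = {
    "A": {"type": int},
    "B": {"type": int},
    "C": {"type": str, "max_length": 5},
}

def validate(record, schema):
    errors = []
    # additional-columns check: reject keys not declared in the schema
    extra = set(record) - set(schema)
    if extra:
        errors.append(f"unexpected columns: {sorted(extra)}")
    for key, rules in schema.items():
        if key not in record:
            errors.append(f"missing column: {key}")
            continue
        value = record[key]
        # datatype check
        if not isinstance(value, rules["type"]):
            errors.append(f"{key}: expected {rules['type'].__name__}")
        # length check (only meaningful for strings here)
        elif "max_length" in rules and len(value) > rules["max_length"]:
            errors.append(f"{key}: longer than {rules['max_length']}")
    return errors

print(validate({"A": 1, "B": 2, "C": "abc"}, schema))          # []
print(validate({"A": 1, "B": 2, "C": "abc", "D": 4}, schema))  # flags "D"
```

Running `validate` per record (for example inside a `mapPartitions` over the raw data) keeps the work distributed instead of collecting everything to the driver.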

Please help me out here; any suggestions are welcome. Thanks.

I have JSON files in S3.
