, "keyContext": "keyContextValue": "https://access.dummy.com/training/date/last_modified_dtm"" /> , "keyContext": "keyContextValue": "https://access.dummy.com/training/date/last_modified_dtm"" /> , "keyContext": "keyContextValue": "https://access.dummy.com/training/date/last_modified_dtm""/>

Extract key value pair in Spark SQL where the key is a URL and value is a string

56 Views Asked by At

I have a JSON object as below

"data": [
    {
        "keyId": <guid>,
        "keyContext":
            "keyContextValue":
                "https://access.dummy.com/training/date/last_modified_dtm":"2024-02-01 17:02:41",
                "https://access.dummy.com/training/id/training_id":"300",
                "https://access.dummy.com/training/name/training_name":"Dummy Training"
    }
]

I am able to extract the values until KeyContext:KeyContextValue Array using get_json_object function in Spark SQL. But My requirement is to extract the actual values in this array to show as below.

keyId last_modified_dtm training_id training_name
2024-02-01 300 Dummy Training

Since the key is a URL, I am not sure how to extract the value. Can someone please assist on how to use Spark SQL to achieve this.

1

There are 1 best solutions below

0
s.polam On

Please find the simple logic. tried in differently.

All you need is

  • Schema to extract required columns from json object.

    'data ARRAY<STRUCT<keyContext: STRUCT<keyContextValue: MAP<STRING, STRING>>, keyId: STRING>>'

  • Schema for final output columns.

    'last_modified_dtm timestamp, training_id int, training_name string'

And then make use of built in functions like

  • inline to flatten array
  • from_json to extract columns from json
  • from_csv to convert desired output.
SELECT    
    inline(
        transform(
            from_json(
                data, 
                'data ARRAY<STRUCT<keyContext: STRUCT<keyContextValue: MAP<STRING, STRING>>, keyId: STRING>>'
            ).data.keyContext.keyContextValue,
            e -> from_csv(
                concat_ws(',', map_values(e)),
                'last_modified_dtm timestamp, training_id int, training_name string'
            )
        )
    )
FROM <table_name>