I am currently trying to find a way to convert a raw, unstructured (string) Hive column definition into an iterable/traversable data object in Python.
Given this input:
{
  "name": "resource",
  "type": "struct<attributes:struct<service.namespace:string,service.name:string>>"
}
I would like to produce output similar to this:
{
  "name": "resource",
  "type": "struct",
  "fields": [
    {
      "name": "attributes",
      "type": "struct",
      "fields": [
        {
          "name": "service.namespace",
          "type": "string"
        },
        {
          "name": "service.name",
          "type": "string"
        }
      ]
    }
  ]
}
Then I would like to make some modifications to it (e.g. add a column) and dump it back, e.g. like this, where resource.attributes now contains a new column foo of type string:
{
  "name": "resource",
  "type": "struct<attributes:struct<service.namespace:string,service.name:string,foo:string>>"
}
Some additional concerns:
- I am not connected to any database at all, so SQL is not an option
- Using external dependencies is fine, as long as they do not come with a huge footprint (pyspark, for one, would be a no-go)
- The round trip should be fast (sub-second)
Looking at this task, several things immediately come to mind, for instance:
- lexer & tokens
- parser & parse trees
- ASTs
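For illustration, the lexer step alone might look roughly like the sketch below. It is not a complete Hive type lexer: `TOKEN_RE` and `tokenize` are made-up names, and it assumes the only structural characters are `<`, `>`, `:` and `,` (so parameterized types such as `decimal(10,2)` would not be split correctly).

```python
import re

# Structural punctuation is matched one character at a time;
# anything else (including dots in names) is kept together as a name token.
TOKEN_RE = re.compile(r"[<>:,]|[^<>:,\s]+")

def tokenize(type_string):
    """Split a Hive type string into punctuation and name tokens."""
    return TOKEN_RE.findall(type_string)

tokens = tokenize("struct<attributes:struct<service.namespace:string>>")
print(tokens)
# ['struct', '<', 'attributes', ':', 'struct', '<',
#  'service.namespace', ':', 'string', '>', '>']
```

For a grammar this small, a single regular expression may already be enough of a lexer.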
I am quite familiar with working with ASTs (writing custom code validation rules using node visitors, using node transformers for runtime code manipulation, or round-tripping using tools like astor), but I've never been far enough down the rabbit hole to get in touch with things like lexers, parsers, ply, yacc etc.
And I'm not quite sure how far down that rabbit hole I actually need to go.
I'm not asking for a completely functional answer :) In essence, I would just like to know the concepts and tools I really need to accomplish this task :)
In a best-case scenario, I would like to write some code which provides that round-tripping functionality alongside some sort of typed class hierarchy with convenience functions (schema traversal, comparison etc.), to accomplish a control flow similar to this one:
- parse raw column type into some nested class hierarchy
- inspect the class hierarchy, do some modifications to it
- dump the result back to the unstructured string
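The three steps above could be sketched as a small hand-written recursive-descent parser over a dataclass, using only the standard library. This is a rough sketch under stated assumptions, not a full Hive type parser: `Column`, `parse_column`, `dump_type` and the helper functions are hypothetical names, only `struct` nesting is handled, every other type name is treated as an opaque primitive (no `array<...>`, `map<...>` or `decimal(p,s)`), and error handling is minimal.

```python
import re
from dataclasses import dataclass, field
from typing import List

TOKEN_RE = re.compile(r"[<>:,]|[^<>:,\s]+")

@dataclass
class Column:
    name: str
    type: str  # "struct" or a primitive type name (assumption: nothing else)
    fields: List["Column"] = field(default_factory=list)

    def dump_type(self) -> str:
        """Serialize this column's type back to the raw type string."""
        if self.type != "struct":
            return self.type
        inner = ",".join(f"{c.name}:{c.dump_type()}" for c in self.fields)
        return f"struct<{inner}>"

def _expect(tokens, pos, expected):
    """Check that tokens[pos] is the expected token; return the next position."""
    if pos >= len(tokens) or tokens[pos] != expected:
        raise ValueError(f"expected {expected!r} at token {pos}")
    return pos + 1

def _parse_type(name, tokens, pos):
    """Parse one type starting at tokens[pos]; return (Column, next position)."""
    type_name = tokens[pos]
    pos += 1
    if type_name != "struct":
        return Column(name, type_name), pos
    children = []
    pos = _expect(tokens, pos, "<")
    while True:
        child_name = tokens[pos]
        pos = _expect(tokens, pos + 1, ":")
        child, pos = _parse_type(child_name, tokens, pos)
        children.append(child)
        if pos < len(tokens) and tokens[pos] == ",":
            pos += 1  # another field follows
            continue
        pos = _expect(tokens, pos, ">")
        return Column(name, "struct", children), pos

def parse_column(name: str, type_string: str) -> Column:
    """Parse a raw type string into a nested Column hierarchy."""
    tokens = TOKEN_RE.findall(type_string)
    column, pos = _parse_type(name, tokens, 0)
    if pos != len(tokens):
        raise ValueError(f"unexpected trailing token {tokens[pos]!r}")
    return column

# 1. parse, 2. modify, 3. dump back
col = parse_column(
    "resource",
    "struct<attributes:struct<service.namespace:string,service.name:string>>",
)
col.fields[0].fields.append(Column("foo", "string"))
print(col.dump_type())
# struct<attributes:struct<service.namespace:string,service.name:string,foo:string>>
```

Because dataclasses generate `__eq__`, two parsed schemas can be compared with `==`, and traversal is plain recursion over `fields`. Whether a hand-written parser like this or a grammar-based tool is the better fit is exactly the open question here.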
I started some research and experimentation with lexers, tokens and parsers, but I'm not quite sure how to apply those techniques to this challenge, or how much of that I really need.