Applying sql-metadata in Python; sub-queries not working as expected

324 Views Asked by At

I'm trying to write a Python script which can take SQL code and generate create table statements for all tables CTEs created, as well as the final select. I'm using the sql-metadata package (https://pypi.org/project/sql-metadata/). I want to take those create table statements and easily port them into a visualization tool like diagrams.net to quickly spin up ERD-like visualizations of what information the query is grabbing. Just a little side project to brush up on Python (it's been a bit, sorry if the code is verbose), I'm thinking it could potentially help DB Admins/Analysts visually explain a bit more about the nuts/bolts of how they're thinking about a problem to less technical folks/management (or just help them communicate with other analysts faster).

I've written 3 test cases, and the 3rd (titled "sub_query") is not working properly. It doesn't seem to be quite catching the aliasing, as the last row in the create statement is calling for the name "value" even though I aliased the sub-query with the name "max_value". Any ideas on how I can fix this? Happy to give any more info that would be helpful!

I've attached the code block for the script below. I've tried playing with regexes, and I'm thinking maybe this can be solved recursively but I haven't had any luck yet:

from sql_metadata import Parser
from pprint import pprint

# Create example query
query_1 =  '''
with cte_1 as (
select id, a, b, c
from source_1),

cte_2 as (
select id, d, e, f as alias_test
from source_2)

select *
from cte_1 join cte_2
on cte_1.id = cte_2.id
'''

query_2 = '''
with cte_1 as (
select id, a, b, c
from source_1),

cte_2 as (
select id, d, e, f as alias_test
from source_2)

select *
from (select id, q from new_table where b < 3) a join cte_2
on a.id = cte_2.id
'''
# GIVING ME TROUBLE! The create table statement should be generating a column named "max_value", not just "value"!
sub_query = '''
with cte_1 as (
    select id, name from source_1
),

cte_2 as (
    select id, value from source_2
)

select cte_1.id, cte_1.name, (select max(value) from cte_2) as max_value
from cte_1
'''

I think we need to process all queries before the final select differently than the final query. For all CTEs, we can just grab the column names of the CTEs returned by the "with_names" method

def generate_creates(query):
    first_pass_parser = Parser(query)

    query_columns = first_pass_parser.columns
    if "*" in query_columns:
        query_columns.remove("*")
        
    query_ctes = first_pass_parser.with_names

    query_ctes_definitions = first_pass_parser.with_queries
    # Eventually, use nested dictionaries to contain more properties, like from, cleaner
    cte_column_dict = {}
    for cte in query_ctes_definitions:
        cte_column_dict[cte] = Parser(query_ctes_definitions[cte]).columns
    column_alias_dict = first_pass_parser.columns_aliases
    subqueries = first_pass_parser.subqueries



    # For the final query chunk (outside of all the CTEs), we'll separate it out,
    # give it a name, and process it separately
    final_query_code = query.split('\n\n')[-1]
    final_query_name = input("What should we name the final table?: ")
    print('\n')

    final_query_parser = Parser(final_query_code)

    my_lst = []
    for i in final_query_parser.tables:
        if i in cte_column_dict:
            my_lst += cte_column_dict[i]
    my_set = list(set(my_lst))


    final_query_dict = {}

    if "*" in final_query_code:
        final_query_dict[final_query_name] = my_set
    else:
        final_query_dict[final_query_name] = final_query_parser.columns


    Data_Type_Dict = {"1":"INT", "2":"DOUBLE", "3":"VARCHAR(255)", "4":"BOOLEAN", "5":"ARRAY", "6":"DICTIONARY"}


    # Now that we have each dictionary, we need to make create table statements for each
    for table in cte_column_dict:
        print(f"GENERATING CREATE STATEMENT FOR {table}")
        print('\n')
        print("DATA TYPE DICTIONARY: Choose the number that matches your column's data type")
        print('\n')
        print(Data_Type_Dict)
        print('\n')
        create_table_query = f"CREATE TABLE {table} (\n"
        for column_name in cte_column_dict[table]:
            for key, value in column_alias_dict.items():
                if column_name == value:
                    column_name = key
            while True:
                try:
                    data_type = Data_Type_Dict[input(f"select {column_name} data type: ")]
                    break
                except:
                    pass
                print("Please use a correct input!")
            create_table_query += f"    {column_name} {data_type},\n"
        create_table_query = create_table_query.rstrip(',\n') + "\n);"
        print('\n')
        print(create_table_query)
        print('\n')


    for table in final_query_dict:
        print(f"GENERATING CREATE STATEMENT FOR {table}")
        print('\n')
        print("DATA TYPE DICTIONARY: Choose the number that matches your column's data type")
        print('\n')
        print(Data_Type_Dict)
        print('\n')
        create_table_query = f"CREATE TABLE {table} (\n"
        for column_name in final_query_dict[table]:
            for key, value in column_alias_dict.items():
                if column_name == value:
                    column_name = key
            while True:
                try:
                    data_type = Data_Type_Dict[input(f"select {column_name} data type: ")]
                    break
                except:
                    pass
                print("Please use a correct input!")
            create_table_query += f"    {column_name} {data_type},\n"
        create_table_query = create_table_query.rstrip(',\n') + "\n);"
        print('\n')
        print(create_table_query)
   generate_creates(sub_query)

My output for sub_query (data type doesn't really matter, mainly names; I named the last table "final") looks like this:

CREATE TABLE cte_1 (
    id INT,
    name VARCHAR(255)
);

CREATE TABLE cte_2 (
    id INT,
    value DOUBLE
);

CREATE TABLE final (
    cte_1.id INT,
    cte_1.name VARCHAR(255),
    value DOUBLE
);

When it should look like this:

CREATE TABLE cte_1 (
    id INT,
    name VARCHAR(255)
);

CREATE TABLE cte_2 (
    id INT,
    value DOUBLE
);

CREATE TABLE final (
    cte_1.id INT,
    cte_1.name VARCHAR(255),
    **max_value DOUBLE**
);
0

There are 0 best solutions below