I'm trying to write a process using SQLParse which lists the tables present in an SQL statement, currently focusing on the 'FROM' clause of the query. I'm also trying identify nested queries (or subqueries) within the FROM clause and run the process again to identify tables within that nested query.
Using this example query
from sqlparse.sql import IdentifierList, Identifier, Function, Where, Parenthesis, TokenList
from sqlparse.tokens import Keyword, DML, Punctuation
sql_2 = """select * from luv_main.test_table left join (select * from luv_all.fake_Table where (a = b)) x where a = 4 order by A, B, C"""
Below is the code, which is working:
full_tables = []
tables = []
from_seen = False
for item in parsed.tokens:
#stop the process if the Where statement is reached
if isinstance(item, Where):
from_seen = False
if from_seen:
#multiple tables with Join statements in between, or one table. Doesn't consider subqueries
if isinstance(item, Identifier):
#checks to see if there is a parenthesis, meaning a subquery
if 'SELECT' in item.value.upper():
subquery = item.value
#returns the db name
tables.append(item.get_parent_name())
#returns the table name
tables.append(item.get_real_name())
#returns the alias
tables.append(item.get_alias())
full_tables.append(tables)
tables = []
# if multiple tables separated by comma's will be an identifier list. Doesn't consider subqueries
if isinstance(item, IdentifierList):
for identifier in item.get_identifiers():
#returns the db name
tables.append(identifier.get_parent_name())
#returns the table name
tables.append(identifier.get_real_name())
#returns the alias
tables.append(identifier.get_alias())
full_tables.append(tables)
tables = []
else:
if item.ttype is Keyword and item.value.upper() == 'FROM':
from_seen = True
print(full_tables)
print(len(full_tables))
This starts with the query as is, and also identifies the subquery by searching for the word select, I then have this.
#process of removing outer-most parentheses and identifying aliases that sit outside that window
#new subquery string ready to parse
res_sub = ""
#capture the alias
alias = ""
#record the number of parentheses as they open and close
paren_cnt = 0
for char in subquery:
#if ( and there's already been a ( , include it
if char == '(' and paren_cnt > 0:
res_sub += char
#if (, add to the count
if char == '(':
paren_cnt += 1
# if ) and there's at least 2 (, include it
if char == ')' and paren_cnt > 1:
res_sub += char
# if ), subtract from the count
if char == ')':
paren_cnt -= 1
# capture the script
if char != '(' and char != ')' and paren_cnt >0:
res_sub += char
# capture the alias
if char != '(' and char != ')' and char != ' ' and paren_cnt == 0:
alias += char
subparsed = sqlparse.parse(res_sub)[0]
This then removes the outermost parentheses and parses as a new SQL statement. This is all working, and if I run this parsed statement through the previous block of code manually it works as expected.
I've then tried to put this into separate functions:
- First parses the query and calls:
- A function that scans the FROM clause and returns the tables, but if it identifies a subquery, it calls:
- A function which removes the outermost parentheses of the script and then calls the first function to send it back through the process.
But what happens is when it tries to run sqlparse.parse(res_sub)[0]
is tuple index out of range. It shouldn't be a tuple, it should be a str which is then parsed into sqlparse.sql.Statement.
I don't understand why it's behaving differently just because I've put it into a series of functions. Functions code below:
def parse(sql):
parsed = sqlparse.parse(sql)[0]
#call function to assess the FROM statement of the query
assess_from_clause(parsed)
def assess_from_clause(parsed):
full_tables = []
tables = []
from_seen = False
for item in parsed.tokens:
#stop the process if the Where statement is reached
if isinstance(item, Where):
from_seen = False
#checks to see if there is a parenthesis, meaning a subquery
if 'SELECT' in item.value.upper():
subquery = item.value
subquery_parsing(subquery)
if from_seen:
#multiple tables with Join statements in between, or one table. Doesn't consider subqueries
if isinstance(item, Identifier):
#returns the db name
tables.append(item.get_parent_name())
#returns the table name
tables.append(item.get_real_name())
#returns the alias
tables.append(item.get_alias())
full_tables.append(tables)
tables = []
# if multiple tables separated by comma's will be an identifier list. Doesn't consider subqueries
if isinstance(item, IdentifierList):
for identifier in item.get_identifiers():
#returns the db name
tables.append(identifier.get_parent_name())
#returns the table name
tables.append(identifier.get_real_name())
#returns the alias
tables.append(identifier.get_alias())
full_tables.append(tables)
tables = []
else:
if item.ttype is Keyword and item.value.upper() == 'FROM':
from_seen = True
print(full_tables)
def subquery_parsing(subquery):
#new subquery string ready to parse
res_sub = ''
#capture the alias
alias = ''
#record the number of parentheses as they open and close
paren_cnt = 0
for char in subquery:
#if ( and there's already been a ( , include it
if char == '(' and paren_cnt > 0:
res_sub += char
#if (, add to the count
if char == '(':
paren_cnt += 1
# if ) and there's at least 2 (, include it
if char == ')' and paren_cnt > 1:
res_sub += char
# if ), subtract from the count
if char == ')':
paren_cnt -= 1
# capture the script
if char != '(' and char != ')' and paren_cnt >0:
res_sub += char
# capture the alias
if char != '(' and char != ')' and char != ' ' and paren_cnt == 0:
alias += char
parse(res_sub)
I should stress I'm not proficient in Python and very much learning as I go!
Thanks
I believe I've resolved it now, the section that triggers the third function was triggering too early and not parsing the subquery of the code.
I've changed from this:
To this:
Sorry it's very much trial and error learning for me at the moment, thanks Barmar for your comments.