Tuple Index Out of Range - but only when run in a function

258 Views Asked by At

I'm trying to write a process using SQLParse which lists the tables present in an SQL statement, currently focusing on the 'FROM' clause of the query. I'm also trying identify nested queries (or subqueries) within the FROM clause and run the process again to identify tables within that nested query.

Using this example query

from sqlparse.sql import IdentifierList, Identifier, Function, Where, Parenthesis, TokenList
from sqlparse.tokens import Keyword, DML, Punctuation

sql_2 = """select * from luv_main.test_table left join (select * from luv_all.fake_Table where (a = b)) x  where a = 4 order by A, B, C"""

Below is the code, which is working:

full_tables = []
tables = []

from_seen = False
for item in parsed.tokens:
    
    #stop the process if the Where statement is reached
    if isinstance(item, Where):
            from_seen = False
    
    if from_seen:
     
        #multiple tables with Join statements in between, or one table. Doesn't consider subqueries
        if isinstance(item, Identifier):
            
            #checks to see if there is a parenthesis, meaning a subquery 
            if 'SELECT' in item.value.upper():
                subquery = item.value
                
            #returns the db name 
            tables.append(item.get_parent_name())
            
            #returns the table name
            tables.append(item.get_real_name())

            #returns the alias
            tables.append(item.get_alias())
            
            full_tables.append(tables)
            tables = []
        
        
        # if multiple tables separated by comma's will be an identifier list. Doesn't consider subqueries
        if isinstance(item, IdentifierList):
            for identifier in item.get_identifiers():
                #returns the db name
                tables.append(identifier.get_parent_name())
                
                #returns the table name
                tables.append(identifier.get_real_name())
                
                #returns the alias
                tables.append(identifier.get_alias())
                
                full_tables.append(tables)
                tables = []
                 
    else:
        if item.ttype is Keyword and item.value.upper() == 'FROM':
            from_seen = True
       
print(full_tables)
print(len(full_tables))

This starts with the query as is, and also identifies the subquery by searching for the word select, I then have this.

#process of removing outer-most parentheses and identifying aliases that sit outside that window

#new subquery string ready to parse
res_sub = ""

#capture the alias
alias = ""

#record the number of parentheses as they open and close
paren_cnt = 0


for char in subquery:
    
    #if ( and there's already been a ( , include it
    if char == '(' and paren_cnt > 0:
        res_sub += char
    
    #if (, add to the count
    if char == '(':
        paren_cnt += 1
   
    # if ) and there's at least 2 (, include it
    if char == ')' and paren_cnt > 1:
        res_sub += char
          
    # if ), subtract from the count        
    if char == ')':
        paren_cnt -= 1
    
    # capture the script
    if char != '(' and char != ')' and paren_cnt >0:
        res_sub += char
    
    # capture the alias
    if char != '(' and char != ')'  and char != ' ' and paren_cnt == 0:
        alias += char
        
subparsed = sqlparse.parse(res_sub)[0]

This then removes the outermost parentheses and parses as a new SQL statement. This is all working, and if I run this parsed statement through the previous block of code manually it works as expected.

I've then tried to put this into separate functions:

  • First parses the query and calls:
  • A function that scans the FROM clause and returns the tables, but if it identifies a subquery, it calls:
  • A function which removes the outermost parentheses of the script and then calls the first function to send it back through the process.

But what happens is when it tries to run sqlparse.parse(res_sub)[0] is tuple index out of range. It shouldn't be a tuple, it should be a str which is then parsed into sqlparse.sql.Statement.

I don't understand why it's behaving differently just because I've put it into a series of functions. Functions code below:

def parse(sql):
    
    parsed = sqlparse.parse(sql)[0]
        
    #call function to assess the FROM statement of the query
    assess_from_clause(parsed)

def assess_from_clause(parsed):
    
    full_tables = []
    tables = []
    
    from_seen = False
    for item in parsed.tokens:
        #stop the process if the Where statement is reached
        if isinstance(item, Where):
            from_seen = False
        
        #checks to see if there is a parenthesis, meaning a subquery 
        if 'SELECT' in item.value.upper():
            subquery = item.value
            subquery_parsing(subquery)
        
        if from_seen:
            #multiple tables with Join statements in between, or one table. Doesn't consider subqueries
            if isinstance(item, Identifier):
                
                #returns the db name 
                tables.append(item.get_parent_name())
            
                #returns the table name
                tables.append(item.get_real_name())

                #returns the alias
                tables.append(item.get_alias())
            
                full_tables.append(tables)
                tables = []
        
        
            # if multiple tables separated by comma's will be an identifier list. Doesn't consider subqueries
            if isinstance(item, IdentifierList):
                for identifier in item.get_identifiers():
                    #returns the db name
                    tables.append(identifier.get_parent_name())
                
                    #returns the table name
                    tables.append(identifier.get_real_name())
                
                    #returns the alias
                    tables.append(identifier.get_alias())
                
                    full_tables.append(tables)
                    tables = []
                 
        else:
            if item.ttype is Keyword and item.value.upper() == 'FROM':
                from_seen = True
       
    print(full_tables)

def subquery_parsing(subquery):
    
    #new subquery string ready to parse
    res_sub = ''

    #capture the alias
    alias = ''

    #record the number of parentheses as they open and close
    paren_cnt = 0


    for char in subquery:
        #if ( and there's already been a ( , include it
        if char == '(' and paren_cnt > 0:
            res_sub += char
    
        #if (, add to the count
        if char == '(':
            paren_cnt += 1
   
        # if ) and there's at least 2 (, include it
        if char == ')' and paren_cnt > 1:
            res_sub += char
          
        # if ), subtract from the count        
        if char == ')':
            paren_cnt -= 1
    
        # capture the script
        if char != '(' and char != ')' and paren_cnt >0:
            res_sub += char
    
        # capture the alias
        if char != '(' and char != ')'  and char != ' ' and paren_cnt == 0:
            alias += char
    
    parse(res_sub)

I should stress I'm not proficient in Python and very much learning as I go!

Thanks

2

There are 2 best solutions below

0
On

I believe I've resolved it now, the section that triggers the third function was triggering too early and not parsing the subquery of the code.

I've changed from this:

def assess_from_clause(parsed):
    
    full_tables = []
    tables = []
    
    from_seen = False
    for item in parsed.tokens:
        #stop the process if the Where statement is reached
        if isinstance(item, Where):
            from_seen = False
        
        #checks to see if there is a parenthesis, meaning a subquery 
        if 'SELECT' in item.value.upper():
            subquery = item.value
            subquery_parsing(subquery)
        
        if from_seen:

To this:

def assess_from_clause(parsed):
    
    full_tables = []
    tables = []
    
    from_seen = False
    for item in parsed.tokens:
        #stop the process if the Where statement is reached
        if isinstance(item, Where):
            from_seen = False
        
        if from_seen:
        
            #checks to see if there is a parenthesis, meaning a subquery 
            if 'SELECT' in item.value.upper():
                subquery = item.value
                subquery_parsing(subquery)

Sorry it's very much trial and error learning for me at the moment, thanks Barmar for your comments.

0
On

This is trivial with my library SQLGlot

import sqlglot
import sqlglot.expressions as exp

sql = """select * from luv_main.test_table left join (select * from luv_all.fake_Table where (a = b)) x  where a = 4 order by A, B, C"""

for column in sqlglot.parse_one(sql).find_all(exp.Table):
    print(column.text("this"))

fake_Table
test_table