I'm focusing on the Where clause of a SQL statement and I'm trying to pull back all the columns used in the Where clause. Below is an example SQL:
sql_2 = """Select
PERS_ID
, STF_NO
, NAME
FROM
TEST T
WHERE T.JOIN_DT >= T.POSTING_DT'
AND T.PERS_ID LIKE '%123%'
AND T.LEAVE_DT BETWEEN CURRENT_DATE - 20 AND CURRENT_DATE - 1"""
The columns I am expecting are T.JOIN_DT, T.POSTING_DT, T.PERS_ID and T.LEAVE_DT. The code below is able to pull out all but T.PERS_ID
from __future__ import print_function
import re
import sqlparse
import numpy as np
from sqlparse.sql import IdentifierList, Identifier, Function, Where, Parenthesis, TokenList, Comparison, Operation
from sqlparse.tokens import Keyword, DML, Punctuation
sql_2 = """Select
PERS_ID
, STF_NO
, NAME
FROM
TEST T
WHERE T.JOIN_DT >= T.POSTING_DT'
AND T.PERS_ID LIKE '%123%'
AND T.LEAVE_DT BETWEEN CURRENT_DATE - 20 AND CURRENT_DATE - 1"""
parsed = sqlparse.parse(sql_2)[0]
where_columns = []
full_columns = []
for item in parsed.tokens:
if isinstance(item, Where):
for condition in item.tokens:
if isinstance(condition, Identifier):
where_columns.append(condition.get_parent_name())
where_columns.append(condition.get_real_name())
full_columns.append(where_columns)
where_columns = []
if isinstance(condition, Comparison):
for breakdown in condition.tokens:
if isinstance(breakdown, Identifier):
where_columns.append(breakdown.get_parent_name())
where_columns.append(breakdown.get_real_name())
full_columns.append(where_columns)
where_columns = []
print(full_columns)
It seems to group the 'AND P.PERS_ID LIKE '%123' into one token, and I'm unsure how to break it down further to get the identifier P.PERS_ID. If you add print(condition)
after for condition in item.tokens:
you'll see what I mean.
The issue is that you have a trailing quote after T.POSTING_DT
My library SQLGlot however is able to easily extract out the columns.