Is there a good way to parse sql case statement?

625 Views Asked by At

The goal is to parse the sql statement in Python, and I have some difficulties using sqlparse to parse the case statement. I have tried the method in this link Parse CASE WHEN statements with sqlparse, but the result becomes messy when the data set becomes large. For example, if the case statement is like:

CASE
           WHEN ticket.status IN ('new',
                                  'assigned')
                AND ticket_custom.ticket = 3 THEN “Win”

the parsed result is shown as below:

([<Keyword 'WHEN' at 0x1870AFBA6A8>, <Whitespace ' ' at 0x1870AFBA708>, <Comparison 'ticket...' at 0x1870AFF55C8>, <Newline ' ' at 0x1870AFBC8E8>, <Whitespace ' ' at 0x1870AFBC948>, <Whitespace ' ' at 0x1870AFBC9A8>, <Whitespace ' ' 
at 0x1870AFBCA08>, <Whitespace ' ' at 0x1870AFBCA68>, <Whitespace ' ' at 0x1870AFBCAC8>, <Whitespace ' ' at 0x1870AFBCB28>, <Whitespace ' ' at 0x1870AFBCB88>, <Whitespace ' ' at 0x1870AFBCBE8>, <Whitespace ' ' at 0x1870AFBCC48>, <Whitespace ' ' at 0x1870AFBCCA8>, 
<Whitespace ' ' at 0x1870AFBCD08>, <Whitespace ' ' at 0x1870AFBCD68>, <Whitespace ' ' at 0x1870AFBCDC8>, <Whitespace ' ' at 0x1870AFBCE28>, <Whitespace ' ' at 0x1870AFBCE88>, <Whitespace ' ' at 0x1870AFBCEE8>, <Keyword 'AND' at 0x1870AFBCF48>, <Whitespace ' ' at 0x1870AFBCFA8>, <Comparison 'ticket...' at 0x1870AFF5648>, <Whitespace ' ' at 0x1870AFBD2E8>], [<Keyword 'THEN' at 0x1870AFBD348>, <Whitespace ' ' at 0x1870AFBD3A8>, <Error '“' at 0x1870AFBD408>, <Identifier 'Win' at 0x1870AFEF048>, <Error '”' at 0x1870AFBD4C8>, <Newline ' ' at 0x1870AFBD528>, <Whitespace ' ' at 0x1870AFBD588>, <Whitespace ' ' at 0x1870AFBD5E8>, <Whitespace ' ' at 0x1870AFBD648>, <Whitespace ' ' at 0x1870AFBD6A8>, <Whitespace ' ' at 0x1870AFBD708>, <Whitespace ' ' at 0x1870AFBD768>, <Whitespace ' ' at 0x1870AFBD7C8>, <Whitespace ' ' at 0x1870AFBD828>, <Whitespace ' ' at 0x1870AFBD888>, <Whitespace ' ' at 0x1870AFBD8E8>, <Whitespace ' ' 
at 0x1870AFBD948>]), 

You can imagine what the result would become when the sql statement becomes complex. And the result I would like to have is actually similiar to the way that sql_metadata parse the subqueries:

{'case statement name':"case statement content" }

the sql_metadata and sqlparse could parse subqueries like this but not for the case statement, is there any other way to do that in Python?

Here is the complex example:

SELECT '%c%' AS Chapter = 
    CASE
           WHEN ticket.status IN ('new',
                                  'assigned')
                AND ticket_custom.ticket = 3 THEN “Win”
           WHEN ticket.status IN ('new',
                                  'assigned')
                AND ticket_custom.ticket = 2 THEN “Lose”
           WHEN ticket.status IN ('new',
                                  'assigned')
                AND ticket_custom.ticket = 1 THEN “Draw”
           WHEN ticket.status IN ('new',
                                  'assigned')
                AND EmployeeConsolidated.id= 49887 THEN “Undefined”
       END AS Feature,
       ticket_custom.ticket,
       ticket_custom.mn,
       ticket_custom.value,
       ticket_custom.category,
       SUM(CASE
               WHEN ticket.status IN ('new', 'assigned') THEN 1
               ELSE 0
           END) AS NEW,
       SUM(CASE
               WHEN ticket.status IN ('new', 'assigned') THEN 1
               ELSE 0
           END) AS OLD,
       SUM(CASE
               WHEN ticket.status='closed' THEN 1
               ELSE 0
           END) AS 'Closed',
       count(id) AS Total,
       EmployeeConsolidated.id,
       EmployeeConsolidated.rate,
       ticket.id AS _id
FROM engine.ticket
INNER JOIN engine.ticket_custom ON ticket.id = ticket_custom.ticket
LEFT JOIN
  (SELECT e.employee_id AS "Employee #" ,
          e.first_name || ' ' || e.last_name AS "Name" ,
          e.email AS "Email" ,
          e.phone_number AS "Phone" ,
          TO_CHAR(e.hire_date, 'MM/DD/YYYY') AS "Hire Date" ,
          TO_CHAR(e.salary, 'L99G999D99', 'NLS_NUMERIC_CHARACTERS = ''.,'' NLS_CURRENCY = ''$''') AS "Salary" ,
          e.commission_pct AS "Comission %" ,
          'works as ' || j.job_title || ' in ' || d.department_name || ' department (manager: ' || dm.first_name || ' ' || dm.last_name || ') and immediate supervisor: ' || m.first_name || ' ' || m.last_name AS "Current Job" ,
          TO_CHAR(j.min_salary, 'L99G999D99', 'NLS_NUMERIC_CHARACTERS = ''.,'' NLS_CURRENCY = ''$''') || ' - ' || TO_CHAR(j.max_salary, 'L99G999D99', 'NLS_NUMERIC_CHARACTERS = ''.,'' NLS_CURRENCY = ''$''') AS "Current Salary" ,
          l.street_address || ', ' || l.postal_code || ', ' || l.city || ', ' || l.state_province || ', ' || c.country_name || ' (' || r.region_name || ')' AS "Location" ,
          jh.job_id AS "History Job ID" ,
          'worked from ' || TO_CHAR(jh.start_date, 'MM/DD/YYYY') || ' to ' || TO_CHAR(jh.end_date, 'MM/DD/YYYY') || ' as ' || jj.job_title || ' in ' || dd.department_name || ' department' AS "History Job Title"
   FROM employees e -- to get title of current job_id

   JOIN jobs j ON e.job_id = j.job_id -- to get name of current manager_id

   LEFT JOIN employees m ON e.manager_id = m.employee_id -- to get name of current department_id

   LEFT JOIN departments d ON d.department_id = e.department_id -- to get name of manager of current department
-- (not equal to current manager and can be equal to the employee itself)

   LEFT JOIN employees dm ON d.manager_id = dm.employee_id -- to get name of location

   LEFT JOIN locations l ON d.location_id = l.location_id
   LEFT JOIN countries c ON l.country_id = c.country_id
   LEFT JOIN regions r ON c.region_id = r.region_id -- to get job history of employee

   LEFT JOIN job_history jh ON e.employee_id = jh.employee_id -- to get title of job history job_id

   LEFT JOIN jobs jj ON jj.job_id = jh.job_id -- to get namee of department from job history

   LEFT JOIN departments dd ON dd.department_id = jh.department_id
   ORDER BY e.employee_id) AS EmployeeConsolidated ON e.employee_id = engine.ide
WHERE ticket_custom.name='chapter'
  AND ticket_custom.value LIKE '%c%'
  AND TYPE='New material'
  AND milestone='1.1.12'
  AND component NOT LIKE 'internal_engine'
GROUP BY ticket.id

The desired output should be like this:

{'Feature':"Case When ticket.status IN ('new','assigned') and ticket_custom.ticket = 3 then “Win” When  ticket.status IN ('new','assigned') and ticket_custom.ticket = 2 then “Lose” When ticket.status IN  ('new','assigned') and ticket_custom.ticket = 1 then “Draw” When ticket.status IN   ('new','assigned') and EmployeeConsolidated.id= 49887 then “Undefined” End as Feature," },
{'New':"SUM(CASE WHEN ticket.status IN ('new','assigned') THEN 1 ELSE 0 END) as New," },
{'Old':"SUM(CASE WHEN ticket.status IN ('new','assigned') THEN 1 ELSE 0 END) as Old," },
.......
1

There are 1 best solutions below

0
On

You can use SQLGlot to parse your query and iterate or find different expression types.

Parsing your original example yields an AST.

(CASE ifs:
  (IF this:
    (AND this:
      (IN this:
        (COLUMN this:
          (IDENTIFIER this: status, quoted: False), table:
          (IDENTIFIER this: ticket, quoted: False)), expressions:
        (LITERAL this: new, is_string: True),
        (LITERAL this: assigned, is_string: True)), expression:
      (EQ this:
        (COLUMN this:
          (IDENTIFIER this: ticket, quoted: False), table:
          (IDENTIFIER this: ticket_custom, quoted: False)), expression:
        (LITERAL this: 3, is_string: False))), true:
    (LITERAL this: Win, is_string: True)))