Storing an arithmetic parse tree in SQL

420 Views Asked by At

SQL is obviously not meant for this but I'm wondering if it's possible to do this with SQL as a sort of challenge. Given an arithmetic parse tree such as:

7-2*3

Which could be represented as:

    -
   / \
  7   *
     / \
    2   3

And stored in SQL as the following (assuming +-/* binary operations on integers only):

CREATE TABLE expression_tree AS (
    SELECT * FROM VALUES
    (1, 1, '-', NULL, NULL), -- "-" at top
    (1, 2, '7', 'L', 1), 
    (1, 3, '*', 'R', 1), 
    (1, 4, '2', 'L', 3), 
    (1, 5, '3', 'R', 3)
    AS tmp(expression_id, node_id, value, side, parent_node_id)
)

┌───────────────┬─────────┬───────┬──────┬────────────────┐
│ expression_id ┆ node_id ┆ value ┆ side ┆ parent_node_id │
╞═══════════════╪═════════╪═══════╪══════╪════════════════╡
│             1 ┆       1 ┆ -     ┆      ┆                │
│             1 ┆       2 ┆ 7     ┆ L    ┆              1 │
│             1 ┆       3 ┆ *     ┆ R    ┆              1 │
│             1 ┆       4 ┆ 2     ┆ L    ┆              3 │
│             1 ┆       5 ┆ 3     ┆ R    ┆              3 │
└───────────────┴─────────┴───────┴──────┴────────────────┘

Would it be possible to write a WITH RECURSIVE cte that could return 7-2*3 for the above? To start I would have something like:

WITH RECURSIVE expression(node_id, node_value, expr_value) AS (
    SELECT node_id, value, value FROM expression_tree WHERE parent_node_id IS NULL
    UNION ALL
    SELECT et.node_id, et.value, 
        CASE WHEN side='L' 
          THEN CONCAT(value, expr_value) 
          ELSE CONCAT(expr_value, value) 
        END
        FROM expression_tree et JOIN expression e ON  et.parent_node_id=e.node_id
) 
SELECT * FROM expression

┌─────────┬────────────┬────────────┐
│ node_id ┆ node_value ┆ expr_value │
╞═════════╪════════════╪════════════╡
│       1 ┆ -          ┆ -          │
│       2 ┆ 7          ┆ 7-         │
│       3 ┆ *          ┆ -*         │
│       4 ┆ 2          ┆ 2-*        │
│       5 ┆ 3          ┆ -*3        │
└─────────┴────────────┴────────────┘

Here is the start of it: https://sqlfiddle.com/sql-server/online-compiler?id=fad7e1c0-985a-49c6-9f1b-fbbe9b83c5eb.

My difficulty in the above is basically that I need both node child node values at the same time otherwise, it will keep overwriting the L child when it gets the R child. How could that be fixed?

6

There are 6 best solutions below

5
Charlieface On BEST ANSWER

Recursive aggregation is difficult if not impossible using a recursive CTE. It is often easier and more efficient to use a temp table and loop until you have no rows left to aggregate.

If you want to keep it as a single query, then in SQL Server you can use a recursive scalar function.

This is made more complicated by the fact that the table is not properly normalized. The values and operators should be in separate columns, if not separate tables. Also expression_id seems unnecessary, as the root node_id should uniquely identify an expression.

  • If the value of the node is convertible to a number then use that.
  • Else take the child nodes and pivot them up into left and right...
  • ... taking in each case the recursive value of the node.
  • ... And use the correct operator to combine them.
CREATE OR ALTER FUNCTION dbo.CalculateNode (@node_id bigint)
RETURNS bigint
AS BEGIN
  RETURN (
    SELECT
      CASE
        WHEN et.value NOT LIKE '%[^0-9]%'
        THEN
          TRY_CAST(et.value AS bigint)
        ELSE
        (
          SELECT
            CASE et.value
              WHEN '+' THEN L + R
              WHEN '-' THEN ISNULL(L, 0) - R
              WHEN '*' THEN L * R
              WHEN '/' THEN L / R
              WHEN '%' THEN L % R
            END
          FROM (
              SELECT
                MIN(CASE WHEN child.side = 'L' THEN v.value END) AS L,
                MIN(CASE WHEN child.side = 'R' THEN v.value END) AS R
              FROM expression_tree child
              CROSS APPLY (SELECT dbo.CalculateNode(child.node_id)) v(value)
              WHERE child.parent_node_id = et.node_id
          ) children
        )
      END
    FROM expression_tree et
    WHERE et.node_id = @node_id
  );
END;

Finally just select the parent node

SELECT et.expression_id, dbo.CalculateNode(et.node_id) AS final_value
FROM expression_tree et
WHERE et.parent_node_id IS NULL;

db<>fiddle

3
Charlieface On

As I mentioned in my other answer, it is also possible to do this using a temp table. This is again likely more efficient and easier than using a recursive CTE.

Note that the following solution is set-based where possible. The number of loops is only up to the max depth of the tree, not for each node.

First, insert all the data into a temp table.

CREATE TABLE #tree (
  node_id bigint PRIMARY KEY,
  value bigint,
  op varchar(2),
  side char(1),
  parent_node_id bigint,
  INDEX IX (parent_node_id) INCLUDE (side, value),
  CHECK (parent_node_id <> node_id)
);

INSERT #tree
SELECT
  et.node_id,
  CASE WHEN et.value NOT LIKE '%[^0-9]%' THEN et.value END,
  CASE WHEN et.value LIKE '%[^0-9]%' THEN et.value END,
  et.side,
  et.parent_node_id
FROM expression_tree et
WHERE et.expression_id = 1;

Then continuously loop as follows:

  • Take all nodes which have parent nodes, and aggregate by parent node
  • ... where there are no missing values
  • ... and pivot up the left and right side.
  • Join back to the table, and make the calculation based on the operator.
  • Make the update, storing in variables the final result along with whether it has a parent.
  • Break when nothing to update or we got the parent.
DECLARE @final bigint, @parent bigint;

WHILE 1=1
BEGIN
    WITH child AS (
        SELECT
          child.parent_node_id,
          MIN(CASE WHEN child.side = 'L' THEN child.value END) AS L,
          MIN(CASE WHEN child.side = 'R' THEN child.value END) AS R
        FROM #tree child
        WHERE child.parent_node_id IS NOT NULL
        GROUP BY
          child.parent_node_id
        HAVING COUNT(CASE WHEN child.value IS NULL THEN 1 END) = 0
    )
    UPDATE t
    SET @final = v.final,
        value = v.final,
        @parent = t.parent_node_id
    FROM #tree t
    JOIN child c ON c.parent_node_id = t.node_id
    CROSS APPLY (
        SELECT CASE t.op
              WHEN '+' THEN L + R
              WHEN '-' THEN ISNULL(L, 0) - R
              WHEN '*' THEN L * R
              WHEN '/' THEN L / R
              WHEN '%' THEN L % R
          END
    ) v(final)
    WHERE t.value IS NULL;
    
    IF @@ROWCOUNT = 0 OR @parent IS NULL
        BREAK;
END;

SELECT @final;

db<>fiddle

2
ADTC On

Note: For a solution, see my other answer: https://stackoverflow.com/a/78238561/1134080

What comes below is more of a lengthy comment than a solution.


Is it mandatory to represent the parse tree in a table with 5 rows the way you did?

If the smallest possible subtree always has three nodes connected by two vectors (except for the single-node edge case), we could use a different representation which only requires 2 rows.

Columns:
EI: expression_id      PK
SI: subtree_id         PK
LV: left_node_value    string   nullable
PI: parent_subtree_id  FK       nullable
PS: parent_side        'L'|'R'  nullable
PV: parent_node_value  string   nullable
RV: right_node_value   string   nullable

Rows:
EI │ SI │ LV  │  PI  │  PS  │  PV  │ RV
 1 │  1 │ '7' │ NULL │ NULL │  '-' │ '*'
 1 │  2 │ '2' │    1 │  'R' │ NULL │ '3'

Constraints:

  1. Either PI or PV must be filled, but not both at the same time.
  2. If PI is filled, PS must be filled.
  3. Every combination of PI and PS (including NULL, NULL) must be unique for each EI.
  4. If LV is null, PI, PS and RV must be null, with PV filled. This is for edge case of a single-node expression like a.

It would be relatively easier to build the expression whether recursively or iteratively.

Iterative approach (very easy):

  1. Start with SI 1: Build expression 7 - * (an expandable array-like data structure would be best to hold it).
  2. Move to SI 2: Replace * in previous expression to get 7 - (2 * 3)
  3. Once all rows are exhausted, concatenate everything in the array-like data structure into a string.

Recursive approach (a bit more complex):

  1. Start with SI 1 which has NULL, NULL for PI, PS: Check if the left and right nodes are terminal or non-terminal.
  2. If a non-terminal node is present, find the row that matches the node. That is, EI, SI, side of node (L or R) of current node matches EI, PI, PS of found node.
    • If node found, recurse to step 1 on the found node and repeat.
    • If no node found, return the current node's expression itself.
  3. If both nodes are terminal, return their combination up the chain until we get back to SI 1.

PS: It looks like there's a lack of clarity whether you want the function to return the expression or the computed value. Your question is asking for expression, but many answers and comments are talking about computing the value. I'm assuming you want the expression, hence I'm using the strings.

5
ADTC On

Here's a possible solution made while sticking to the original question scenario as closely as possible:

db<>fiddle (Note: tested with PostgreSQL 16.)

The result is in the last column of the last row in the table given by this SQL. The way I do this is by flattening the left and right into single row, and also by using REPLACE[node_id] as a method of communicating from higher row down to the lower row.

example result

WITH RECURSIVE expression(left_node_id, right_node_id, left_value, right_value, expr_value) AS (
    SELECT -- parent_node_id,
           node_id as left_node_id,
           node_id as right_node_id,
           value, value, CONCAT('REPLACE[', node_id, ']')
      FROM expression_tree WHERE parent_node_id IS NULL
    UNION ALL
    SELECT -- et.parent_node_id,
           et.left_node_id, et.right_node_id,
           et.left_value, et.right_value, 

  REPLACE( expr_value, CONCAT('REPLACE[', et.parent_node_id, ']'),

  CONCAT(CASE WHEN et.left_value IN ('+', '-', '*', '/')
          THEN CONCAT('REPLACE[', et.left_node_id, ']')
          ELSE et.left_value
         END, 

  CASE WHEN et.parent_node_id=e.left_node_id
               THEN e.left_value
               ELSE e.right_value
            END,

  CASE WHEN et.right_value IN ('+', '-', '*', '/')
          THEN CONCAT('REPLACE[', et.right_node_id, ']')
          ELSE et.right_value
         END)

  ) -- END REPLACE
  
  FROM (
          SELECT etl.expression_id, etl.parent_node_id,
            etl.node_id as left_node_id,
            etr.node_id as right_node_id,
            etl.value as left_value,
            etr.value as right_value
          FROM (SELECT * FROM expression_tree WHERE side='L') etl
          JOIN (SELECT * FROM expression_tree WHERE side='R') etr
          ON etl.expression_id = etr.expression_id
          AND etl.parent_node_id = etr.parent_node_id
        ) et JOIN expression e
               ON  et.parent_node_id=e.left_node_id
               OR  et.parent_node_id=e.right_node_id
) 
SELECT * FROM expression;

Note: You cannot add parent_node_id in the UNION ALL or the server will run out of heap space. I think it will become infinite recursion, maybe.

Note: This assumes that brackets (parenthesis) is not necessary to represent the parse tree. If they are necessary, some additional logic will need to be added to detect such scenario correctly and add the brackets:

db<>fiddle

example result with brackets

-- same as before

  CONCAT(
  CASE WHEN
      (CASE WHEN et.parent_node_id=e.left_node_id
          THEN e.left_value
          ELSE e.right_value
       END)
  IN ('+', '-') THEN '(' ELSE '' END,
  
  -- same three cases as before

  CASE WHEN
      (CASE WHEN et.parent_node_id=e.left_node_id
          THEN e.left_value
          ELSE e.right_value
       END)
  IN ('+', '-') THEN ')' ELSE '' END)
  ) -- END REPLACE

-- same as before

To avoid adding brackets at the root level, add AND et.parent_node_id != 1 to both cases above (after WHEN .. IN (..) before THEN) then you will have 7-2*3 in this example, but you can also have (7-2)*3 in a different parse tree where the subtraction should be done before the multiplication.

Final

This will be the final answer, incorporating ALL of the above, and also correcting for expression_id not being checked in above versions:

db<>fiddle

two expressions result

The second expression has the parse tree:

    *
   / \
  -   3
 / \
7   2

Bonus:

Just the final expression values only: db<>fiddle

bonus results

2
ValNik On

See example. Recursive assembling expression formula from expression_tree table.

with triple as( --  recursive if PostgreSql
    select expression_id,oid
        ,concat(
            case when lv in('*','+','-','/') then concat('{',lid,'}') 
            else cast(lv as varchar) end 
            ,ov
            ,case when rv in('*','+','-','/') then concat('{',rid,'}') 
             else cast(rv as varchar) end
            )frm
    from(
        select o.expression_id,o.node_id oId,o.value oV,o.side oSide
          ,l.node_id lId,l.value lV,l.side lSide, r.node_id rId,r.value rV,r.side rSide
        from expression_tree O
        inner join expression_tree L on l.expression_id=o.expression_id
              and l.parent_node_id=o.node_id and l.side='L'
        inner join expression_tree R on r.expression_id=o.expression_id 
            and r.parent_node_id=o.node_id and r.side='R'
    )t
)
,r as(
    select 0 lvl, expression_id,oid,cast(frm as varchar)frm --,lid,rid
    from triple
    union all
    select lvl+1,r.expression_id,r.oid
        ,cast(
          replace(r.frm,concat('{',t1.oid,'}'),concat('(',t1.frm,')'))
            as varchar)frm
    from r 
    inner join triple t1 on t1.expression_id=r.expression_id 
        and r.frm like concat('%{',t1.oid,'}%')
)
select * 
from (select *,row_number()over(partition by expression_id order by lvl desc) rn from r)x
where rn=1 ;

For test I take 4 expressions:

  1. 7-(2*3)
  2. 9-(3*(4+5))
  3. (13+14)*(15/16)
  4. (6+(3/3)-2*3)

Query from 3 parts:

1.Take all three-nodes (triple) <Left(l)-parent(o)-right(r)>. Parent is operation ('+','-','*','/')

expression_id oid frm
1 1 "7-{3}"
1 3 "2*3"
2 1 "9-{3}"
2 3 "3*{5}"
2 5 "4+5"
3 1 "{2}*{3}"
3 2 "13+14"
3 3 "15/16"
4 1 "{2}-{3}"
4 2 "6+{7}"
4 3 "2*3"
4 7 "3/3"
  1. Recursive assembling the formula (CTE r).
    There frm - expression formula.
    We will explicitly denote the order of execution with parentheses so that we do not have to take into account the priorities of arithmetic operations.
lvl expression_id oid frm
1 1 1 "7-(2*3)"
2 2 1 "9-(3*(4+5))"
2 3 1 "(13+14)*(15/16)"
3 4 1 "(6+(3/3))-(2*3)"

For example (13+14)*(15/16)

1.1 {2}*{3} where {2}=(13+14) and {3}=(15/16)  

Next step generate 2 rows.
You say My difficulty in the above is basically that I need both node child node values at the same time otherwise, it will keep overwriting the L child when it gets the R child. How could that be fixed?.
We can take any option. We take all rows.

2.1 (13+14)*{3} 
2.2 {2}*(15/16)    

Next step generate also 2 rows from 2 rows of previous level. And can take any of them.

3.1 (13+14)*(15/16) 
3.2 (13+14)*(15/16)

We take all rows.

  1. Take result row - any of rows with max level of recursion (row_number()...).

Recursion result (full)

lvl expression_id oid frm (formula) rn
0 1 3 "2*3" 2
0 1 1 "7-{3}" 3
1 1 1 "7-(2*3)" 1
0 2 3 "3*{5}" 6
0 2 5 "4+5" 4
0 2 1 "9-{3}" 5
1 2 3 "3*(4+5)" 2
1 2 1 "9-(3*{5})" 3
2 2 1 "9-(3*(4+5))" 1
0 3 3 "15/16" 6
0 3 2 "13+14" 5
0 3 1 "{2}*{3}" 7
1 3 1 "{2}*(15/16)" 4
1 3 1 "(13+14)*{3}" 3
2 3 1 "(13+14)*(15/16)" 1
2 3 1 "(13+14)*(15/16)" 2
0 4 1 "{2}-{3}" 10
0 4 2 "6+{7}" 12
0 4 3 "2*3" 11
0 4 7 "3/3" 13
1 4 1 "{2}-(2*3)" 8
1 4 1 "(6+{7})-{3}" 9
1 4 2 "6+(3/3)" 7
2 4 1 "(6+{7})-(2*3)" 4
2 4 1 "(6+(3/3))-{3}" 6
2 4 1 "(6+{7})-(2*3)" 5
3 4 1 "(6+(3/3))-(2*3)" 1
3 4 1 "(6+(3/3))-(2*3)" 3
3 4 1 "(6+(3/3))-(2*3)" 2

Tested for SQL Server and PostgreSQL

Demo

Upd1. Answer corrected after comment about testing (6+(3/3)-2*3).

0
gordy On

for the recursive query that you were going for in the question - I think the trick is to think inside-out, start with the leafs and traverse to the parents, concat the values on the way up and land at the top with the whole expression, eg:

with leafs as (
    select *, s = value from expression_tree t where not exists
        (select * from expression_tree where parent_node_id = t.node_id)

    union all

    select p.*, concat(l.s, p.value, r.s) from expression_tree p
    join leafs l on l.parent_node_id = p.node_id and l.side = 'L'
    join leafs r on r.parent_node_id = p.node_id and r.side = 'R')

the sql seems perfectly clear to me but none of the databases I tried this on would execute it. I tried postgres, sql server, db2, oracle, and mariadb - none of them allow the recursive part to be referenced more than once.

@MatBailie's solution in the comments is a brilliant workaround for this. I rewrote it to make it more clear (to me) but the gist is to rank the values and then use string_agg to assemble the expression in the right order:

with r as (
    select *, x = convert(real, 0), y = convert(real, 1)
    from expression_tree where parent_node_id is null

    union all

    select t.*, iif(t.side = 'L', x - y, x + y), y / 2
    from r join expression_tree t on t.parent_node_id = r.node_id)

select string_agg(value, '') within group (order by x) from r

6+(3/3)-2*3