Validating LTREE hierarchies in PostgreSQL

319 Views Asked by At

I'm new to hierarchies in general and LTREE in particular. As I've been converting and loading a column of text-based hierarchies into an LTREE column, I noticed a poorly-formatted string.

create table test_tree(id int, path ltree);
insert into test_tree values (1, '1');
insert into test_tree values (1, '1.1');
insert into test_tree values (1, '1.2.0'); --should be '1.2'
insert into test_tree values (1, '1.2.1');
insert into test_tree values (1, '1.2.2.0'); --should be '1.2.2'
insert into test_tree values (1, '1.2.2.1');
insert into test_tree values (1, '1.2.2.2');

This results in some unexpected behavior.

select path from test_tree where path <@ '1';

returns descendants, i.e.:

1
1.1
1.2.0
1.2.1
1.2.2.0
1.2.2.1
1.2.2.2

Whereas:

select path from test_tree where path @> '1.2.2.2';

only returns

1.2.2.2

I would expect <@ '1' to returns results consistent with @> '1.2.2.2'. In this instance how can an ancestor know its descendants, but a descendant not know its ancestors? Why does <@ '1' return all offspring (seemingly ignoring the missing '1.2.2') but @> '1.2.2.2' return no ancestors?

Moreover, how can I find these missing relationships in LTREE datatypes?

2

There are 2 best solutions below

0
Bergi On BEST ANSWER

The ltree operators do not care what values you have in your table or not. They only compare two ltree values. '1' @> '1.2.2.2' is true, '1.2.2.2' @> '1.2.2.2' is true, '1.1' @> '1.2.2.2' is not.

But a SELECT query returns only the rows that actually exist in your table. '1.2' @> '1.2.2.2' and '1.2.2' @> '1.2.2.2' would have been true as well, however those two values do not exist in your table, so they cannot be found. The @>/<@ operators do not construct new rows.

To actually construct all possible ancestors of an ltree value, not just those that are part of your table, you can use

SELECT subpath(p, 0, generate_series(1, nlevel(p)))

(online demo)

You also seem to have assumed an implicit constraint from using ltree columns that the parent value exists in the same column of the table. This is not possible, in relational databases rows are independent from each other: an ltree value is not a reference to another row plus the last label, it really is just a list of labels; every row in your table stores the complete label path. Using a specific type for a column cannot introduce a constraint, you'd have to do that yourself - either as a complicated foreign key from a generated column, or using a trigger.

How can I find these missing relationships in LTREE datatypes?

You can find such missing rows in your table (not in the datatype) using

SELECT path, array_agg(id) AS required_by
FROM (
  SELECT id, subpath(path, 0, generate_series(1, nlevel(path)-1)) AS path
  FROM test_tree
  ORDER BY path, id
) AS all_parent_paths
WHERE NOT EXISTS (SELECT * FROM test_tree WHERE path = all_parent_paths.path)
GROUP BY path

(online demo)

0
msmer On

I also faced with problem of data inconsistency when using ltree extension. As Bergi rightly said, the tree does not care or check what you put in this field, and that this problem can be solved with triggers.

In order to ensure data consistency in a table containing an ltree field, you need to:

  • Make the path field not null;
  • Add a unique index for the path field;
  • Add triggers:
    • BEFORE INSERT and BEFORE UPDATE, which check that the last label in the path field is the id of the record, and that there exists a record in the table with a path corresponding to the path of the record without the last label.
    • AFTER UPDATE, which updates the path of all child records when the path (location) of a parent record is changed.
    • BEFORE DELETE, which prevents records with child records from being deleted.

I went through it and have shared my experience in this story: Ensuring data consistency when using PostgreSQL ltree extension. Hope it will help you. This is a good solution, but first you will need to get the data in a consistent state.