Aggregate the same column in multiple different ways

1.2k Views Asked by At

I am trying to get an array of categories associated with each product and then also get the top-level parent category of each product in another column, which by my logic is finding the same values for the categories array, but only selecting where parent_id is NULL which should pull back only one value and 1 record per id.

I really don't know the best way to structure this query. What I have kind of works, but it also shows NULL values in the parent category column for the categories that do have a parent ID and makes a second record for each product because I am forced to put it in the group by. Basically, I think I am not doing this in the correct or most efficient way.

Desired result:

+----+----------------+------------------+------------------------------------------------+------------------+
| id | name           | category_ids     | category_names                                 | parent_category  |
+----+----------------+------------------+------------------------------------------------+------------------+
| 1  | Product Name 1 | {111,222,333}    | {Electronics, computers, computer accessories} | Electronics      |
+----+----------------+------------------+------------------------------------------------+------------------+

My current query (which is not ideal):

select p.id, 
p.name, 
array_agg(category_id) as category_ids,
regexp_replace(array_agg(c.name)::text,'"|''','','gi') as category_names,
c1.name as parent_category
from products p
join product_categorizations pc  on pc.product_id = p.id
join categories c  on pc.category_id = c.id
full outer join (
   select name, id from categories
   where parent_id is null and name is not null
   ) c1 on c.id = c1.id
group by 1,2,5;
+----+----------------+------------------+-----------------------------------+------------------+
| id | name           | category_ids     | category_names                    | parent_category  |
+----+----------------+------------------+-----------------------------------+------------------+
| 1  | Product Name 1 | {111}            | {Electronics}                     | Electronics      |
+----+----------------+------------------+-----------------------------------+------------------+
| 1  | Product Name 1 | {222,333}        | {computers, computer accessories} | NULL             |
+----+----------------+------------------+-----------------------------------+------------------+
1

There are 1 best solutions below

1
On BEST ANSWER

Replace the FULL JOIN with an aggregate FILTER clause:

SELECT p.id
     , p.name
     , array_agg(pc.category_id) AS category_ids
     , string_agg(c.name, ', ')  AS category_names  -- regexp_replace .. ?
     , min(c.name) FILTER (WHERE c.parent_id IS NULL) AS parent_category
FROM   products                p
JOIN   product_categorizations pc ON pc.product_id = p.id
JOIN   categories              c  ON pc.category_id = c.id
GROUP  BY p.id;

See:

(Why would you add AND name IS NOT NULL? Either way, min() ignores NULL values anyway.)

While aggregating all products, and while referential integrity is enforced, this should be a bit faster:

SELECT p.name, pc.*
FROM   products p
JOIN  (
   SELECT pc.product_id AS id
        , array_agg(pc.category_id) AS category_ids
        , string_agg(c.name, ', ')  AS category_names
        , min(c.name) FILTER (WHERE c.parent_id IS NULL) AS parent_category
   FROM   product_categorizations pc
   JOIN   categories              c  ON pc.category_id = c.id
   GROUP  BY 1
   ) pc  USING (id);

The point being that product only joins after aggregating rows.

Aside: "name" is not a very helpful column name. Related: