How to EFFICIENTLY query n records per category

118 Views Asked by At

To select N records per category one can do:

SELECT category, category_id, value FROM
(
    SELECT category, value, row_number() OVER (PARTITION by category) as category_id
    FROM myTable
)
WHERE  category_id < N;

The inner SELECT will first partition the records per category and assign each record per category an id called category_id. The outer query will then use the category_id to limit the number of records it queries per category.

This is extremely inefficient on BIG tables as it will be going through assigning ids to all the records even though we are only interested in N records per category.

The following does not work on the sql engine that I am working with - not sure if it works on any engine at all.

SELECT category, value, row_number() OVER (PARTITION by category) as category_id
FROM myTable
WHERE category_id < N

Does anyone know of any other ways of achieving this with a better time complexity?

More thoughts:

Time profiling the following algorithm against above query might provide more insights as to how the query runs behind the scene:

   1. SELECT DISTINCT(category) FROM myTable
   2. FOREACH category SELECT N rows

more info: my data is physically partitioned by category, being able to explicitly leverage that would be useful

2

There are 2 best solutions below

3
mustaccio On

As @Lamak mentioned in a comment, you cannot avoid sorting all rows in the table, but not for the reason stated. A sort is required to determine distinct categories by which the result set should be partitioned, and, in the absence of explicit ordering within each partition, row numbers are easily determined as a side effect of the category sort.

How the query runs "behind the scenes", or, if using the correct term, its execution plan is determined by the presence (or absence) of an index that might help avoid that category sort. If you had a covering index on (category, value), and whatever other columns you need in the result, your query would run much more efficiently.

In that latter case the simplified algorithm might look more like this:

  1. Read the pre-sorted records containing all necessary columns, including row numbers, from the index.
  2. Discard records with row number greater than n.

Your "ideal" query

SELECT category, value, row_number() OVER (PARTITION by category) as
category_id FROM myTable WHERE category_id < N

probably wouldn't run in any SQL database, because the SELECT list is processed after the WHERE clause predicates, so category_id is unknown when the predicates are evaluated.

0
Esperento57 On

Other method of rownumber, but i have doubts of performance too. I agree @mustaccio. My example take 5 rows...

select distinct f1.category, f3.*             
from yourtable f1                        
inner join lateral                                          
(                                                           
 select f2.value from yourtable f2              
 where f2.category=f1.category 
 fetch first 5 rows only                                    
) f3 on 1=1