Remove rows containing null values efficiently

665 Views Asked by At

What is the most efficient way to remove rows containing any null values, for a table with many rows?

Additionally, why do I get the error

ERROR: 'time (attempt to use variable time without defining/assigning first

even though I have nothing named as ‘time’ when using the functional select

?[t;{(not;(null;x))} each cols t; 0b; ()]

given in this S/Oanswer?

1

There are 1 best solutions below

0
SJT On BEST ANSWER

Is your table in memory or splayed on disk?

If your table is in memory, then a simple form might also be the fastest.

show t:flip`a`b`c`d`e!(1+til 100)*/:100 cut 500?(1 0N)where 19 1
a  b  c  d  e
--------------
1  1     1  1
2  2  2  2  2
3  3  3  3  3
4  4  4  4  4
5  5  5  5  5
6  6  6  6  6
7  7  7  7  7
8  8  8  8  8
9  9  9  9  9
10 10 10 10 10
11 11 11 11 11
12 12 12 12 12
13    13 13 13
14 14 14 14 14
15 15 15 15 15
16 16    16 16
   17 17 17 17
18 18 18 18 18
..

q)\ts:1000 ?[t;{(not;(null;x))}each cols t;0b;()]
5 7552
q)\ts:1000 t where not any null flip t
2 6768

If the table is splayed on disk, the game changes. In a splayed table each column is stored as a file. The second argument of the Select operator is a list of constraints; each one a parse tree. Whereas null flip t above tests every cell of t, Select applies is constraints successively: only rows that have passed previous constraints are tested for the next. If no rows satisfy the first constraint/s subsequent column files may never need to be read at all.

From this follows the rule: list your constraints in descending order of restrictiveness – if nulls are more common in certain columns, test them first.

Regarding an in-memory test, note that not any can be a good deal faster than all not, depending on whether nulls are common.

q)q:null flip t
q)\ts:10000 all not q
9 1168
q)\ts:10000 not any q
6 656