The following returns a 10% sample of the A and X columns stratified by the values of X.
select A, X from(
select A,
count(*) over (partition by X) as cnt,
rank() over (partition by X order by rand()) as rnk
from my_table) table
where rnk <= cnt*0.1
In other words, if X takes the values [X0, X1] it returns the union of:
- 10% of the the rows where
X = X0 - 10% of the the rows where
X = X1
How can I stratify my query by values of tuples for several columns (e.g. X, Y)?
For example, if X takes values [X0, X1] and Y takes values [Y0, Y1], I would like to get a sample that is the union of:
- 10% of the the rows where
X = X0andY=Y0 - 10% of the the rows where
X = X0andY=Y1 - 10% of the the rows where
X = X1andY=Y0 - 10% of the the rows where
X = X1andY=Y1
I'd use your method above, but use a hash of the columns you'd like to consider.