Stratified Sampling in Hive

4.2k Views Asked by Amelio Vazquez-Reina At 12 August 2014 at 21:50

The following returns a 10% sample of the A and X columns stratified by the values of X.

  select A, X from(
  select A, 
      count(*) over (partition by X) as cnt, 
      rank() over (partition by X order by rand()) as rnk
      from my_table) table 
  where rnk <= cnt*0.1

In other words, if X takes the values [X0, X1] it returns the union of:

10% of the the rows where X = X0
10% of the the rows where X = X1

How can I stratify my query by values of tuples for several columns (e.g. X, Y)?

For example, if X takes values [X0, X1] and Y takes values [Y0, Y1], I would like to get a sample that is the union of:

10% of the the rows where X = X0 and Y=Y0
10% of the the rows where X = X0 and Y=Y1
10% of the the rows where X = X1 and Y=Y0
10% of the the rows where X = X1 and Y=Y1

Original Q&A

There are 1 best solutions below

Jim Murphy On 22 June 2015 at 15:00

I'd use your method above, but use a hash of the columns you'd like to consider.

Stratified Sampling in Hive

There are 1 best solutions below

Related Questions in SQL

Related Questions in HIVE

Related Questions in QUBOLE

Trending Questions

Popular # Hahtags

Popular Questions