This is my first question to the stack overflow.
We are doing statistics for simulation data and I wanted to implement the bootstrap in Oracle-SQL (sampling data with replacement). My approach is as follows:
- Number all rows in the query
- Create a list of random numbers within the row range
- For each random number, join a row. -> i.e. sampling with replacement
However, there seems to be a problem with joins. Oracle is losing some rows.
Here is an example. When I do the following query, I get a random ordering of the rows:
WITH basis as (
SELECT 'A' AS KPI_KEY, 2 AS KPI_VALUE FROM dual
union all
SELECT 'A' AS KPI_KEY, 5 AS KPI_VALUE FROM dual
union all
SELECT 'A' AS KPI_KEY, 3 AS KPI_VALUE FROM dual
), group_counts AS (
SELECT KPI_KEY, COUNT(*) as total_count
FROM basis
GROUP BY KPI_KEY
)
, random_numbers AS (
SELECT KPI_KEY, ceil(dbms_random.value(0, max(total_count))) AS rand_rn
FROM group_counts
GROUP BY KPI_KEY
CONNECT BY LEVEL <= total_count AND PRIOR KPI_KEY = KPI_KEY AND PRIOR dbms_random.value IS NOT NULL
)
select * from random_numbers
;
Result (correct): KPI_KEY RAND_RN 'A' 1 'A' 2 'A' 2
Now, when I join the numbered rows like the following, the result is wrong:
WITH basis as (
SELECT 'A' AS KPI_KEY, 2 AS KPI_VALUE FROM dual
union all
SELECT 'A' AS KPI_KEY, 5 AS KPI_VALUE FROM dual
union all
SELECT 'A' AS KPI_KEY, 3 AS KPI_VALUE FROM dual
), group_counts AS (
SELECT KPI_KEY, COUNT(*) as total_count
FROM basis
GROUP BY KPI_KEY
)
, numbered_rows AS (
SELECT KPI_KEY, KPI_VALUE, ROW_NUMBER()
OVER (PARTITION BY KPI_KEY ORDER BY dbms_random.value) AS rn, total_count
FROM basis
JOIN group_counts USING (KPI_KEY)
)
, random_numbers AS (
SELECT KPI_KEY, ceil(dbms_random.value(0, max(total_count))) AS rand_rn
FROM group_counts
GROUP BY KPI_KEY
CONNECT BY LEVEL <= total_count AND PRIOR KPI_KEY = KPI_KEY AND PRIOR dbms_random.value IS NOT NULL
)
SELECT rn.KPI_KEY, nr.KPI_VALUE, nr.rn
FROM random_numbers rn
LEFT JOIN numbered_rows nr
ON rn.KPI_KEY = nr.KPI_KEY
and rn.rand_rn = nr.rn
;
Result (value is random, but missing rows from join): KPI_KEY KPI_VALUE RN 'A' 5 3
Someone who has had a similar problem before? Seems like an Oracle bug to me, but maybe I missed an important detail.
I tried the materialized hint, and also the rownum approach found here: Random join in oracle
The issue is that the
GROUP BYof the penultimate query is aggregating all the generated rows into a single row and theCONNECT BYclause effectively does unnecessary work for zero benefit.This appears to be because the SQL engine in rewriting the query and if you take the output from the penultimate query, without the ultimate query, then it performs
GROUP BYthenCONNECT BYand if you perform both the penultimate and ultimate queries then it performsCONNECT BYand thenGROUP BY(in the reverse order).Gives the plan:
And:
Gives the plan:
It may be possible to fix the query and materialise the output from the penultimate query so that the
EXPLAIN PLANis consistently generated with the ultimate part of your query; however, this seems to be a fragile solution and you would probably be better to rewrite the query entirely to use something that is (a) simpler and (b) less impacted by the compiler rewriting the query.Assuming that your logic is:
For each
kpi_key:kpi_key(there is a bug in your logic usingCEIL(DBMS_RANDOM.VALUE(0, num_rows))asDBMS_RANDOMwill include the lower-bound and exclude the upper-bound so you can get values starting from 0, which you do not want to include, and rounding up tonum_rowsbut the probabilities are not even as the upper-bound is excluded; instead you want to useFLOORand add 1).Self-join the rows so that each row is joined to its random counter-part.
Then you can rewrite the query using analytic functions:
Note: There is no point randomly ordering the
ROW_NUMBERoutput as you are comparing it to a random value and you will get the same randomness comparing a static-ordered list to random values within that list as you would when comparing a randomly-ordered list to random values within that list.Which may randomly output:
Note: You probably just want the last 3 columns corresponding to the random row selection.
fiddle