How to shuffle the rows and columns of a DataFrame with a specific seed?

Suppose I have the following DataFrame, and I want to shuffle the rows and columns of the DataFrame with a specific seed value. I tried the following to obtain shuffled indexes, but it gave me a different result every time:

julia> using Random, DataFrames, StatsBase

julia> Random.seed!(123)

julia> df = DataFrame(
           col1 = [1, 2, 3],
           col2 = [4, 5, 6]
       );

julia> idx_row, idx_col = sample.(
           [1:size(df, 1), 1:size(df, 2)],
           [length(1:size(df, 1)), length(1:size(df, 2))],
           replace=false
       )
2-element Vector{Vector{Int64}}:
 [1, 2, 3]
 [2, 1]

julia> idx_row, idx_col = sample.(
           [1:size(df, 1), 1:size(df, 2)],
           [length(1:size(df, 1)), length(1:size(df, 2))],
           replace=false
       )
2-element Vector{Vector{Int64}}:
 [2, 1, 3]
 [2, 1]

As you can see, it shuffles the values, but it doesn't seem to respect the seed I set with Random.seed!(123). How can I shuffle the rows and columns of a DataFrame reproducibly, using a specific seed?

There are 2 best solutions below

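You already loaded the Random standard library, which provides the shuffle function. Random.seed!(123) seeds the global RNG only once, and every later call to sample advances that RNG's state, so consecutive calls return different permutations. Passing an explicitly seeded RNG to shuffle gives the same permutation every time: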

julia> @which shuffle
Random

julia> idx_row, idx_col = shuffle.(
           MersenneTwister(123),
           [1:size(df, 1), 1:size(df, 2)]
       )
2-element Vector{Vector{Int64}}:
 [3, 2, 1]
 [2, 1]

julia> df[idx_row, idx_col]
3×2 DataFrame
 Row │ col2   col1
     │ Int64  Int64
─────┼──────────────
   1 │     6      3
   2 │     5      2
   3 │     4      1

Because a freshly seeded MersenneTwister(123) is constructed on every call, the result is reproducible: it stays the same on every run, even though the process is random.
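To illustrate why the attempt in the question changed between calls, here is a minimal sketch using only the Random standard library. The variable names (first_call, second_call, again, a, b) are illustrative and not from the original post; no particular permutation is assumed, only that equal RNG states give equal results.

```julia
using Random

# Seed the global RNG once, then draw two permutations.
# The second call starts from an advanced RNG state, so it is not
# guaranteed to match the first one.
Random.seed!(123)
first_call  = shuffle(1:3)
second_call = shuffle(1:3)

# Re-seeding restores the initial state, so the first draw is repeated.
Random.seed!(123)
again = shuffle(1:3)
@assert again == first_call

# An explicitly seeded RNG object behaves the same way:
# two fresh generators with the same seed yield the same permutation.
a = shuffle(MersenneTwister(123), 1:3)
b = shuffle(MersenneTwister(123), 1:3)
@assert a == b
```

In other words, the seed was respected in the question; it simply was not reset between the two sample calls.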

Additional point

Note that DataFrames.jl also provides a shuffle method that accepts a DataFrame directly and shuffles its rows:

julia> shuffle(MersenneTwister(123), df)
3×2 DataFrame
 Row │ col1   col2
     │ Int64  Int64
─────┼──────────────
   1 │     3      6
   2 │     1      4
   3 │     2      5

Note that this shuffles only the rows; the column order is unchanged.

0
On

You can use any RNG you like, e.g. rng = MersenneTwister(113), and shuffle the row and column index ranges of the DataFrame with it:

julia> rng = MersenneTwister(113);

julia> r, c = shuffle.(rng, range.(1, size(df)))
([3, 1, 2], [2, 1])

julia> df[r, c]
3×2 DataFrame
 Row │ col2   col1
     │ Int64  Int64
─────┼──────────────
   1 │     6      3
   2 │     4      1
   3 │     5      2
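The same idea can be wrapped in a small helper so the seed is passed explicitly. This is only a sketch: the function name shuffle_rows_cols is hypothetical and not part of Random or DataFrames.jl. It assumes the table supports size(tbl, 1), size(tbl, 2) and tbl[rows, cols] indexing, which holds for a DataFrame and for a plain Matrix; a Matrix is used below so the example runs without extra packages.

```julia
using Random

# Hypothetical helper (not a library function): permute rows and columns
# of `tbl` using a fresh RNG built from `seed`, so equal seeds give equal results.
function shuffle_rows_cols(tbl, seed::Integer)
    rng  = MersenneTwister(seed)
    rows = shuffle(rng, 1:size(tbl, 1))   # permuted row indices
    cols = shuffle(rng, 1:size(tbl, 2))   # permuted column indices
    return tbl[rows, cols]                # returns a new object; `tbl` is untouched
end

# Stand-in data with the same values as the DataFrame in the question.
m  = [1 4; 2 5; 3 6]
s1 = shuffle_rows_cols(m, 123)
s2 = shuffle_rows_cols(m, 123)
@assert s1 == s2                          # same seed, same result
```

With the DataFrame from the question, the call would be shuffle_rows_cols(df, 123). Building the RNG inside the function keeps the result independent of any earlier use of the global RNG.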