Suppose I have the DataFrame df defined below, and I want to shuffle its rows and columns with a specific seed value. I tried the following to obtain shuffled indexes, but it gave me a different result every time:
julia> using Random, DataFrames, StatsBase
julia> Random.seed!(123)
julia> df = DataFrame(
           col1 = [1, 2, 3],
           col2 = [4, 5, 6]
       );
julia> idx_row, idx_col = sample.(
           [1:size(df, 1), 1:size(df, 2)],
           [length(1:size(df, 1)), length(1:size(df, 2))],
           replace=false
       )
2-element Vector{Vector{Int64}}:
 [1, 2, 3]
 [2, 1]
julia> idx_row, idx_col = sample.(
           [1:size(df, 1), 1:size(df, 2)],
           [length(1:size(df, 1)), length(1:size(df, 2))],
           replace=false
       )
2-element Vector{Vector{Int64}}:
 [2, 1, 3]
 [2, 1]
As you can see, the values are shuffled, but the call to Random.seed! doesn't seem to be taken into account. How can I shuffle the rows and columns of a DataFrame in a reproducible way, using a specific seed?
Fortunately, you already imported a helpful package named Random. However, you overlooked its shuffle function. Everything can be achieved with shuffle and a random number generator created from a fixed seed. The result is reproducible and won't change between runs, despite being a random process.
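The answer's original snippet did not survive here, so the following is only a minimal sketch of the approach, not necessarily the code the answer used. It assumes a fresh MersenneTwister is created from the seed on every call; the helper name shuffle_rows_cols is hypothetical. Creating the generator inside the function matters: in the question the seed was set once, and each later call to sample advanced the generator's state, which is why the second call returned a different permutation.

```julia
using Random, DataFrames

df = DataFrame(col1 = [1, 2, 3], col2 = [4, 5, 6])

# Hypothetical helper (not part of any package): shuffle rows and columns
# of `df` using a generator freshly seeded with `seed`.
function shuffle_rows_cols(df::AbstractDataFrame, seed::Integer)
    rng = MersenneTwister(seed)          # same seed => same starting state
    idx_row = shuffle(rng, 1:nrow(df))   # permutation of the row indices
    idx_col = shuffle(rng, 1:ncol(df))   # permutation of the column indices
    return df[idx_row, idx_col]          # new DataFrame in shuffled order
end

a = shuffle_rows_cols(df, 123)
b = shuffle_rows_cols(df, 123)
a == b   # true: both calls start from the same generator state
```

Calling Random.seed!(123) immediately before each sample call in the question should also make those calls repeat, but passing an explicit generator keeps the reproducibility local to the shuffle.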
Additional point
Note that there is a specialized method of the shuffle function for shuffling the rows of a given DataFrame. Keep in mind that this only shuffles the rows; the column order is left unchanged.
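A short sketch of that row-only method, assuming a DataFrames.jl version that defines shuffle for data frames (1.3 or later, as far as I know) and again using a freshly seeded MersenneTwister for reproducibility:

```julia
using Random, DataFrames

df = DataFrame(col1 = [1, 2, 3], col2 = [4, 5, 6])

# shuffle(rng, df) returns a new DataFrame with the rows permuted;
# the columns keep their original order.
s1 = shuffle(MersenneTwister(123), df)
s2 = shuffle(MersenneTwister(123), df)

s1 == s2                  # true: same seed, same row order
names(s1) == names(df)    # true: column order is untouched
```

If the columns need shuffling as well, the result can still be indexed with a shuffled column range, for example s1[:, shuffle(MersenneTwister(123), 1:ncol(s1))].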