Reproducing a pandas random-seed split on a Hugging Face dataset


I am trying to reproduce the results of a research paper whose dataset is available on the Hugging Face Hub. The paper splits the dataset into training and test sets with pandas, using a fixed random seed. Here is the code snippet from the paper:

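import pandas as pd
# new_df is the paper's full dataset as a pandas DataFrame
train_size = 0.8
# Randomly sample 80% of the rows for training, with a fixed seed
train_dataset = new_df.sample(frac=train_size, random_state=200)
# The remaining rows (not sampled above) form the test set
test_dataset = new_df.drop(train_dataset.index).reset_index(drop=True)
train_dataset = train_dataset.reset_index(drop=True)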

However, since I cannot call the pandas methods directly on a DatasetDict, I tried to split the dataset with the datasets library instead, using the same seed. Unfortunately, this produced a different split. Here is my code:

ds = dataset["train"].train_test_split(test_size=0.2, seed=200, shuffle=False)

Could you please suggest a way to split the dataset that produces exactly the same training and test sets as in the paper?
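For reference, here is a minimal sketch of one possible approach, not a confirmed solution. As far as I understand, shuffle=False makes train_test_split ignore the seed and cut the rows in order, and even with shuffling the datasets library draws its permutation differently from pandas, so the same seed would not be expected to give the same rows. The sketch instead lets pandas pick the row positions and leaves only the selection to datasets. It assumes that the rows on the Hub are in the same order as in the paper's new_df, that new_df had a default integer index, and that the pandas version samples the same way as the one the authors used. The helper name paper_split_positions is hypothetical.

```python
import pandas as pd


def paper_split_positions(n_rows, train_size=0.8, seed=200):
    """Return (train_positions, test_positions) for a frame with n_rows rows.

    Mirrors the paper's sample(...) / drop(...) logic on a Series whose
    values equal its own row positions, so the sampled values are positions.
    """
    positions = pd.Series(range(n_rows))
    # Same call as in the paper: the sampled order is the training order.
    train = positions.sample(frac=train_size, random_state=seed)
    # Rows not sampled keep their original order, as with DataFrame.drop.
    test = positions.drop(train.index)
    return train.tolist(), test.tolist()


# Assumed usage with the DatasetDict from the question (not run here):
# train_idx, test_idx = paper_split_positions(len(dataset["train"]))
# train_ds = dataset["train"].select(train_idx)
# test_ds = dataset["train"].select(test_idx)
```

An alternative under the same assumptions would be to convert the split with dataset["train"].to_pandas(), run the paper's snippet unchanged, and convert the two frames back with Dataset.from_pandas.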
