I am trying to reproduce the results of a research paper whose dataset is available on the Hugging Face Hub. The paper splits the dataset into training and testing sets with pandas, using a fixed random seed. Here is the code snippet from the paper:
```python
import pandas as pd
train_size = 0.8
train_dataset = new_df.sample(frac=train_size, random_state=200)
test_dataset = new_df.drop(train_dataset.index).reset_index(drop=True)
train_dataset = train_dataset.reset_index(drop=True)
```
However, since I cannot call the pandas method directly on a DatasetDict, I tried to split the dataset with the `datasets` library instead, passing the same seed. Unfortunately, this produced a different split. Here is the code for my approach:
```python
ds = dataset["train"].train_test_split(test_size=0.2, seed=200, shuffle=False)
```
Could you please suggest a way to split the dataset that yields the same training and testing sets as in the paper?
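For reference, here is a minimal sketch of the direction I am considering: recompute the row positions that the paper's pandas code would pick, then use them to index the Hub dataset. As far as I understand, `DataFrame.sample` chooses row positions from the number of rows and the seed only, not from the cell contents, so a stand-in frame of the same length should give the same positions. The helper name `paper_split_indices` is my own, and the sketch assumes that `new_df` had a default 0..n-1 index and the same rows, in the same order, as `dataset["train"]`. Neither assumption is confirmed by the paper.

```python
import pandas as pd


def paper_split_indices(n_rows, train_size=0.8, seed=200):
    """Return (train_positions, test_positions) as the paper's pandas split would pick them.

    Assumption: the paper's new_df had a default RangeIndex and n_rows rows.
    """
    # Stand-in frame: only its length and default index matter for sampling.
    positions = pd.DataFrame({"pos": range(n_rows)})
    # Same call as in the paper: sample 80% of the rows with the fixed seed.
    train = positions.sample(frac=train_size, random_state=seed)
    # The test set is everything not sampled, kept in the original row order.
    test = positions.drop(train.index)
    return train["pos"].tolist(), test["pos"].tolist()


# Toy example with 10 rows; for the real data, n_rows would be len(dataset["train"]).
train_idx, test_idx = paper_split_indices(10)
```

With these lists, `dataset["train"].select(train_idx)` and `dataset["train"].select(test_idx)` should give the two subsets as `Dataset` objects. If the authors filtered or reordered rows before building `new_df`, the positions would no longer line up, so I would still appreciate confirmation that this is a sound approach.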