Is it possible to split a dataset in Google Dataprep? If so, how?

454 Views Asked by At

I've been looking into Google Dataprep as an ETL solution to perform some basic data transformation before feeding it to a machine learning platform. I'm wondering if it's possible to use the Dataprep/Dataflow tools to split a dataset into train, test, and validation sets. Ideally I'm looking to do a stratified split on a target column, but for starters I'd settle for a simple uniform random split by percent of whole (e.g. 50% train, 30% validation, 20% test).

So far I haven't been able to find anything about whether this is even possible with Dataprep, so I'm wondering if anyone knows definitively if this is possible and, if so, how to accomplish it.

EDIT 1

Thanks @jakub-janoštík for getting me going in the right direction! I modified your answer slightly and came up with the following (in wrangle form):

case condition: customConditions cases: [false,0] default: rand() as: 'split_condition'
case condition: customConditions cases: [split_condition < 0.6,'train'],[split_condition >= 0.8,'test'] default: 'validation' as: 'dataset_type'
drop col: split_condition action: Drop

By assigning random values in a separate step, I got the guaranteed percentage split I was looking for. The flow ended up looking like this:

Image: final flow diagram with dataset splitting

EDIT 2

I just figured out how to do the stratified split too, so I thought I'd add it in case anyone else is trying to do this. Here's the rough steps:

  1. Split your dataset based on whatever subpopulations you're targeting (e.g. target0, target1)
  2. For each subpopulation, do the uniform random split described above (e.g. now you have target0-train, target0-test, target0-validation, target1-train, etc.)
  3. For each set type (i.e. train, test, validation):
    • Create a new recipe from one of the sets
    • Edit the recipe, and use the Union transform to merge it with other datasets of the same type (e.g. target0-train union with target1-train). The union button is in the middle of the toolbar on the Edit Recipe page.

I hope that's helpful to someone!

1

There are 1 best solutions below

0
On BEST ANSWER

I'm looking at the same problem and I was able to partially solve this using "case on custom condition" and "Random" functions. What I do is that I create new column named target and apply following logic:

enter image description here

After applying this you'll have new column with these 3 new labels and you can generate 3 new datasets by applying row filtering rules based on those values. Thing to keep in mind is that each time you'll run the job you'll get different validation set. So if you want to keep it fixed you need to use the dataset created in first run as input for future runs (and randomise only train and test sets).

If you need more control on the distribution of labels in your datasets there is ROWNUMBER window function that could potentially be used. But I haven't been able to make it work yet.