I'm using the following code from the SDV library to create a synthetic dataset that's the same shape as my original dataset. While each synthetic dataset is different from the original dataset, all synthetic datasets are identical to each other. I would have thought there would be some randomness built into the synthetic data generation process so that each output would be slightly different. This occurs across sessions even when I set a different random seed. How should I interpret what's happening?
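```python
from sdv.lite import SingleTablePreset
from sdv.metadata import SingleTableMetadata

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=input_data)
synthesizer = SingleTablePreset(metadata=metadata, name='FAST_ML')
synthesizer.fit(data=input_data)
synthetic_data = synthesizer.sample(num_rows=len(input_data))
```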
I believe SDV synthesizers set an internal seed when they run, which explains the determinism you're seeing. This is expected behavior.
If you want different data, you can call the `sample` method multiple times on the same fitted synthesizer. Every subsequent call should give you different data. For example, if you call `sample` three times in a row, all 3 samples of synthetic data will be different.

For more info, see the sampling docs, particularly the `reset_sampling` method to get back to the initial state.

BTW the team is always looking for feedback. For supporting more randomization options, you can file a feature request directly on GitHub.
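
To make the behavior described above concrete, here's a toy sketch in plain Python. It is not SDV's code: `ToySynthesizer`, `_FIXED_SEED`, and `run_session` are hypothetical names made up for illustration. The only assumption is that the synthesizer keeps a private generator with a fixed seed, that each `sample` call advances it, and that `reset_sampling` rewinds it.

```python
import random


class ToySynthesizer:
    """Illustrative stand-in, NOT SDV's implementation.

    Assumption: the synthesizer owns a private random generator that is
    seeded with a fixed value, and every call to sample() advances it.
    """

    _FIXED_SEED = 0  # hypothetical constant, chosen only for this sketch

    def __init__(self):
        self._values = []
        self._rng = random.Random(self._FIXED_SEED)

    def fit(self, values):
        # "Learn" the data by storing it; a real model would estimate distributions.
        self._values = list(values)
        self.reset_sampling()

    def sample(self, num_rows):
        # Draws come from the private generator, so the global seed has no effect here.
        return [self._rng.choice(self._values) for _ in range(num_rows)]

    def reset_sampling(self):
        # Rewind the private generator to its initial state.
        self._rng = random.Random(self._FIXED_SEED)


def run_session(global_seed):
    """Simulate a fresh session that sets its own global seed before sampling."""
    random.seed(global_seed)  # does not touch the synthesizer's private generator
    synthesizer = ToySynthesizer()
    synthesizer.fit(range(1000))
    # Three consecutive samples from the same fitted synthesizer.
    return [synthesizer.sample(num_rows=5) for _ in range(3)]


session_a = run_session(global_seed=1)
session_b = run_session(global_seed=2)
```

In this sketch, `session_a` and `session_b` come out identical even though the global seeds differ, while the three samples inside each session differ from one another. That mirrors what you're seeing: the first sample of every fresh run is the same, and variety only shows up across repeated `sample` calls.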