Drawing a random sample from a very large dataset


I have a CSV dataset with 160 million rows that I cannot load directly with pandas because it does not fit in RAM. How can I draw a 5% random sample from the original dataset (in this case, roughly 8 million rows)? Any insight is appreciated. Cheers, Marcelo

I have tried reading the file in chunks, but it did not work.

Answer by John Zwinck
  1. Determine the total number of rows N in the CSV, if you don't already know it. You only need to do this once; store the result somewhere for repeated use.
  2. Generate k distinct random numbers in [0, N), where k is the sample size (here about 0.05 × N). See https://stackoverflow.com/a/77513347/4323
  3. Sort the random numbers.
  4. Read the CSV in a single pass. For each random number, skip lines until you reach that row, then keep it. For faster line skipping, see "Skip first couple of lines while reading lines in Python file".
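
The four steps above can be sketched roughly as follows. This is only an illustration under some assumptions, not a tuned implementation: the function names `count_data_rows` and `write_sample` are made up for this example, the file is assumed to have a single header line, and every record is assumed to occupy exactly one physical line (no quoted fields containing newlines). The sampled rows are written straight to an output file so they never have to fit in memory at once.

```python
import random
from itertools import islice


def count_data_rows(path, has_header=True):
    """Step 1: count the data rows (physical lines minus the header)."""
    with open(path, "rb") as f:
        n_lines = sum(1 for _ in f)
    return n_lines - 1 if has_header else n_lines


def write_sample(src, dst, k, n_rows, has_header=True, seed=None):
    """Steps 2-4: pick k distinct row indices, sort them, copy those rows in one pass."""
    rng = random.Random(seed)
    # Steps 2 and 3: k distinct indices in [0, n_rows), in ascending order.
    targets = sorted(rng.sample(range(n_rows), k))

    written = 0
    # newline="" keeps the original line endings untouched.
    with open(src, "r", newline="") as fin, open(dst, "w", newline="") as fout:
        if has_header:
            fout.write(next(fin))  # copy the header through unchanged
        pos = 0  # index of the next data row that fin will yield
        for t in targets:
            # Step 4: skip (t - pos) rows, then take the row at index t.
            line = next(islice(fin, t - pos, None))
            fout.write(line)
            pos = t + 1
            written += 1
    return written
```

For the file in the question, a call might look like `n = count_data_rows("file.csv")` followed by `write_sample("file.csv", "sample.csv", k=n // 20, n_rows=n)`, where both file names are placeholders. The sorted index list for 8 million rows still takes a few hundred megabytes as Python integers, which is usually far less than the full dataset.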

If that's not fast enough, here's another approach:

import subprocess
sample = subprocess.check_output(['shuf', '-n', '8000000', 'file.csv'])

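It uses shuf (https://man7.org/linux/man-pages/man1/shuf.1.html), which is probably faster than what you'll end up with in Python alone. If the CSV has a header row, note that shuf treats it like any other line, so it may land anywhere in the sample or be left out.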