Sample data set in Koalas

I have the code below, which uses a pandas DataFrame. However, when I convert the pandas DataFrame to Koalas and run it, I get the error "Function sample currently does not support specifying exact number of items to return. Use frac instead".

df.loc[df.sample(int(len(df) * .05)).index, 'distance'] = None

I tried the code below, which gives me a random 5% of the records. But how do I keep all records in the DataFrame and replace the distance with a null value for 5% of them?

df.sample(frac=0.05, random_state=1)

There is 1 best solution below

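If you just want to keep the distance value for about 5% of the records, you can use when with a rand random number. Because when has no otherwise clause, rows that fail the condition get null, so this nulls out the other 95% or so. To null out only about 5% of the records, as the question asks, use F.rand(0) >= 0.05 as the condition instead: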

import pyspark.sql.functions as F

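# withColumn is a Spark DataFrame method, so convert the Koalas DataFrame with to_spark() first (assumes df is a Koalas DataFrame)
df2 = df.to_spark().withColumn('distance', F.when(F.rand(0) < 0.05, F.col('distance')))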
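
Note that a condition like F.rand(0) < 0.05 makes an independent decision for every row, so the number of affected rows is only approximately 5%, unlike the pandas version, which picks an exact count. A minimal sketch of that difference, using only Python's standard random module rather than Spark or Koalas (the helper names bernoulli_mask and exact_mask are made up for illustration):

```python
import random

def bernoulli_mask(n_rows, frac, seed):
    """Per-row coin flip: each row is selected independently with probability frac."""
    rng = random.Random(seed)
    return [rng.random() < frac for _ in range(n_rows)]

def exact_mask(n_rows, frac, seed):
    """Select exactly int(n_rows * frac) distinct rows."""
    rng = random.Random(seed)
    chosen = set(rng.sample(range(n_rows), int(n_rows * frac)))
    return [i in chosen for i in range(n_rows)]

n_rows = 1000
approx = bernoulli_mask(n_rows, 0.05, seed=0)
exact = exact_mask(n_rows, 0.05, seed=0)

print(sum(approx))  # roughly 5% of the rows; the exact count depends on the seed
print(sum(exact))   # exactly int(n_rows * 0.05) rows
```

If an approximate share is acceptable, the per-row approach is usually the simpler fit for a distributed DataFrame, since it does not need to know the total row count up front.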

If you want to stick with Koalas rather than Spark, you can do this:

import numpy as np

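# replace=False draws distinct row labels; this assumes the default 0..n-1 integer index
df.loc[np.random.choice(df.shape[0], int(df.shape[0] * 0.05), replace=False).tolist(), 'distance'] = None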
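
One thing to watch: np.random.choice draws with replacement unless replace=False is passed, so the same row can be picked more than once and fewer than 5% of the rows may end up nulled. A small sketch of the effect, using the standard random module as a stand-in for NumPy (random.choices samples with replacement, random.sample without):

```python
import random

rng = random.Random(0)
n_rows = 1000
k = int(n_rows * 0.05)  # target number of rows to null out

# With replacement: the same position may be drawn more than once.
with_replacement = rng.choices(range(n_rows), k=k)

# Without replacement: all k positions are distinct.
without_replacement = rng.sample(range(n_rows), k)

print(len(set(with_replacement)))     # at most k distinct positions
print(len(set(without_replacement)))  # exactly k distinct positions
```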