How to randomly map a proportion of data values to a specific category?

I have the dataset below, which shows whether a customer is a returning customer. For all returning customers, I need to map about 25% of them to 'yes 1 purchase' and the other 75% to 'yes >1 purchase'. I also need to set a seed so that the result does not change each time I re-run the process.

I looked into NumPy's random functions and its random seed function, but they seem to generate random numbers rather than randomly assign a proportion of data values to a specific category. Can anyone advise on how to do this?

import pandas as pd
import numpy as np

list_customer_name = ['customer1','customer2','customer3','customer4','customer5',
'customer6','customer7','customer8','customer9','customer10','customer11','customer12',
'customer13','customer14','customer15','customer16','customer17','customer18']
list_return_customer = ['yes','yes','yes','yes','yes','yes',
'yes','yes','yes','yes','yes','yes','yes','yes',
'yes','yes','no','no']

df_test = pd.DataFrame({'customer_name': list_customer_name,
                    'return_customer?':list_return_customer})

The desired output looks like this: 25% of the customers flagged "yes" in the "return_customer?" column (4 customers) are mapped to "yes 1 purchase", and the remaining 75% (12 customers) are mapped to "yes >1 purchase".

Best answer

The following code seems to match your specifications:

import random

import pandas as pd

random.seed(1234)

list_customer_name = ['customer1','customer2','customer3','customer4','customer5',
'customer6','customer7','customer8','customer9','customer10','customer11','customer12',
'customer13','customer14','customer15','customer16','customer17','customer18']

list_return_customer = ['yes','yes','yes','yes','yes','yes',
'yes','yes','yes','yes','yes','yes','yes','yes',
'yes','yes','no','no']

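# Start by mapping every "yes" to "yes >1 purchase"; "no" stays "no".
list_return_customer_final = ["yes >1 purchase" if status == "yes" else "no" for status in list_return_customer]

# Number of returning customers to relabel as "yes 1 purchase" (25% of the 16 "yes" rows).
number_of_yes_1_purchase = 4

# Draw random indices until enough "yes >1 purchase" entries have been relabeled.
while number_of_yes_1_purchase > 0:
    rand_index = random.randint(0, len(list_return_customer_final) - 1)
    # Skip entries that are already relabeled or that are not returning customers.
    if list_return_customer_final[rand_index] in ("yes 1 purchase", "no"):
        continue
    list_return_customer_final[rand_index] = "yes 1 purchase"
    number_of_yes_1_purchase -= 1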

df_test = pd.DataFrame({'customer_name': list_customer_name,
                        'return_customer?':list_return_customer,
                        'return_customer_final': list_return_customer_final})

print(df_test)

Explanations:

I used the random module and set the seed to an arbitrary value with random.seed(1234). Setting the seed makes the random functions produce the same sequence of values every time the program runs.
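To see what the seed does, here is a small sketch (the variable names first_run and second_run are only for illustration). Re-seeding with the same value before each batch of draws should give identical results:

```python
import random

# Seed, then draw five random indices in the range 0..17.
random.seed(1234)
first_run = [random.randint(0, 17) for _ in range(5)]

# Re-seed with the same value and draw again.
random.seed(1234)
second_run = [random.randint(0, 17) for _ in range(5)]

# Both batches are identical because the generator restarted from the same state.
print(first_run == second_run)  # True
```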

I defined the number of "yes 1 purchase" labels to allocate with the variable number_of_yes_1_purchase. You can hardcode it or compute it from the number of "yes" entries in list_return_customer (but remember to round the result to get an int).

With the while loop, I loop until I have allocated all of the "yes 1 purchase" labels; each time I allocate one, I decrease the remaining number by one with number_of_yes_1_purchase -= 1.
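If you would rather compute the number than hardcode it, something like the following should work. This is only a sketch: proportion_one_purchase and number_of_yes are names I made up for the example, and the list is written in a shortened form with the same contents as in the question.

```python
# Same contents as the question's list: 16 "yes" followed by 2 "no".
list_return_customer = ['yes'] * 16 + ['no'] * 2

# Assumed target share of returning customers with exactly one purchase.
proportion_one_purchase = 0.25

# Count only the returning customers, not the whole list.
number_of_yes = list_return_customer.count('yes')

# round() returns an int when called with a single argument.
number_of_yes_1_purchase = round(proportion_one_purchase * number_of_yes)

print(number_of_yes_1_purchase)  # 4
```

Note that Python's round() rounds exact halves to the nearest even integer, so use math.floor or math.ceil instead if you need a specific rounding direction.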

I used rand_index = random.randint(0, len(list_return_customer_final) - 1) to get a random index of the list to set to "yes 1 purchase". If this index is already a "yes 1 purchase" or a "no", I skip the current iteration with continue.

The loop ends when number_of_yes_1_purchase reaches 0.
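As an alternative to retrying random indices, you could pick all the indices in one step with random.sample, which draws without replacement. This is a sketch of a different technique, not the code above; the names rng, yes_indices, k and chosen are my own, and the list is a shortened form of the one in the question.

```python
import random

# A dedicated generator with a fixed seed keeps the result reproducible.
rng = random.Random(1234)

# Same contents as the question's list: 16 "yes" followed by 2 "no".
list_return_customer = ['yes'] * 16 + ['no'] * 2

# Only the positions of returning customers are eligible.
yes_indices = [i for i, status in enumerate(list_return_customer) if status == 'yes']

# 25% of the returning customers, rounded to an int.
k = round(0.25 * len(yes_indices))

# Draw k distinct eligible positions; no retries are needed.
chosen = set(rng.sample(yes_indices, k))

list_return_customer_final = [
    'no' if status != 'yes'
    else ('yes 1 purchase' if i in chosen else 'yes >1 purchase')
    for i, status in enumerate(list_return_customer)
]

print(list_return_customer_final.count('yes 1 purchase'))   # 4
print(list_return_customer_final.count('yes >1 purchase'))  # 12
```

Because only eligible positions are sampled, this version cannot loop forever if the requested number is larger than the number of "yes" rows; random.sample raises a ValueError in that case instead.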


If you have any questions, don't hesitate to ask.