I'm stuck.
I have a dataframe where rows are created at the time a customer quotes cost of a product.
My (truncated) data:
import pandas as pd
d = {'Quote Date': pd.to_datetime(['3/10/2016', '3/10/2016', '3/10/2016',
'3/10/2016', '3/11/2017']),
'Customer Name': ['Alice', 'Alice', 'Bob', 'Frank', 'Frank']
}
df = pd.DataFrame(data=d)
I want to check, for each row, if this is the first interaction I have had with this customer in over a year. My thought is to check each row's customer name against the customer name in the preceding years worth of rows. If a row's customer name is not in the previous year subset, then I will append a True value to the new column:
df['Is New']
In practice, the dataframe's shape will be close to (150000000, 5) and I fear adding a calculated column will not scale well.
I also thought to create a multi-index with the date and then customer name, but I was not sure how to execute the necessary search with this indexing.
Please apply any method you believe would be more efficient at checking for the first instance of a customer in the preceding year.
Here is the first approach that came to mind. I don't expect it to scale that well to 150M rows, but give it a try. Also, your truncated data does not produce a very interesting output, so I created some test data in which some users are new, and some are not: