How to check each time-series entry if name/id is in previous years entries?

59 Views Asked by At

I'm stuck.

I have a dataframe where rows are created at the time a customer quotes cost of a product.

My (truncated) data:

import pandas as pd

d = {'Quote Date': pd.to_datetime(['3/10/2016', '3/10/2016', '3/10/2016', 
                                   '3/10/2016', '3/11/2017']),
     'Customer Name': ['Alice', 'Alice', 'Bob', 'Frank', 'Frank']
    }

df = pd.DataFrame(data=d)

I want to check, for each row, if this is the first interaction I have had with this customer in over a year. My thought is to check each row's customer name against the customer name in the preceding years worth of rows. If a row's customer name is not in the previous year subset, then I will append a True value to the new column:

df['Is New']   

In practice, the dataframe's shape will be close to (150000000, 5) and I fear adding a calculated column will not scale well.

I also thought to create a multi-index with the date and then customer name, but I was not sure how to execute the necessary search with this indexing.

Please apply any method you believe would be more efficient at checking for the first instance of a customer in the preceding year.

1

There are 1 best solutions below

1
On BEST ANSWER

Here is the first approach that came to mind. I don't expect it to scale that well to 150M rows, but give it a try. Also, your truncated data does not produce a very interesting output, so I created some test data in which some users are new, and some are not:

# Create example data
d = {'Quote Date': pd.to_datetime(['3/10/2016', 
                                   '3/10/2016', 
                                   '6/25/2016', 
                                   '1/1/2017', 
                                   '6/25/2017', 
                                   '9/29/2017']),
     'Customer Name': ['Alice', 'Bob', 'Alice', 'Frank', 'Bob', 'Frank']
    }
df = pd.DataFrame(d)
df.set_index('Quote Date', inplace=True)

# Solution
day = pd.DateOffset(days=1)
is_new = [s['Customer Name'] not in df.loc[i - 365*day:i-day]['Customer Name'].values
          for i, s in df.iterrows()]
df['Is New'] = is_new
df.reset_index(inplace=True)

# Result
df
  Quote Date Customer Name  Is New
0 2016-03-10         Alice    True
1 2016-03-10           Bob    True
2 2016-06-25         Alice   False
3 2017-01-01         Frank    True
4 2017-06-25           Bob    True
5 2017-09-29         Frank   False