I have a dataset with millions of records just like below
| CustomerID | StartTime | EndTime |
|---|---|---|
| 1111 | 2015-7-10 | 2016-3-7 |
| 1112 | 2016-1-5 | 2016-1-19 |
| 1113 | 2015-10-18 | 2020-9-1 |
This dataset contains the information for different subscription contracts and it is assumed that:
- if the contract is active then the customer will need to pay a monthly fee in advance. The first payment will be collected on the start date.
- If the contract ends before the next payment date, which is exactly one month after the last payment date, the customer does not need to pay the next subscription. For instance, customer 1112 only needs to pay once.
- monthly payment fee is $10
In this situation, I need to calculate the monthly/quarterly/annual sales between 2015 and 2020. It is ideal to also show the breakdown of sales by different customer IDs so that subsequent machine learning tasks can be performed.
Importing data (I saved your table as a .csv in Excel, which is the reason for the specific formatting of the
pd.to_datetime):Calculate the number of months between the start and end dates (+1 at the end because there will be a payment even if the contract is not active for a whole month, because it is in advance):
Generate a list of payment dates (from the start date, for the given number of months, one month apart). The
pd.tseries.offsets.DateOffset(months=1)will ensure that the payment date is on the same day every month, rather than the default end-of-month iffreq="M".Create a new row for each payment date, add a payment column of 10, then pivot so that the
CustomerIDis the column, and the date is the row:Aggregate for month, quarter, and year sales (this will be an aggregation for each individual
CustomerID. You can then sum by row to get a total amount:I realise this may be slow for the
df.applyif you have millions of records, and there may be other ways to complete this, but this is what I have thought of.You will also have a lot of columns if there are many millions of customers, but this way you will keep all the
CustomerIDvalues separate and be able to know which customers made payments in a given month.df["Months"], you could then multiply this by 10 to get the number of sales for each customer.