Pandas calculating sales for recurring monthly payments

1.1k Views Asked by At

I have a dataset with millions of records just like below

CustomerID StartTime EndTime
1111 2015-7-10 2016-3-7
1112 2016-1-5 2016-1-19
1113 2015-10-18 2020-9-1

This dataset contains the information for different subscription contracts and it is assumed that:

  • if the contract is active then the customer will need to pay a monthly fee in advance. The first payment will be collected on the start date.
  • If the contract ends before the next payment date, which is exactly one month after the last payment date, the customer does not need to pay the next subscription. For instance, customer 1112 only needs to pay once.
  • monthly payment fee is $10

In this situation, I need to calculate the monthly/quarterly/annual sales between 2015 and 2020. It is ideal to also show the breakdown of sales by different customer IDs so that subsequent machine learning tasks can be performed.

1

There are 1 best solutions below

0
Rawson On

Importing data (I saved your table as a .csv in Excel, which is the reason for the specific formatting of the pd.to_datetime):

import pandas as pd
import numpy as np

df = pd.read_csv("Data.csv", header=0)
# convert columns "to_datetime"
df["StartTime"] = pd.to_datetime(df["StartTime"], format="%d/%m/%Y")
df["EndTime"] = pd.to_datetime(df["EndTime"], format="%d/%m/%Y")

Calculate the number of months between the start and end dates (+1 at the end because there will be a payment even if the contract is not active for a whole month, because it is in advance):

df["Months"] = ((df["EndTime"] - df["StartTime"])/np.timedelta64(1, 'M')).astype(int) + 1

Generate a list of payment dates (from the start date, for the given number of months, one month apart). The pd.tseries.offsets.DateOffset(months=1) will ensure that the payment date is on the same day every month, rather than the default end-of-month if freq="M".

df["PaymentDates"] = df.apply(lambda x: list(pd.date_range(start=x["StartTime"], periods=x["Months"], freq=pd.tseries.offsets.DateOffset(months=1)).date), axis=1)

Create a new row for each payment date, add a payment column of 10, then pivot so that the CustomerID is the column, and the date is the row:

df = df.explode("PaymentDates").reset_index(drop=True)
df["PaymentDates"] = pd.to_datetime(df["PaymentDates"])
df["Payment"] = 10
df = pd.pivot_table(df, index="PaymentDates", columns="CustomerID", values="Payment")

Aggregate for month, quarter, and year sales (this will be an aggregation for each individual CustomerID. You can then sum by row to get a total amount:

months = df.groupby([df.index.year, df.index.month]).sum()
quarters = df.groupby([df.index.year, df.index.quarter]).sum()
years = df.groupby(df.index.year).sum()
# total sales
months["TotalSales"] = months.sum(axis=1)
quarters["TotalSales"] = quarters.sum(axis=1)
years["TotalSales"] = years.sum(axis=1)

I realise this may be slow for the df.apply if you have millions of records, and there may be other ways to complete this, but this is what I have thought of.

You will also have a lot of columns if there are many millions of customers, but this way you will keep all the CustomerID values separate and be able to know which customers made payments in a given month.

  • After the number of months is calculated in df["Months"], you could then multiply this by 10 to get the number of sales for each customer.
  • If this is the only data you need for each customer individually, you would not need to pivot the data at all, just aggregate on the "PaymentDates" column, count the number of rows and multiply by 10 to get the sales for month, quarter, year.