Pandas Cumulative Sum of Difference Between Value Counts in Two Dataframe Columns

Question

Pandas Cumulative Sum of Difference Between Value Counts in Two Dataframe Columns

2.7k Views Asked by SystemTheory At 04 April 2025 at 01:34

The charts below show my basic challenge: subtract NUMBER OF STOCKS WITH DATA END from NUMBER OF STOCKS WITH DATA START. The challenge I am having is that the date range for each series does not match so I need to merge both sets to a common date range, perform the subtraction, and save results to a new comma seperated value file.

Input data in file named 'meta.csv' contains 3187 lines. Fields per line are data for ticker, start, & end. Head and tail as shown here:

0000 ticker,start,end
0001 A,1999-11-18,2016-12-27
0002 AA,2016-11-01,2016-12-27
0003 AAL,2005-09-27,2016-12-27
0004 AAMC,2012-12-13,2016-12-27
0005 AAN,1984-09-07,2016-12-27
...
3183 ZNGA,2011-12-16,2016-12-27
3184 ZOES,2014-04-11,2016-12-27
3185 ZQK,1990-03-26,2015-09-09
3186 ZTS,2013-02-01,2016-12-27
3187 ZUMZ,2005-05-06,2016-12-27

Python code and console output:

import pandas as pd

df = pd.read_csv('meta.csv')
s = df.groupby('start').size().cumsum()
e = df.groupby('end').size().cumsum()
#s.plot(title='NUMBER OF STOCKS WITH DATA START',
#       grid=True,style='k.')
#e.plot(title='NUMBER OF STOCKS WITH DATA END',
#       grid=True,style='k.')

print(s.head(5))
print(s.tail(5))
print(e.tail(5))

OUT:
start
1962-01-02    11
1962-11-19    12
1970-01-02    30
1971-08-06    31
1972-06-01    54
dtype: int64
start
2016-07-05    3182
2016-10-04    3183
2016-11-01    3184
2016-12-05    3185
2016-12-08    3186
end
2016-12-08     544
2016-12-15     545
2016-12-16     546
2016-12-21     547
2016-12-27    3186
dtype: int64

Chart output when comments removed for code shown above:

I want to create one population file with the date and number of stocks with active data which should have a head and tail shown as follows:

date,num_stocks
1962-01-02,11
1962-11-19,12
1970-01-02,30
1971-08-06,31
1972-06-01,54
...
2016-12-08,2642
2016-12-15,2641
2016-12-16,2640
2016-12-21,2639
2016-12-27,2639

The ultimate goal is to be able to plot the number of stocks with data over any specified date range by reading the population file.

Original Q&A

There are 2 best solutions below

piRSquared On 01 January 2017 at 00:28

To align the dates with their respective counts. I'd take the difference of pd.Series.value_counts

df.start.value_counts().sub(df.end.value_counts(), fill_value=0)

1984-09-07    1.0
1990-03-26    1.0
1999-11-18    1.0
2005-05-06    1.0
2005-09-27    1.0
2011-12-16    1.0
2012-12-13    1.0
2013-02-01    1.0
2014-04-11    1.0
2015-09-09   -1.0
2016-11-01    1.0
2016-12-27   -9.0
dtype: float64

**SystemTheory** · Accepted Answer

Thanks to the crucial tip provided by piRSquared I solved the challenge using this code:

import pandas as pd

df = pd.read_csv('meta.csv')
x = df.start.value_counts().sub(df.end.value_counts(), fill_value=0)
x.iloc[-1] = 0
r = x.cumsum()
r.to_csv('pop.csv')
z = pd.read_csv('pop.csv', index_col=0, header=None)
z.plot(title='NUMBER OF STOCKS WITH DATA',legend=None,
       grid=True,style='k.')

'pop.csv' file head/tail:

1962-01-02  11.0
1962-11-19  12.0
1970-01-02  30.0
1971-08-06  31.0
1972-06-01  54.0
...
2016-12-08  2642.0
2016-12-15  2641.0
2016-12-16  2640.0
2016-12-21  2639.0
2016-12-27  2639.0

Chart:

Pandas Cumulative Sum of Difference Between Value Counts in Two Dataframe Columns

There are 2 best solutions below

Related Questions in PYTHON-3.X

Related Questions in PANDAS

Related Questions in DATAFRAME

Related Questions in SERIES

Trending Questions

Popular # Hahtags

Popular Questions