I am trying to calculate the sum of volume for the last thirty days for a set of stocks on particular days in the table important_stock_dates
. The table all_stock_dates
contains the same stocks but with trading volume for all dates, not just the particular days.
Sample data
all_stock_dates
stockid, date, volume
0231245, 20060314, 153
0231245, 20060315, 154
2135411, 20060314, 23
important_stock_dates
stockid, date, thirtydaysprior
0231245, 20060314, 20060130
0231245, 20060315, 20060201
2135411, 20060314, 20060130
My code
create table sum_trading_volume as
select a.stockid, a.date, sum(b.volume) as thirty_day_volume
from important_stock_dates a, all_stock_dates b
where b.date<a.date AND b.date ge a.thirtydaysprior
group by a.stockid, a.date;
Desired outcome
A table with all the observations from important_stock_dates
that also has the sum of the volume from the previous 30 days based on matching stockid and dates in all_stock_dates
.
Problem
The problem I'm running into is that important_stock_dates
has 15 million observations and all_stock_dates
has 350 million. It uses up a few hundred gigabytes of swap file running this code (maxes out the hard drive) then aborts. I can't see how to optimize the code. I couldn't find a similar problem on StackOverflow or Google.
Presumably, the query that you want joins on
stockid
:If this worked, it will probably run to completion.