Sql cartesian product (summing with group by)

1.3k Views Asked by At

I am trying to calculate the sum of volume for the last thirty days for a set of stocks on particular days in the table important_stock_dates. The table all_stock_dates contains the same stocks but with trading volume for all dates, not just the particular days.

Sample data

all_stock_dates

stockid, date, volume  
0231245, 20060314, 153  
0231245, 20060315, 154  
2135411, 20060314, 23  

important_stock_dates

stockid, date, thirtydaysprior  
0231245, 20060314, 20060130  
0231245, 20060315, 20060201  
2135411, 20060314, 20060130  

My code

create table sum_trading_volume as
select a.stockid, a.date, sum(b.volume) as thirty_day_volume
from important_stock_dates a, all_stock_dates b
where b.date<a.date AND b.date ge a.thirtydaysprior
group by a.stockid, a.date;

Desired outcome

A table with all the observations from important_stock_dates that also has the sum of the volume from the previous 30 days based on matching stockid and dates in all_stock_dates.

Problem

The problem I'm running into is that important_stock_dates has 15 million observations and all_stock_dates has 350 million. It uses up a few hundred gigabytes of swap file running this code (maxes out the hard drive) then aborts. I can't see how to optimize the code. I couldn't find a similar problem on StackOverflow or Google.

1

There are 1 best solutions below

3
On BEST ANSWER

Presumably, the query that you want joins on stockid:

create table sum_trading_volume as
    select isd.stockid, isd.date, sum(asd.volume) as thirty_day_volume
    from important_stock_dates isd join
         all_stock_dates asd
         on isd.stockid = asd.stockid and
            asd.date < isd.date and asd.date >= isd.thirtydaysprior
    group by isd.stockid, isd.date;

If this worked, it will probably run to completion.