I have a dataset and it has some missing values. It is a time series dataset so I want to update the missing values with the value from the previous day.
I created a view that contains the lag of the value of the two previous days. See below:
| sale_date | daily_op | lag_op1 | lag_op2 |
|---|---|---|---|
| 2013-01-03 | 92.97 | 93.14 | null |
| 2013-01-04 | 93.12 | 92.97 | 93.14 |
| 2013-01-05 | null | 93.12 | 92.97 |
My main table contains 3 million records and I want to use the view with the lagged op variables to impute the missing values in my main table.
I ran this code:
```sql
update wholetest
set daily_op = (select lag_op1 from lag_op where lag_op.sale_date = wholetest.sale_date)
where daily_op is null;
```
It was taking a while, so I stepped out; when I came back it was still running. It has now been running for more than an hour, so I don't think it is working correctly, even allowing for the number of records.
I tried adding `limit 1` after `wholetest.sale_date`, but that did not seem to help. (The view with the lagged variables has one row per distinct date.) Please let me know what I am doing wrong with my code.
Is there a unique index (or PK) on `wholetest.sale_date`? If not, the subquery can return multiple rows, and you may get arbitrary results, at an added cost proportional to the number of duplicates. In other words: expensive nonsense.

With said unique index, the query should be fast; even a table with 3 million rows should only take a couple of seconds. (We can assume that most rows don't need an update?!) Unless something is wrong with your undisclosed view. Either way, you shouldn't need a view for this; you can compute the lagged value directly in the `UPDATE`.
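A minimal sketch of that direct update, assuming the table is `wholetest(sale_date, daily_op)` with a unique index on `sale_date` (exact names taken from the question, everything else is an assumption):

```sql
-- Compute the previous day's value with lag() directly; no helper view needed.
-- The NOT NULL check keeps us from overwriting a NULL with another NULL.
UPDATE wholetest w
SET    daily_op = prev.daily_op
FROM  (
   SELECT sale_date
        , lag(daily_op) OVER (ORDER BY sale_date) AS daily_op
   FROM   wholetest
   ) prev
WHERE  w.daily_op IS NULL
AND    prev.sale_date = w.sale_date
AND    prev.daily_op IS NOT NULL;
```

The derived table is computed once in a single scan, rather than running one subquery per NULL row as in the original statement.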
If you have consecutive entries with NULL values, you need to do more, since the previous row is NULL as well.
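One way to handle runs of consecutive NULLs is a correlated subquery that fetches the latest earlier non-NULL value. A sketch, under the same table assumptions as above; note it re-runs the subquery for every NULL row, so an index on `sale_date` matters:

```sql
-- Fill each NULL with the most recent non-NULL value before it,
-- however many NULL days precede it.
UPDATE wholetest w
SET    daily_op = (
   SELECT t.daily_op
   FROM   wholetest t
   WHERE  t.sale_date < w.sale_date
   AND    t.daily_op IS NOT NULL
   ORDER  BY t.sale_date DESC
   LIMIT  1
   )
WHERE  w.daily_op IS NULL;
```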
The only other explanation (besides breakage, or serious contention for storage, RAM and/or CPU) would be locking issues, typically concurrent writes to the same tables in long-running transactions. Check the system view `pg_stat_activity` in this case.
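Something like the following query against `pg_stat_activity` can surface long-running or waiting sessions (the `wait_event` columns assume PostgreSQL 9.6 or later):

```sql
-- List non-idle sessions, oldest transaction first, to spot
-- a long-running writer or a session stuck waiting on a lock.
SELECT pid
     , state
     , wait_event_type
     , wait_event
     , now() - xact_start AS xact_age
     , query
FROM   pg_stat_activity
WHERE  state <> 'idle'
ORDER  BY xact_start;
```

A non-null `wait_event_type` of `Lock` on your `UPDATE`'s pid would point to blocking rather than a slow plan.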