I am trying to detect anomalous values in a time series of climatic data with some missing observations. Searching the web I found many available approaches. Of those, stl decomposition seems appealing, in the sense of removing trend and seasonal components and studying the remainder. Reading STL: A Seasonal-Trend Decomposition Procedure Based on Loess, stl appears to be flexible in determining the settings for assigning variability, unaffected by outliers and possible to apply despite missing values. However, trying to apply it in R, with four years of observations and defining all the parameters according to http://stat.ethz.ch/R-manual/R-patched/library/stats/html/stl.html , I encounter error:
time series contains internal NAs
when na.action = na.omit
, and
series is not periodic or has less than two periods
when na.action = na.exclude
.
I have double checked that the frequency is correctly defined. I have seen relevant questions in blogs, but didn't find any suggestion that could solve this. Is it not possible to apply stl in a series with missing values? I am very reluctant to interpolate them, as I do not want to be introducing (and consequently detecting...) artifacts. For the same reason, I do not know how advisable it would be to use ARIMA approaches instead (and if missing values would still be a problem).
Please share if you know a way to apply stl in a series with missing values, or if you believe my choices are methodologically not sound, or if you have any better suggestion. I am quite new in the field and overwhelmed by the heaps of (seemingly...) relevant information.
In the beginning of
stl
we findand soon after that
That is,
stl
expectsx
to bets
object afterna.action(as.ts(x))
(otherwiseperiod == 1
). Let us checkna.omit
andna.exclude
first.Clearly, at the end of
getAnywhere("na.omit.ts")
we findwhich is straightforward and nothing can be done (
na.omit
does not excludeNAs
fromts
objects). NowgetAnywhere("na.exclude.default")
excludesNA
observations, but returns an object of classexclude
:and this is a different situation. As mentioned above,
stl
expectsna.action(as.ts(x))
to bets
, butna.exclude(as.ts(x))
is of classexclude
.Hence if one is satisfied with
NAs
exclusion then e.g.works. In general,
stl
does not work withNA
values (i.e. withna.action = na.pass
), it crashes deeper in Fortran (see full source code here):Alternatives to
na.new
are not delightful:na.contaguous
- finds the longest consecutive stretch of non-missing values in a time series object.na.approx
,na.locf
fromzoo
or some other interpolation function.As we can see in the paper, there is no some simple procedure for missing values (like approximating them in the very beginning) which could be applied to the time series before calling
stl
. So considering the fact that original implementation is quite lengthy I would think about some other alternatives than whole new implementation.Update: a quite optimal in many aspects choice when having
NAs
could bena.approx
fromzoo
, so let us check its performance, i.e. compare results ofstl
with full data set and results when having some number ofNAs
, usingna.approx
. I am using MAPE as a measure of accuracy, but only for trend, because seasonal component and remainder crosses zero and it would distort the result. Positions forNAs
are chosen at random.nottem
, 240 observationsco2
, 468 observationsmdeaths
, 72 observationsVisually we do see some differences in trend in cases 1 and 3. But these differences are pretty small in 1 and also satisfactory in 3 considering sample size (72).