I'm trying to detect patterns from open-high-low-close (OHLC) data, so here is what I did:
- Find local minima and maxima on the dataset
- Normalize my data by converting the array of local minima and maxima to an array of numbers, where every number is the variation from the previous point.
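For reference, the two steps above might look roughly like the sketch below. It assumes that "variation" means the percent change between consecutive extrema and that extrema are taken on a single price series; the function name `extrema_variations` and the sample prices are made up for illustration.

```python
import numpy as np

def extrema_variations(close):
    """Percent change between consecutive local extrema of `close` (assumed definition)."""
    close = np.asarray(close, dtype=float)
    # Sign of the slope between neighbouring samples.
    slope = np.sign(np.diff(close))
    # A local extremum sits where the slope changes sign (flat runs are ignored here).
    turning = np.where(slope[:-1] * slope[1:] < 0)[0] + 1
    extrema = close[turning]
    # Variation of each extremum relative to the previous one, in percent.
    return np.diff(extrema) / extrema[:-1] * 100.0

prices = [100.0, 110.0, 104.5, 120.0, 108.0]  # made-up closing prices
print(extrema_variations(prices))
```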
So far everything works, but I got stuck on the next part. I defined an array of data, a pattern, which has a certain shape when plotted on a chart. I'm now trying to find shapes similar to that pattern in other datasets.
Here is the pattern specified by me:
Pattern = [7.602339181286544, 3.5054347826086927, -5.198214754528746, 4.7078371642204315, -2.9357312880190425, 2.098092643051778, -0.5337603416066172]
And here is a sample dataset:
SampleTarget = [-2.2538552787663173, -3.00364077669902, 2.533625273694082, -2.2574740695546116, 3.027465667915112, 6.4222962738564, -2.647309991460278, 7.602339181286544, 3.5054347826086927, -5.198214754528746, 4.7078371642204315, -2.9357312880190425, 2.098092643051778, -0.5337603416066172, 4.212503353903944, -2.600411946446969, 8.511763150938416, -3.775883069427527, 1.8227848101265856, 3.6300348085529524, -1.4635316698656395, 5.527148770392016, -1.476695892939546, 12.248243559718961, -4.443980805341117, 1.9213973799126631, -9.061696658097686, 5.347467608951697, -2.8622540250447197, 2.6012891344383067]
I'm looking for a way to detect when a series of values similar to `Pattern` shows up at some point in `SampleTarget`.
In this case, for example, I need to detect that there is a part of `SampleTarget` where the values are similar to `Pattern`, since it's the same dataset from which I extracted `Pattern`.
What I tried:
It has been suggested that I use `numpy.correlate`, `python-dtw` (dynamic time warping), or `stumpy`, but the problem I ran into with those is the lack of practical examples for this particular task.
Here is a trick to do it:
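A minimal sketch of a sliding-window comparison along these lines, assuming the `Pattern` and `SampleTarget` values from the question, `np.lib.stride_tricks.as_strided` for the windows, and `np.isclose` for the comparison. The names `pattern`, `data`, and `match` are chosen here for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

pattern = np.array([7.602339181286544, 3.5054347826086927, -5.198214754528746,
                    4.7078371642204315, -2.9357312880190425, 2.098092643051778,
                    -0.5337603416066172])
data = np.array([-2.2538552787663173, -3.00364077669902, 2.533625273694082,
                 -2.2574740695546116, 3.027465667915112, 6.4222962738564,
                 -2.647309991460278, 7.602339181286544, 3.5054347826086927,
                 -5.198214754528746, 4.7078371642204315, -2.9357312880190425,
                 2.098092643051778, -0.5337603416066172, 4.212503353903944,
                 -2.600411946446969, 8.511763150938416, -3.775883069427527,
                 1.8227848101265856, 3.6300348085529524, -1.4635316698656395,
                 5.527148770392016, -1.476695892939546, 12.248243559718961,
                 -4.443980805341117, 1.9213973799126631, -9.061696658097686,
                 5.347467608951697, -2.8622540250447197, 2.6012891344383067])

m = len(pattern)
n = len(data)
step = data.strides[0]  # bytes between consecutive elements (8 for float64)

# Row i of data2d is the window data[i:i+m]; this is a read-only view, not a copy.
data2d = as_strided(data, shape=(n - m + 1, m), strides=(step, step),
                    writeable=False)

# A window matches when every element is close to the pattern element.
match = np.all(np.isclose(data2d, pattern), axis=1)
print(match)  # True only at index 7, where the pattern was taken from
```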
Output:
You can use `np.where` or `np.argwhere` to get the index of the match(es). You can tune the `atol` and `rtol` parameters of `np.isclose` to set the threshold for an approximate match.

Clarification: if you do the `as_strided` trick on `data=np.arange(30)`, then `data2d` will have one window per row: with a window length of 7 (the length of the pattern), the rows are `[0, 1, ..., 6]`, `[1, 2, ..., 7]`, and so on up to `[23, 24, ..., 29]`.

EDIT: This is an efficient way to create a view of the same data with a sliding window, without requiring extra memory. A numpy array lookup `a[i, j]` finds the memory address as `start_address + a.strides[0]*i + a.strides[1]*j`; by setting strides to `(8, 8)`, where 8 is the size in bytes of a float value, you achieve the sliding-window effect. Because different array elements refer to the same memory, it's best to treat an array constructed this way as read-only.

EDIT: if you want to have a "score" metric for the quality of the match, you can for example do this:
Closer to zero means a better match. Here, `norm` takes the length of the difference vector `d=data-pat`, i.e., `sqrt(d[0]**2 + ... + d[m-1]**2)`.

EDIT: If you are interested in patterns that have the same shape, but are scaled to a larger or smaller value, you can do this:
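A sketch of a per-window least-squares scale fit, under stated assumptions: `cofs` is taken to be the best-fit scale factor for each window and `ssqr` the sum of squared residuals after scaling. The `data_mod` built here is a hypothetical stand-in that only rescales the window at index 7 by 1.1; it is not the modified dataset used for the results quoted in this answer.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

pat = np.array([7.602339181286544, 3.5054347826086927, -5.198214754528746,
                4.7078371642204315, -2.9357312880190425, 2.098092643051778,
                -0.5337603416066172])
data = np.array([-2.2538552787663173, -3.00364077669902, 2.533625273694082,
                 -2.2574740695546116, 3.027465667915112, 6.4222962738564,
                 -2.647309991460278, 7.602339181286544, 3.5054347826086927,
                 -5.198214754528746, 4.7078371642204315, -2.9357312880190425,
                 2.098092643051778, -0.5337603416066172, 4.212503353903944,
                 -2.600411946446969, 8.511763150938416, -3.775883069427527,
                 1.8227848101265856, 3.6300348085529524, -1.4635316698656395,
                 5.527148770392016, -1.476695892939546, 12.248243559718961,
                 -4.443980805341117, 1.9213973799126631, -9.061696658097686,
                 5.347467608951697, -2.8622540250447197, 2.6012891344383067])

# Hypothetical modified data: the embedded pattern scaled by 1.1.
data_mod = data.copy()
data_mod[7:14] *= 1.1

m = len(pat)
step = data_mod.strides[0]
data2d = as_strided(data_mod, shape=(len(data_mod) - m + 1, m),
                    strides=(step, step), writeable=False)

# Best-fit scale c for each window w minimises |w - c*pat|^2: c = (w . pat) / (pat . pat)
cofs = data2d @ pat / (pat @ pat)
# Sum of squared residuals left over after applying that scale.
ssqr = np.sum((data2d - cofs[:, None] * pat) ** 2, axis=1)

print(cofs[7], ssqr[7])  # scale close to 1.1, residual close to 0
```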
Result:
You see that `cofs[7] == 1.1`, meaning that the pattern had to be scaled by a factor 1.1 on the corresponding data window for a best fit. The fit was perfect, which you can see from `ssqr[7] == 0`. It also finds the other one, with `cofs[16] == 0.52` (close to the expected 0.5 value) and `ssqr[16] == 0.7`.

Other example: `cofs[21]==-0.91` and `ssqr[21]==235.1`. This means that `data_mod[21:28]` somewhat resembles the pattern, but inverted (positive and negative swapped). It depends on what you want to do with the data; most likely you'd like to look at `cofs` values in the range 0.5 to 2: your search pattern is allowed to occur in the data a factor 2 larger or smaller. This should be combined with sufficiently small `ssqr` values.

Here you see the three potential matches in a graph:
If you use `ssqr` as a score metric, be aware that a series of zeros in the input will result in `cofs=0` and `ssqr=0`.

Consider using `np.sqrt(ssqr/m)/np.abs(cofs)` as a metric instead, for two reasons. (1) It will match according to relative error and result in `NaN` values in the case of zero input. (2) It is more intuitive; if the value is 0.5, it means that the data points deviate by about 0.5 from the pattern values. Here are the values for this metric, using the same example data:

For the match at `data_mod[21:28]`, the difference metric is 6.4, which corresponds roughly to the differences as seen in the plot.
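The relative metric can be sketched as follows. As assumptions for illustration only, `data_mod` here is the question's data with the window at index 7 scaled by 1.1 and seven zeros appended to show the `NaN` case; it is not the dataset behind the 6.4 value quoted above.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

pat = np.array([7.602339181286544, 3.5054347826086927, -5.198214754528746,
                4.7078371642204315, -2.9357312880190425, 2.098092643051778,
                -0.5337603416066172])
data = np.array([-2.2538552787663173, -3.00364077669902, 2.533625273694082,
                 -2.2574740695546116, 3.027465667915112, 6.4222962738564,
                 -2.647309991460278, 7.602339181286544, 3.5054347826086927,
                 -5.198214754528746, 4.7078371642204315, -2.9357312880190425,
                 2.098092643051778, -0.5337603416066172, 4.212503353903944,
                 -2.600411946446969, 8.511763150938416, -3.775883069427527,
                 1.8227848101265856, 3.6300348085529524, -1.4635316698656395,
                 5.527148770392016, -1.476695892939546, 12.248243559718961,
                 -4.443980805341117, 1.9213973799126631, -9.061696658097686,
                 5.347467608951697, -2.8622540250447197, 2.6012891344383067])

m = len(pat)

# Hypothetical test data: scaled pattern at index 7, then a run of zeros.
data_mod = data.copy()
data_mod[7:14] *= 1.1
data_mod = np.concatenate([data_mod, np.zeros(m)])

step = data_mod.strides[0]
data2d = as_strided(data_mod, shape=(len(data_mod) - m + 1, m),
                    strides=(step, step), writeable=False)

cofs = data2d @ pat / (pat @ pat)
ssqr = np.sum((data2d - cofs[:, None] * pat) ** 2, axis=1)

# RMS residual per point, divided by the fitted scale; 0/0 on the all-zero window gives NaN.
with np.errstate(invalid="ignore", divide="ignore"):
    metric = np.sqrt(ssqr / m) / np.abs(cofs)

print(int(np.nanargmin(metric)))  # index of the best non-NaN match
```

`np.nanargmin` skips the `NaN` entries, so an all-zero window cannot be reported as the best match.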