This is sort of a mathy question...
I had a question prior to this about normalizing monthly data here: How to produce X values of a stretched graph?
I got a good answer and it works well; the only issue is that I now need to check the X values of a month with 31 days against the X values of a month with 28.
So my question would be: If I have two sets of parameters like so:
    x  | y        x2   | y2
    1  | 10       1.0  | 10
    2  | 9        1.81 | 9.2
    3  | 8        2.63 | 8.6
    4  | 7        3.45 | 7.8
    5  | 6        4.27 | 7
    6  | 5        5.09 | 6.2
    7  | 4        5.91 | 5.4
    8  | 3        6.73 | 4.2
    9  | 2        7.55 | 3.4
    10 | 1        8.36 | 2.6
                  9.18 | 1.8
                  10.0 | 1.0
As you can see, the general trend is the same for these two data sets. However, if I run these values through a cross-correlation function (the general goal), I will get something back that does not reflect this, since the data sets are of two different sizes.
The real world example of this would be, say, if you are tracking how many miles you run per day:
In February (with 28 days), during the first week, you run one mile each day. During the second week, you run two miles each day, etc.
In March (with 31 days), you do the same thing, but run for one mile for eight days, two miles for eight days, three miles for eight days, and four miles for seven days.
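To make the mismatch concrete, here is a minimal sketch of those two months as Groovy lists. The names `febRuns` and `marchRuns` are made up for illustration; the values just follow the description above.

```groovy
// Miles run per day, built from the description above (illustrative names).
def febRuns   = [1] * 7 + [2] * 7 + [3] * 7 + [4] * 7   // 28 days
def marchRuns = [1] * 8 + [2] * 8 + [3] * 8 + [4] * 7   // 31 days

println febRuns.size()     // number of samples in February
println marchRuns.size()   // number of samples in March

// Same overall trend, but the same day index falls in different "weeks":
println "day 8: Feb=${febRuns[7]}, March=${marchRuns[7]}"
```

Comparing the lists index by index therefore pairs up days that sit at different points of the trend, and the extra March days have no partner at all.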
The correlation coefficient according to the following function should be almost exactly 1:
    class CrossCorrelator {
        // mean was not shown in the original post; assumed to be the arithmetic mean
        def mean = { x -> x.sum() / x.size() }

        def variance = { x ->
            def v = 0
            x.each { v += it**2 }
            v / (x.size()) - (mean(x)**2)
        }
        def covariance = { x, y ->
            def z = 0
            [x, y].transpose().each { z += it[0] * it[1] }
            (z / (x.size())) - (mean(x) * mean(y))
        }
        def coefficient = { x, y ->
            covariance(x, y) / (Math.sqrt(variance(x) * variance(y)))
        }
    }
    def i = new CrossCorrelator()
    i.coefficient(yValues, y2Values)   // the y and y2 columns from the table above
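For reference, the closures above appear to compute the population form of Pearson's correlation coefficient, which, for two lists of the same length n, can be written as:

```latex
r = \frac{\operatorname{cov}(x, y)}{\sqrt{\operatorname{var}(x)\,\operatorname{var}(y)}},
\qquad
\operatorname{cov}(x, y) = \frac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{x}\,\bar{y},
\qquad
\operatorname{var}(x) = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2
```

The formula only makes sense when both lists have n entries. Groovy's `transpose()` stops at the shortest list, so with 10 and 12 values the sum of products uses 10 pairs while the mean and variance of the longer list still use all 12 values.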
Just looking at the data sets, it seems like the graphs would be exactly the same if I were to grab the values at 1, 2, 3, 4, 5, 6, 7, 8, 9, and 10, and the function would produce a more accurate result.
However, it's skewed since the lengths are not the same.
Is there some way to locate what the values at the integers in the twelve-value data set would be? I haven't found a simple way to do it, but this would be incredibly helpful.
Thanks in advance,
5
Edit: As per request, here is the code that generates the X values of the graphs:
    def x = (1..12)
    def y = 10
    change = { l, size ->
        v = [1]
        l.each {
            v << ((((size - 1) / (x.size() - 1)) * it) + 1)
        }
        v -= v.last()
        return v
    }
    change(x, y)
Edit: Not working code as per another request:
    def normalize( xylist, days ) {
        xylist.collect { x, y -> [ x * ( days / xylist.size() ), y ] }
    }

    def change = { l, size ->
        def v = [1]
        l.each {
            v << ((((size-1)/(l.size() - 1)) * it) + 1)
        }
        v -= v.last()
        return v
    }

    def resample( list, min, max ) {
        // We want a graph with integer points from min to max on the x axis
        (min..max).collect { i ->
            // find the values above and below this point
            bounds = list.inject( [ a:null, b:null ] ) { r, p ->
                // if the value is less than i, set it in r.a
                if( p[ 0 ] < i )
                    r.a = p
                // if it's bigger (and we don't already have a bigger point)
                // then set it into r.b
                if( !r.b && p[ 0 ] >= i )
                    r.b = p
                r
            }
            // so now, bounds.a is the point below our required point, and bounds.b
            // Deal with the first case (where a is null, because we are at the start)
            if( !bounds.a )
                [ i, list[ 0 ][ 1 ] ]
            else {
                // so work out the distance from bounds.a to bounds.b
                dist = ( bounds.b[0] - bounds.a[0] )
                // And how far the point i is along this line
                r = ( i - bounds.a[0] ) / dist
                // and recalculate the y figure for this point
                y = ( ( bounds.b[1] - bounds.a[1] ) * r ) + bounds.a[1]
                [ i, y ]
            }
        }
    }

    def feb = [9, 3, 7, 23, 15, 16, 17, 18, 19, 13, 14, 8, 13, 12, 15, 6, 7, 13, 19, 12, 7, 3, 4, 15, 6, 17, 8, 19]
    def march = [8, 12, 4, 17, 11, 15, 12, 8, 9, 13, 12, 7, 3, 4, 8, 2, 17, 19, 21, 12, 12, 13, 14, 15, 16, 7, 8, 19, 21, 14, 16]

    //X and Y Values for February
    z = [(1..28), change(feb, 28)].transpose()
    //X and Y Values for March stretched to 28 entries
    o = [(1..31), change(march, 28)].transpose()
    o1 = normalize(o, 28)
    resample(o1, 1, 28)
If I replace march in the declaration of o with (1..31), the script runs successfully. When I use march itself, I get "java.lang.NullPointerException: Cannot invoke method getAt() on null object".
Also: I try not to directly copy code just because it's bad practice, so one of the functions I changed basically does the same thing, it's just my version. I'll get around to refactoring the rest of it eventually, too. But that's why it's slightly different.
Ok...here we go...this may not be the cleanest bit of code ever...
Let's first generate two distributions, both from 1 to 10 (in the y axis)
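The original snippet for this step is not shown here, so the following is only a minimal sketch of one way to build two such distributions. The helper name `linearSeries` and the choice of 10 and 12 sample points (taken from the question's tables) are assumptions.

```groovy
// Hypothetical helper: `count` evenly spaced [x, y] points, with x = 1..count
// and y running linearly from `from` to `to`.
def linearSeries = { int count, double from, double to ->
    (0..<count).collect { i -> [i + 1, from + (to - from) * i / (count - 1)] }
}

def e1 = linearSeries(10, 1.0d, 10.0d)   // 10 sample points
def e2 = linearSeries(12, 1.0d, 10.0d)   // 12 sample points

// Print the y values rounded to two decimal places
println e1.collect { it[1].round(2) }
println e2.collect { it[1].round(2) }
```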
So now, e1 and e2 are:
respectively (to 2dp). Now, using the code from the previous question, we can normalize these to the same x range:
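As a rough sketch of this step, the `normalize` method quoted in the question can be applied to both lists. The version below is an adaptation rather than the original: it divides by a double so that the last x value lands exactly on the target, and the sample lists `e1` and `e2` are placeholders built inline so the snippet runs on its own.

```groovy
// Adapted from the normalize method quoted in the question: scale each x so
// that the last sample index maps onto `days`.
def normalize(List xylist, int days) {
    xylist.collect { x, y -> [x * days / (double) xylist.size(), y] }
}

// Placeholder data: y runs linearly from 1 to 10 over 10 and 12 points.
def e1 = (1..10).collect { [it, 1 + 9 * (it - 1) / 9.0d] }
def e2 = (1..12).collect { [it, 1 + 9 * (it - 1) / 11.0d] }

def n1 = normalize(e1, 10)
def n2 = normalize(e2, 10)

println n1*.getAt(0)   // x values of the 10-point list
println n2*.getAt(0)   // x values of the 12-point list, squeezed into the same range
```

Both lists now end at x = 10, but `n2` still has 12 points at non-integer x positions.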
This means n1 and n2 are:
But, as you correctly state they have different numbers of sample points, so cannot be compared easily.
But we can write a method to step through each point we want in our graph, find the two closest points, and interpolate a y value from the values of these two points, like so:
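The method described above is plain linear interpolation. The sketch below is one possible version, not the original code: it assumes the points are [x, y] pairs sorted by x, and it also falls back to the nearest end point when the requested x lies before the first or after the last sample, which is the case where an unguarded lookup would hit a null bound.

```groovy
// Sketch: linearly interpolate [x, y] points (sorted by x) at every integer x
// from min to max. Outside the sampled range, the nearest end value is reused.
def resample(List points, int min, int max) {
    (min..max).collect { i ->
        def below = points.findAll { it[0] <= i }   // samples at or left of i
        def above = points.findAll { it[0] >= i }   // samples at or right of i
        if (!below) return [i, points.first()[1]]   // i is before the first sample
        if (!above) return [i, points.last()[1]]    // i is past the last sample
        def a = below.last()
        def b = above.first()
        if (a[0] == b[0]) return [i, a[1]]          // i coincides with a sample
        double r = (i - a[0]) / (b[0] - a[0])       // fraction of the way from a to b
        [i, a[1] + (b[1] - a[1]) * r]
    }
}

def pts = [[0.5d, 10.0d], [1.5d, 20.0d], [2.5d, 30.0d]]
println resample(pts, 0, 3)
```

For `pts`, x = 1 lies halfway between 0.5 and 1.5, so its y value is halfway between 10 and 20; x = 0 and x = 3 fall outside the samples and take the end values.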
Now, the values final1 and final2 are: (obviously, there is some rounding here, so 2 d.p. is hiding the fact that they are not exactly the same)
Phew... Must be home-time after that ;-)
EDIT
As pointed out in the edit to the question, there was a bug in my resample method that caused it to fail in certain conditions... I believe this has now been fixed in the code above, and from the given example:
If you plot the original 31 points (in o) and the new graph of 28 points (in v), you get:
[Plot: original 31 points (o) and resampled 28 points (v)]
Which doesn't look too bad.
I have no idea what the change method was supposed to do, so I have omitted it from this code.