I am trying to use Earthmover's distance in R via the emdist
package to find the distance between two distributions with unlike numbers of observations. My understanding is that EMD can be used for different size distributions, and indeed this is the purpose to find the distance when the distributions have different parameters.
However, I have tried to run multiple codes, and in all cases I receive the error:
Error in emd2d(sample, sample2, dist = "manhattan") :
A and B must be matrices of the same dimensions
My codes is the following:
library(emdist)
sample = structure(c(6, 4, 5, 6, 4, 0, 5, 6, 3, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1), dim = c(10L, 2L), dimnames = list(NULL, c("variable",
"weight")))
sample2 = structure(c(0.25, 0.75, 0, 0.75, 0.75, 0.75, 0, 0, 0.75, 0.25,
0, 0.25, 0.75, 0.75, 0.25, 0.25, 0, 0.75, 0, 0.75, 0.75, 0.75,
0.25, 0.75, 1, 0.25, 0.75, 0.25, 0.25, 0.25, 0.25, 0.75, 0, 0.75,
0.75, 1, 0.75, 0.75, 0.75, 0.75, 0.75, 0.75, 0.25, 0.25, 0.25,
0, 1, 0.25, 0.25, 0.75, 0.75, 0.25, 0.25, 1, 0, 0, 0.25, 0.25,
0.25, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), dim = c(59L,
2L), dimnames = list(NULL, c("variable", "weight")))
emd2d(sample,sample2)
which produces the error:
Error in emd2d(sample, sample2, dist = "manhattan") :
A and B must be matrices of the same dimensions
I tried transposing the matrices so that they match the example from the package vignette:
emd2d(t(sample),t(sample2))
but of course this does not make the dimensions the same, so it doesn't change anything.
I also tried using the emd
and emdr
functions instead, but this consistently returns 0 as the distance, which I don't think can be correct, as I have changed the values multiple times but do not get a different output:
emd(sample)
emdr(sample2)
Am I programming the EMD wrongly? It is difficult to tell from the package manual how to apply it to distributions with different observation numbers. Can this be done with the emdist package? Am I misunderstanding what the matrices' structures represent?
If helpful, the weights included in the sample code are not important; I do not need them, my understanding was that these must be specified in the EMD syntax, so I included them all as the same. I only intend to calculate EMD between sample's "variable" and sample2's "variable".
As I understand, your distributions are uniformly distributed (
weight=1
) on thevariable
column.You can use the kantorovich package, up to a problem... I am the author and I've just realized my code throws an error when the two distributions do not have the same lengths, which is stupid... So for the meantime (I will fix that), you can copy the code of
kantorovich_CVXR
and comment the lines throwing this error:And you can get the distance as follows:
Edit: using 'emdist' package
You just have to switch the two columns:
But there's a problem: this does not yield the same result as kantorovich... I'm working on it.
Edit
I get the same result as kantorovich if I use
emdw
. I believe there's a bug inemd
.