Most common term in a vector - PARI/GP

203 Views Asked by At

I feel like I'm being really stupid here as I would have thought there's a simple command already in Pari, or it should be a simple thing to write up, but I simply cannot figure this out.

Given a vector, say V, which will have duplicate entries, how can one determine what the most common entry is?

For example, say we have: V = [ 0, 1, 2, 2, 3, 4, 6, 8, 8, 8 ]

I want something which would return the value 8.

I'm aware of things like vecsearch, but I can't see how that can be tweaked to make this work?


Very closely related to this, I want this result to return the most common non-zero entry, and some vectors I look at will have 0 as the most common entry. Eg: V = [ 0, 0, 0, 0, 3, 3, 5 ]. So whatever I execute here I would like to return 3. I tried writing up something which would remove all zero terms, but again struggled.

The thing I have tried in particular is:

rem( v ) = {
my( c );
while( c = vecsearch( v, 0 ); #c, v = vecextract( v, "^c" ) ); v
}

but vecextract doesn't seem to like this set up.

2

There are 2 best solutions below

2
On BEST ANSWER

If you can ensure all the elements are within the some fixed range then it is enough just to do the counting sorting with PARI/GP code like this:

counts_for(v: t_VEC, lower: t_INT, upper: t_INT) = {
    my(counts = vector(1+upper-lower));

    for(i=1, #v, counts[1+v[i]-lower]++);
    vector(#counts, i, [i-1, counts[i]])
};

V1 = [0, 1, 2, 2, 3, 4, 6, 8, 8, 8];
vecsort(counts_for(V1, 0, 8), [2], 4)[1][1]
> 8

V2 = [0, 0, 0, 0, 3, 3, 5];
vecsort(counts_for(V2, 0, 5), [2], 4)[1][1]
> 0

You also can implement the following short-cut for the sake of convenience:

counts_for1(v: t_VEC) = {
    counts_for(v, vecmin(v), vecmax(v))
};

most_frequent(v: t_VEC) = {
    my(counts=counts_for1(v));
    vecsort(counts, [2], 4)[1][1]
};

most_frequent(V1)
> 8

most_frequent(V2)
> 0
0
On

The function matreduce provides this in a more general setting: applied to a vector of objects, it returns a 2-column matrix whose first column contains the distinct objects and the second their multiplicity in the vector. (The function has a more general form that takes the union of multisets.)

 most_frequent(v) = my(M = matreduce(v), [n] = matsize(M)); M[n, 1];

 most_frequent_non0(v) = 
 { my(M = matreduce(v), [n] = matsize(M), x = M[n, 1]);
   if (x == 0, M[n - 1, 1], x);
 }
 
 ? most_frequent([ 0, 1, 2, 2, 3, 4, 6, 8, 8, 8 ])
 %1 = 8
 ? most_frequent([x, x, Mod(1,3), [], [], []])
 %2 = []
 ? most_frequent_non0([ 0, 0, 0, 0, 3, 3, 5 ])
 %3 = 5
 ? most_frequent_non0([x, x, Mod(1,3), [], [], []])
 %4 = x

The first function will error out if fed an empty vector, and the second one if there are no non-zero entries. The second function tests for "0" using the x == 0 test (and we famously have [] == 0 in GP); for a more rigorous semantic, use x === 0 in the function definition.