I want to apply the Kruskal-Wallis test to every numeric column of a Polars dataframe and return a new dataframe in which each column holds the KW result for that column.

My dataframe, which has a large number of rows and columns, looks like this:

import polars as pl

df = pl.DataFrame({"Group": ["A", "B", "A", "A", "B"], "col1": [3.0, 3.2, None, 1.0, 2.3], "col2": [4.0, 5.4, 3.2, 1.5, 2.3]})

Group  col1  col2
A      3.0   4.0
B      3.2   5.4
A      null  3.2
A      1.0   1.5
B      2.3   2.3

I want to apply KW such that I get this:

Group col1 col2
[A,B] {'Hstats': value1, 'p_val': value2} {'Hstats': value3, 'p_val': value4}

value1 to value4 are the results of the Kruskal-Wallis test.
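For reference, here is a minimal pure-Python sketch of what those values are, assuming the groups being compared are A and B within each column. The function name kruskal_h is made up for illustration and is not part of scipy or polars. It ranks the pooled values, sums the ranks per group and applies the usual tie correction. The p-value line uses a shortcut that only holds for exactly two groups (chi-square with one degree of freedom).

```python
import math

def kruskal_h(*samples):
    """Kruskal-Wallis H statistic with tie correction; None values are dropped."""
    groups = [[x for x in s if x is not None] for s in samples]
    pooled = sorted(x for g in groups for x in g)
    n = len(pooled)

    # Assign each distinct value the average of the 1-based ranks it occupies.
    rank = {}
    counts = {}
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        counts[pooled[i]] = j - i
        i = j

    # H = 12 / (N (N + 1)) * sum(R_i^2 / n_i) - 3 (N + 1)
    rank_term = sum(sum(rank[x] for x in g) ** 2 / len(g) for g in groups)
    h = 12 / (n * (n + 1)) * rank_term - 3 * (n + 1)

    # Tie correction: divide by 1 - sum(t^3 - t) / (N^3 - N).
    ties = sum(t ** 3 - t for t in counts.values())
    return h / (1 - ties / (n ** 3 - n))

# Values from the example dataframe, split by Group (A first, then B).
h_col1 = kruskal_h([3.0, None, 1.0], [3.2, 2.3])  # approximately 0.6
h_col2 = kruskal_h([4.0, 3.2, 1.5], [5.4, 2.3])   # approximately 0.333

# Two groups only: chi-square survival function with 1 degree of freedom.
p_col1 = math.erfc(math.sqrt(h_col1 / 2))
p_col2 = math.erfc(math.sqrt(h_col2 / 2))
```

With more than two groups the p-value needs the chi-square distribution with k - 1 degrees of freedom, which is what scipy computes.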

I have tried the following, where I iterate over every column to get the output. But it is too tedious, and the computation time may become long when the dataframe is large.

from scipy.stats import mstats

def kwa(arr):
    # arr holds one list of values per group for a single column
    try:
        H, p = mstats.kruskalwallis(arr.to_list())
        return {'Hstats': H, 'p_val': p}
    except Exception:
        return {'Hstats': 'NA', 'p_val': 'NA'}

def calckwa(df):
    # collect each column's values into one list per group
    dft = df.groupby('Group', maintain_order=True).agg(pl.all())
    trendcols = dft.get_columns()[1:]  # skip the 'Group' column
    return {trend.name: kwa(trend) for trend in trendcols}

kwa_dict = pl.from_dict(calckwa(df))

I also tried to follow this answer, where it is done with Pandas. However, when I tried the same with Polars as shown below, I got a PanicException.

df_grp = df.groupby('Group')
newdf = df_grp.apply(lambda grp: grp.drop('Group').apply(kwa))

This throws: PanicException: BindingsError: "Could not determine output type". I get the same error with any simple Polars dataframe that has a few numeric columns and a string "Group" column.

So, how can I apply Kruskal-Wallis to all the columns of a Polars dataframe without iterating over them and without converting to a Pandas dataframe (if the dataframe is already large, the conversion itself will take a long time)?

Dean MacGregor On

The simplest thing is to just use pl.reduce like this:

from scipy.stats.mstats import kruskalwallis
df.select(pl.reduce(kruskalwallis, ('col1','col2')))

You can do it in a groupby().agg context, as you'd expect:

df.groupby("Group").agg(pl.reduce(kruskalwallis, ('col1','col2')))

To get the results in their own named columns you can use to_struct followed by unnest like this...

(
    df
    .groupby("Group")
    .agg(krusk=pl.reduce(kruskalwallis, ('col1', 'col2')))
    .with_columns(pl.col('krusk').list.to_struct(fields=['Hstats', 'p_val']))
    .unnest('krusk')
)

Note: OP's question has a parameter named arr, which is unrelated to the .arr namespace. If you're on an old version of polars, use .arr in place of .list above; .list has recently supplanted .arr, but for this purpose only the namespace name changes and everything else stays the same.