I want to apply the Kruskal-Wallis (KW) test to every numeric column of a Polars DataFrame and return a new DataFrame where every column holds the KW result for that column.
My DataFrame, which has a large number of rows and columns, looks like this:
import polars as pl
df = pl.DataFrame({"Group": ["A", "B", "A", "A", "B"], "col1": [3.0, 3.2, None, 1.0, 2.3], "col2": [4.0, 5.4, 3.2, 1.5, 2.3]})
| Group | col1 | col2 |
|---|---|---|
| A | 3.0 | 4.0 |
| B | 3.2 | 5.4 |
| A | null | 3.2 |
| A | 1.0 | 1.5 |
| B | 2.3 | 2.3 |
I want to apply KW such that I get this:
| Group | col1 | col2 |
|---|---|---|
| [A,B] | {'Hstats': value1, 'p_val': value2} | {'Hstats': value3, 'p_val': value4} |
value1 through value4 are the results from Kruskal-Wallis.
I have tried the following, where I iterate over every column to get the output. But it's too tedious, and the computation time might get long when the DataFrame is very large.
from scipy.stats import mstats

def kwa(arr):
    try:
        H, p = mstats.kruskalwallis(arr.to_list())
        return {'Hstats': H, 'p_val': p}
    except Exception:
        return {'Hstats': 'NA', 'p_val': 'NA'}

def calckwa(df):
    dft = df.groupby('Group', maintain_order=True).agg(pl.all())
    trendcols = dft.get_columns()[1:]
    trendskwa = {**{trend.name: kwa(trend) for trend in trendcols}}
    return trendskwa

kwa_dict = pl.from_dict(calckwa(df))
I also tried to follow this answer, where it is done with pandas. However, when I tried the same with Polars as below, I got a PanicException.
df_grp = df.groupby('Group')
newdf = df_grp.apply(lambda grp: grp.drop('Group').apply(kwa))
This throws: PanicException: BindingsError: "Could not determine output type". I also get this error with any simple Polars DataFrame that has a few numeric columns and a string "Group" column.
So, can anyone help me apply Kruskal-Wallis to all the columns of a Polars DataFrame without iterating over them and without converting to a pandas DataFrame (if the DataFrame is already large, the conversion will take a long time)?
The simplest thing is to just use `pl.reduce`.

You can do it in a `groupby().agg` context as you'd expect.

To get the results in their own named columns you can use `to_struct` followed by `unnest`.

Note: In OP's question, there's a parameter named `arr`. If you're on an old version of polars, be aware that the `.list` namespace has recently supplanted `.arr`; for this purpose only the namespace name changes, with everything else staying the same.