What is the quickest way to apply a statistical analysis such as Kruskal-Wallis from scipy module in python to an entire polars dataframe?

198 Views Asked by megha At 14 June 2023 at 09:07

I want to apply the Kruskal-Wallis (KW) test to every numeric column of a Polars dataframe and return a new dataframe in which each column holds the KW result for that column.

My dataframe, which has a large number of rows and columns, looks like this:

import polars as pl
df = pl.DataFrame({"Group": ["A", "B", "A", "A", "B"], "col1": [3, 3.2, None, 1, 2.3], "col2": [4, 5.4, 3.2, 1.5, 2.3]})

Group	col1	col2
A	3.0	4.0
B	3.2	5.4
A	null	3.2
A	1.0	1.5
B	2.3	2.3

I want to apply KW such that I get this:

Group	col1	col2
[A,B]	{'Hstats': value1, 'p_val': value2}	{'Hstats': value3, 'p_val': value4}

value1 through value4 are the results from Kruskal-Wallis.

I have tried the following, where I iterate over every column to get the output. But it's tedious and may take a long time to compute when the dataframe is large.

from scipy.stats import mstats

def kwa(arr):
    # arr is a Series holding one list of values per group
    try:
        H, p = mstats.kruskalwallis(arr.to_list())
        return {'Hstats': H, 'p_val': p}
    except Exception:
        return {'Hstats': 'NA', 'p_val': 'NA'}

def calckwa(df):
    # collect each column's values into one list per group
    dft = df.groupby('Group', maintain_order=True).agg(pl.all())
    trendcols = dft.get_columns()[1:]  # skip the 'Group' column
    trendskwa = {trend.name: kwa(trend) for trend in trendcols}
    return trendskwa

kwa_dict = pl.from_dict(calckwa(df))

I also tried to follow this answer, where it is done with pandas. However, when I tried the same with Polars as below, I got a PanicException.

df_grp = df.groupby('Group')
newdf = df_grp.apply(lambda grp: grp.drop('Group').apply(kwa))

This throws PanicException: BindingsError: "Could not determine output type". I get the same error with any simple Polars dataframe that has a few numeric columns and a string "Group" column.

So, can anyone help me apply Kruskal-Wallis to all the columns of a Polars dataframe without iterating over them and without converting to a pandas dataframe (if the dataframe is already large, converting to pandas will take longer)?

There are 1 best solutions below

Dean MacGregor On 14 June 2023 at 09:29

The simplest thing is to just use pl.reduce like this:

from scipy.stats.mstats import kruskalwallis
df.select(pl.reduce(kruskalwallis, ('col1','col2')))

You can do it in a groupby().agg context as you'd expect:

df.groupby("Group").agg(pl.reduce(kruskalwallis, ('col1','col2')))

To get the results in their own named columns you can use to_struct followed by unnest like this...

(
    df
        .groupby("Group")
        .agg(krusk=pl.reduce(kruskalwallis, ('col1','col2')))
        .with_columns(pl.col('krusk').list.to_struct(fields=['Hstats','p_val']))
        .unnest('krusk')
        )

Note: OP's question has a parameter named arr, which is not to be confused with the polars namespace of the same name. The .list namespace used above recently supplanted .arr, so on an older version of polars use .arr instead; for this purpose, only the namespace changes and everything else stays the same.
