T-test for groups within a Pandas dataframe for a unique id

390 Views Asked by At

I have the following dataframe and I am performing a t-test between all days of the weekday and all days of a weekend in a month for every ID.

> +-----+------------+-----------+---------+-----------+ | id  | usage_day  | dow       | tow     | daily_avg |
> +-----+------------+-----------+---------+-----------+ | c96 | 01/09/2020 | Tuesday   | week    | 393.07    |
> +-----+------------+-----------+---------+-----------+ | c96 | 02/09/2020 | Wednesday | week    | 10.38     |
> +-----+------------+-----------+---------+-----------+ | c96 | 03/09/2020 | Thursday  | week    | 429.35    |
> +-----+------------+-----------+---------+-----------+ | c96 | 04/09/2020 | Friday    | week    | 156.20    |
> +-----+------------+-----------+---------+-----------+ | c96 | 05/09/2020 | Saturday  | weekend | 346.22    |
> +-----+------------+-----------+---------+-----------+ | c96 | 06/09/2020 | Sunday    | weekend | 106.53    |
> +-----+------------+-----------+---------+-----------+ | c96 | 08/09/2020 | Tuesday   | week    | 194.74    |
> +-----+------------+-----------+---------+-----------+ | c96 | 10/09/2020 | Thursday  | week    | 66.30     |
> +-----+------------+-----------+---------+-----------+ | c96 | 17/09/2020 | Thursday  | week    | 163.84    |
> +-----+------------+-----------+---------+-----------+ | c96 | 18/09/2020 | Friday    | week    | 261.81    |
> +-----+------------+-----------+---------+-----------+ | c96 | 19/09/2020 | Saturday  | weekend | 410.30    |
> +-----+------------+-----------+---------+-----------+ | c96 | 20/09/2020 | Sunday    | weekend | 266.28    |
> +-----+------------+-----------+---------+-----------+ | c96 | 23/09/2020 | Wednesday | week    | 346.18    |
> +-----+------------+-----------+---------+-----------+ | c96 | 24/09/2020 | Thursday  | week    | 20.67     |
> +-----+------------+-----------+---------+-----------+ | c96 | 25/09/2020 | Friday    | week    | 222.23    |
> +-----+------------+-----------+---------+-----------+ | c96 | 26/09/2020 | Saturday  | weekend | 449.84    |
> +-----+------------+-----------+---------+-----------+ | c96 | 27/09/2020 | Sunday    | weekend | 438.47    |
> +-----+------------+-----------+---------+-----------+ | c96 | 28/09/2020 | Monday    | week    | 10.44     |
> +-----+------------+-----------+---------+-----------+ | c96 | 29/09/2020 | Tuesday   | week    | 293.59    |
> +-----+------------+-----------+---------+-----------+ | c96 | 30/09/2020 | Wednesday | week    | 194.49    |
> +-----+------------+-----------+---------+-----------+

My script is as follows, but it is unfortunately too slow and not the pandas way of doing things. How I could do it more efficiently?

    from scipy.stats import ttest_ind, ttest_ind_from_stats

    p_val = []
    stat_flag = []
    all_ids = df.id.unique()
    alpha = 0.05
    print(len(all_ids))
    for id in all_ids:
        t = df[df.id ==  id]
        group1 = t[t.tow == 'week']
        group2 = t[t.tow == 'weekend']
        t, p_value_ttest = ttest_ind(group1.daily_avg, group2.daily_avg, equal_var=False)
        if p_value_ttest < alpha:
           p_val.append(p_value_ttest)
           stat_flag.append(1)
        else: 
           p_val.append(p_value_ttest)
           stat_flag.append(0)

p-val gives the p-values for every id.

2

There are 2 best solutions below

0
On BEST ANSWER

Dataset

Based on dataset you provided:

import io
from scipy import stats
import pandas as pd

s = """id|usage_day|dow|tow|daily_avg
c96|01/09/2020|Tuesday|week|393.07
c96|02/09/2020|Wednesday|week|10.38
c96|03/09/2020|Thursday|week|429.35
c96|04/09/2020|Friday|week|156.20
c96|05/09/2020|Saturday|weekend|346.22
c96|06/09/2020|Sunday|weekend|106.53
c96|08/09/2020|Tuesday|week|194.74
c96|10/09/2020|Thursday|week|66.30
c96|17/09/2020|Thursday|week|163.84
c96|18/09/2020|Friday|week|261.81
c96|19/09/2020|Saturday|weekend|410.30
c96|20/09/2020|Sunday|weekend|266.28
c96|23/09/2020|Wednesday|week|346.18
c96|24/09/2020|Thursday|week|20.67
c96|25/09/2020|Friday|week|222.23
c96|26/09/2020|Saturday|weekend|449.84
c96|27/09/2020|Sunday|weekend|438.47
c96|28/09/2020|Monday|week|10.44
c96|29/09/2020|Tuesday|week|293.59
c96|30/09/2020|Wednesday|week|194.49"""
df = pd.read_csv(io.StringIO(s), sep='|')

I add a new id with similar data for groupby clarity sake:

df2 = df.copy()
df2['id'] = 'c97'
df = pd.concat([df, df2])

MCVE

You do not have to resort to any explicit loop, instead take advantage of the apply method which operates on frames and also works with groupby.

To do that, we define a function performing the desired test on a DataFrame (groupby will call this method for each sub dataframe corresponding to combination of grouped keys):

def ttest(x):
    g = x.groupby('tow').agg({'daily_avg': list})
    r = stats.ttest_ind(g.loc['week', 'daily_avg'], g.loc['weekend', 'daily_avg'], equal_var=False)
    s = {k: getattr(r, k) for k in r._fields}
    return pd.Series(s)

Then it suffices to chain apply after the groupby call:

T = df.groupby('id').apply(ttest)

Results is about:

     statistic    pvalue
id                      
c96  -2.128753  0.059126
c97  -2.128753  0.059126

Refactoring

Once you have understood the power of this methodology, you can refactor the above code into a reusable function such as:

def ttest(x, y):
    return stats.ttest_ind(x, y, equal_var=False)

def apply_test(x, subgroup='tow', value='daily_avg', key1='week', key2='weekend', test=ttest):
    g = x.groupby(subgroup).agg({value: list})
    r = test(g.loc[key1, value], g.loc[key2, value])
    return pd.Series({k: getattr(r, k) for k in r._fields})

T = df.groupby('id').apply(apply_test, subgroup='anotherbucket', key1='experience', key2='reference', value='threshold')

Which allow you to adapt statistical tests and DataFrame columns with respect to your needs.

2
On

I cannot benchmark without sample data, but maybe you could try a groupby instead of your for loop:

for id,t in df.groupby('id'):    
    group1 = t[t.tow == 'week']
    group2 = t[t.tow == 'weekend']
    t, p_value_ttest = ttest_ind(group1.daily_avg, group2.daily_avg, equal_var=False)
    if p_value_ttest < alpha:
        p_val.append(p_value_ttest)
        stat_flag.append(1)
    else: 
        p_val.append(p_value_ttest)
        stat_flag.append(0)