T-test for groups within a Pandas dataframe for a unique id

Question

T-test for groups within a Pandas dataframe for a unique id

402 Views Asked by Adam Merckx At 06 June 2025 at 21:37

I have the following dataframe and I am performing a t-test between all days of the weekday and all days of a weekend in a month for every ID.

> +-----+------------+-----------+---------+-----------+ | id  | usage_day  | dow       | tow     | daily_avg |
> +-----+------------+-----------+---------+-----------+ | c96 | 01/09/2020 | Tuesday   | week    | 393.07    |
> +-----+------------+-----------+---------+-----------+ | c96 | 02/09/2020 | Wednesday | week    | 10.38     |
> +-----+------------+-----------+---------+-----------+ | c96 | 03/09/2020 | Thursday  | week    | 429.35    |
> +-----+------------+-----------+---------+-----------+ | c96 | 04/09/2020 | Friday    | week    | 156.20    |
> +-----+------------+-----------+---------+-----------+ | c96 | 05/09/2020 | Saturday  | weekend | 346.22    |
> +-----+------------+-----------+---------+-----------+ | c96 | 06/09/2020 | Sunday    | weekend | 106.53    |
> +-----+------------+-----------+---------+-----------+ | c96 | 08/09/2020 | Tuesday   | week    | 194.74    |
> +-----+------------+-----------+---------+-----------+ | c96 | 10/09/2020 | Thursday  | week    | 66.30     |
> +-----+------------+-----------+---------+-----------+ | c96 | 17/09/2020 | Thursday  | week    | 163.84    |
> +-----+------------+-----------+---------+-----------+ | c96 | 18/09/2020 | Friday    | week    | 261.81    |
> +-----+------------+-----------+---------+-----------+ | c96 | 19/09/2020 | Saturday  | weekend | 410.30    |
> +-----+------------+-----------+---------+-----------+ | c96 | 20/09/2020 | Sunday    | weekend | 266.28    |
> +-----+------------+-----------+---------+-----------+ | c96 | 23/09/2020 | Wednesday | week    | 346.18    |
> +-----+------------+-----------+---------+-----------+ | c96 | 24/09/2020 | Thursday  | week    | 20.67     |
> +-----+------------+-----------+---------+-----------+ | c96 | 25/09/2020 | Friday    | week    | 222.23    |
> +-----+------------+-----------+---------+-----------+ | c96 | 26/09/2020 | Saturday  | weekend | 449.84    |
> +-----+------------+-----------+---------+-----------+ | c96 | 27/09/2020 | Sunday    | weekend | 438.47    |
> +-----+------------+-----------+---------+-----------+ | c96 | 28/09/2020 | Monday    | week    | 10.44     |
> +-----+------------+-----------+---------+-----------+ | c96 | 29/09/2020 | Tuesday   | week    | 293.59    |
> +-----+------------+-----------+---------+-----------+ | c96 | 30/09/2020 | Wednesday | week    | 194.49    |
> +-----+------------+-----------+---------+-----------+

My script is as follows, but it is unfortunately too slow and not the pandas way of doing things. How I could do it more efficiently?

    from scipy.stats import ttest_ind, ttest_ind_from_stats

    p_val = []
    stat_flag = []
    all_ids = df.id.unique()
    alpha = 0.05
    print(len(all_ids))
    for id in all_ids:
        t = df[df.id ==  id]
        group1 = t[t.tow == 'week']
        group2 = t[t.tow == 'weekend']
        t, p_value_ttest = ttest_ind(group1.daily_avg, group2.daily_avg, equal_var=False)
        if p_value_ttest < alpha:
           p_val.append(p_value_ttest)
           stat_flag.append(1)
        else: 
           p_val.append(p_value_ttest)
           stat_flag.append(0)

p-val gives the p-values for every id.

Original Q&A

There are 2 best solutions below

Hugo Guillen On 12 October 2020 at 09:21

I cannot benchmark without sample data, but maybe you could try a groupby instead of your for loop:

for id,t in df.groupby('id'):    
    group1 = t[t.tow == 'week']
    group2 = t[t.tow == 'weekend']
    t, p_value_ttest = ttest_ind(group1.daily_avg, group2.daily_avg, equal_var=False)
    if p_value_ttest < alpha:
        p_val.append(p_value_ttest)
        stat_flag.append(1)
    else: 
        p_val.append(p_value_ttest)
        stat_flag.append(0)

**jlandercy** · Accepted Answer

Dataset

Based on dataset you provided:

import io
from scipy import stats
import pandas as pd

s = """id|usage_day|dow|tow|daily_avg
c96|01/09/2020|Tuesday|week|393.07
c96|02/09/2020|Wednesday|week|10.38
c96|03/09/2020|Thursday|week|429.35
c96|04/09/2020|Friday|week|156.20
c96|05/09/2020|Saturday|weekend|346.22
c96|06/09/2020|Sunday|weekend|106.53
c96|08/09/2020|Tuesday|week|194.74
c96|10/09/2020|Thursday|week|66.30
c96|17/09/2020|Thursday|week|163.84
c96|18/09/2020|Friday|week|261.81
c96|19/09/2020|Saturday|weekend|410.30
c96|20/09/2020|Sunday|weekend|266.28
c96|23/09/2020|Wednesday|week|346.18
c96|24/09/2020|Thursday|week|20.67
c96|25/09/2020|Friday|week|222.23
c96|26/09/2020|Saturday|weekend|449.84
c96|27/09/2020|Sunday|weekend|438.47
c96|28/09/2020|Monday|week|10.44
c96|29/09/2020|Tuesday|week|293.59
c96|30/09/2020|Wednesday|week|194.49"""
df = pd.read_csv(io.StringIO(s), sep='|')

I add a new id with similar data for groupby clarity sake:

df2 = df.copy()
df2['id'] = 'c97'
df = pd.concat([df, df2])

MCVE

You do not have to resort to any explicit loop, instead take advantage of the apply method which operates on frames and also works with groupby.

To do that, we define a function performing the desired test on a DataFrame (groupby will call this method for each sub dataframe corresponding to combination of grouped keys):

def ttest(x):
    g = x.groupby('tow').agg({'daily_avg': list})
    r = stats.ttest_ind(g.loc['week', 'daily_avg'], g.loc['weekend', 'daily_avg'], equal_var=False)
    s = {k: getattr(r, k) for k in r._fields}
    return pd.Series(s)

Then it suffices to chain apply after the groupby call:

T = df.groupby('id').apply(ttest)

Results is about:

     statistic    pvalue
id                      
c96  -2.128753  0.059126
c97  -2.128753  0.059126

Refactoring

Once you have understood the power of this methodology, you can refactor the above code into a reusable function such as:

def ttest(x, y):
    return stats.ttest_ind(x, y, equal_var=False)

def apply_test(x, subgroup='tow', value='daily_avg', key1='week', key2='weekend', test=ttest):
    g = x.groupby(subgroup).agg({value: list})
    r = test(g.loc[key1, value], g.loc[key2, value])
    return pd.Series({k: getattr(r, k) for k in r._fields})

T = df.groupby('id').apply(apply_test, subgroup='anotherbucket', key1='experience', key2='reference', value='threshold')

Which allow you to adapt statistical tests and DataFrame columns with respect to your needs.

T-test for groups within a Pandas dataframe for a unique id

There are 2 best solutions below

Dataset

MCVE

Refactoring

Related Questions in PYTHON

Related Questions in PANDAS

Related Questions in DATAFRAME

Related Questions in T-TEST

Trending Questions

Popular # Hahtags

Popular Questions