I have the following dataframe, and for every ID I am performing a t-test between all weekdays and all weekend days in a month.
+-----+------------+-----------+---------+-----------+
| id  | usage_day  | dow       | tow     | daily_avg |
+-----+------------+-----------+---------+-----------+
| c96 | 01/09/2020 | Tuesday   | week    | 393.07    |
| c96 | 02/09/2020 | Wednesday | week    | 10.38     |
| c96 | 03/09/2020 | Thursday  | week    | 429.35    |
| c96 | 04/09/2020 | Friday    | week    | 156.20    |
| c96 | 05/09/2020 | Saturday  | weekend | 346.22    |
| c96 | 06/09/2020 | Sunday    | weekend | 106.53    |
| c96 | 08/09/2020 | Tuesday   | week    | 194.74    |
| c96 | 10/09/2020 | Thursday  | week    | 66.30     |
| c96 | 17/09/2020 | Thursday  | week    | 163.84    |
| c96 | 18/09/2020 | Friday    | week    | 261.81    |
| c96 | 19/09/2020 | Saturday  | weekend | 410.30    |
| c96 | 20/09/2020 | Sunday    | weekend | 266.28    |
| c96 | 23/09/2020 | Wednesday | week    | 346.18    |
| c96 | 24/09/2020 | Thursday  | week    | 20.67     |
| c96 | 25/09/2020 | Friday    | week    | 222.23    |
| c96 | 26/09/2020 | Saturday  | weekend | 449.84    |
| c96 | 27/09/2020 | Sunday    | weekend | 438.47    |
| c96 | 28/09/2020 | Monday    | week    | 10.44     |
| c96 | 29/09/2020 | Tuesday   | week    | 293.59    |
| c96 | 30/09/2020 | Wednesday | week    | 194.49    |
+-----+------------+-----------+---------+-----------+
My script is as follows, but it is unfortunately too slow and not the pandas way of doing things. How could I do it more efficiently?
from scipy.stats import ttest_ind, ttest_ind_from_stats

p_val = []
stat_flag = []
all_ids = df.id.unique()
alpha = 0.05
print(len(all_ids))
for id_ in all_ids:
    # Rows belonging to the current id
    sub = df[df.id == id_]
    group1 = sub[sub.tow == 'week']
    group2 = sub[sub.tow == 'weekend']
    # Welch's t-test (unequal variances) between weekday and weekend values
    t_stat, p_value_ttest = ttest_ind(group1.daily_avg, group2.daily_avg, equal_var=False)
    p_val.append(p_value_ttest)
    # Flag the id when the difference is significant at level alpha
    if p_value_ttest < alpha:
        stat_flag.append(1)
    else:
        stat_flag.append(0)
p_val holds the p-value for every id.
Dataset
Based on the dataset you provided, I add a new id with similar data, for the sake of clarity about what groupby does.
MCVE
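To keep the example reproducible, the frame can be rebuilt along the following lines. This is only a sketch: the c96 values are copied from the table in the question, while the second id c97 and its values (the c96 values in reverse order) are made up here purely for illustration. The dow and tow columns are derived from the date rather than typed in.

```python
import pandas as pd

# (usage_day, daily_avg) pairs copied from the table in the question (id c96).
records = [
    ("01/09/2020", 393.07), ("02/09/2020", 10.38), ("03/09/2020", 429.35),
    ("04/09/2020", 156.20), ("05/09/2020", 346.22), ("06/09/2020", 106.53),
    ("08/09/2020", 194.74), ("10/09/2020", 66.30), ("17/09/2020", 163.84),
    ("18/09/2020", 261.81), ("19/09/2020", 410.30), ("20/09/2020", 266.28),
    ("23/09/2020", 346.18), ("24/09/2020", 20.67), ("25/09/2020", 222.23),
    ("26/09/2020", 449.84), ("27/09/2020", 438.47), ("28/09/2020", 10.44),
    ("29/09/2020", 293.59), ("30/09/2020", 194.49),
]

base = pd.DataFrame(records, columns=["usage_day", "daily_avg"])
# Dates in the question are day/month/year.
base["usage_day"] = pd.to_datetime(base["usage_day"], format="%d/%m/%Y")
base["dow"] = base["usage_day"].dt.day_name()
# Saturday (5) and Sunday (6) are labelled weekend, every other day week.
base["tow"] = base["usage_day"].dt.dayofweek.ge(5).map({True: "weekend", False: "week"})

first = base.assign(id="c96")
# Hypothetical second id: same dates, c96 values in reverse order.
second = base.assign(id="c97", daily_avg=base["daily_avg"].to_numpy()[::-1].copy())

df = pd.concat([first, second], ignore_index=True)
df = df[["id", "usage_day", "dow", "tow", "daily_avg"]]
```

Note that usage_day ends up as a datetime column here, not as the strings shown in the question.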
You do not have to resort to any explicit loop. Instead, take advantage of the apply method, which operates on frames and also works with groupby.
To do that, we define a function performing the desired test on a DataFrame (groupby will call this function for each sub-DataFrame corresponding to a combination of grouped keys). Then it suffices to chain apply after the groupby call.
The result holds one entry per id.
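The groupby and apply approach described above might look roughly like this. It is a minimal sketch, not necessarily the exact code of the answer: the function name week_vs_weekend and the small demo frame are assumptions (c96 values come from the question, c97 is invented), and Welch's test with equal_var=False is taken from the script in the question.

```python
import pandas as pd
from scipy.stats import ttest_ind

# Small demo frame: c96 values are from the question, c97 is hypothetical.
tow = ["week"] * 4 + ["weekend"] * 2 + ["week"] * 4 + ["weekend"] * 2
c96 = [393.07, 10.38, 429.35, 156.20, 346.22, 106.53,
       194.74, 66.30, 163.84, 261.81, 410.30, 266.28]
c97 = c96[::-1]  # made-up second id: same values in reverse order
df = pd.DataFrame({
    "id": ["c96"] * 12 + ["c97"] * 12,
    "tow": tow * 2,
    "daily_avg": c96 + c97,
})


def week_vs_weekend(frame):
    """Welch t-test of weekday against weekend values for one sub-frame."""
    week = frame.loc[frame["tow"] == "week", "daily_avg"]
    weekend = frame.loc[frame["tow"] == "weekend", "daily_avg"]
    statistic, pvalue = ttest_ind(week, weekend, equal_var=False)
    return pd.Series({"statistic": statistic, "pvalue": pvalue})


alpha = 0.05
# Selecting the needed columns first keeps the grouping key out of the
# sub-frames passed to the function.
result = df.groupby("id")[["tow", "daily_avg"]].apply(week_vs_weekend)
# Same 0/1 flag as in the question, computed in one vectorised step.
result["stat_flag"] = (result["pvalue"] < alpha).astype(int)
```

The result is a DataFrame indexed by id with the columns statistic, pvalue and stat_flag, so the two lists from the original loop become two columns.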
Refactoring
Once you have understood the power of this methodology, you can refactor the above code into a reusable function, which allows you to adapt the statistical test and the DataFrame columns to your needs.
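One possible shape for such a reusable function is sketched below. The name grouped_test and all of its parameters are hypothetical, chosen only for this illustration; it assumes the test callable returns a (statistic, pvalue) pair, as ttest_ind and mannwhitneyu from scipy.stats do. The demo frame reuses c96 values from the question plus an invented c97.

```python
import pandas as pd
from scipy.stats import mannwhitneyu, ttest_ind


def grouped_test(frame, keys, label, value, groups,
                 test=ttest_ind, alpha=0.05, **kwargs):
    """Apply a two-sample test per group (hypothetical helper).

    keys: column(s) to group by; label: column holding the two group names;
    value: column with the measurements; groups: the two names to compare;
    kwargs are forwarded to the test callable.
    """
    def _one(sub):
        a = sub.loc[sub[label] == groups[0], value]
        b = sub.loc[sub[label] == groups[1], value]
        statistic, pvalue = test(a, b, **kwargs)
        return pd.Series({"statistic": statistic, "pvalue": pvalue})

    out = frame.groupby(keys)[[label, value]].apply(_one)
    out["significant"] = out["pvalue"] < alpha
    return out


# Demo frame: c96 values are from the question, c97 is made up.
tow = ["week"] * 4 + ["weekend"] * 2 + ["week"] * 4 + ["weekend"] * 2
c96 = [393.07, 10.38, 429.35, 156.20, 346.22, 106.53,
       194.74, 66.30, 163.84, 261.81, 410.30, 266.28]
df = pd.DataFrame({
    "id": ["c96"] * 12 + ["c97"] * 12,
    "tow": tow * 2,
    "daily_avg": c96 + c96[::-1],
})

# Welch t-test, as in the question.
welch = grouped_test(df, "id", "tow", "daily_avg", ("week", "weekend"),
                     equal_var=False)
# Swapping in another two-sample test only changes one argument.
mwu = grouped_test(df, "id", "tow", "daily_avg", ("week", "weekend"),
                   test=mannwhitneyu)
```

Passing the test as an argument keeps the grouping logic in one place, so changing the test or the columns does not require touching the groupby call.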