Python - Push forward weekend values to Monday

522 Views Asked by At

I have a dataframe (called df) that looks like this:

df

I'm trying to take all weekend 'Volume' values (the ones where column 'WEEKDAY'=5 (saturday) or 6(sunday)) and sum them to the subsequent monday(WEEKDAY=0).

I tried a few things but nothing really worked, taking an example from the last three rows:

What I have

What I'm expecting is this:

What I expect

To reproduce the problem:

!wget https://raw.githubusercontent.com/brunodifranco/TCC/main/volume_por_dia.csv

df = pd.read_csv('volume_por_dia.csv').sort_values('Datas',ascending=True)
df['Datas'] = pd.to_datetime(df['Datas'])
df = df_volume_noticias.set_index('Datas')
df['WEEKDAY'] = df.index.dayofweek
df
7

There are 7 best solutions below

0
On

I used .groupby to solve the problem.

import pandas as pd

df = pd.read_csv('volume_por_dia.csv')
df['Datas'] = pd.to_datetime(df['Datas'])
df['WEEKDAY'] = df['Datas'].dt.dayofweek
df['index'] = df['Datas']

# Group df by date, setting frequency as week 
#(beginning Tue - so that Sat and Sun will be added to the next Mon)
df_group = df.groupby([pd.Grouper(key = 'Datas', freq='W-MON'), \
 'WEEKDAY', 'index']).agg({'Volume': 'sum'})

# In each group, add days 5, 6 (Sat and Sun) to day 0 (Mon)
df_group.loc[(slice(None), 0), 'Volume'] += \
df_group.loc[(slice(None), [5, 6]), 'Volume'].groupby(level=0).sum()

# In the grouped data, remove Sat and Sun
df_group = df_group.reset_index()
df_group = df_group[df_group['WEEKDAY'] != 5]
df_group = df_group[df_group['WEEKDAY'] != 6]

# Remove volume data from original df, and merge with volume from df_group 
df = df.drop(['Volume'], axis=1)
df = pd.merge(df,df_group[['index','Volume']],on='index', how='left')
df = df.dropna(subset=['Volume'])
df = df.drop(['index'], axis=1)

# Optional: sort dates in ascending order
df = df.sort_values(by=['Datas'])

print (df)
0
On

This solves your question using pd.shift.

import pandas as pd
df['prior_volume'] = df.Volume.shift(1)
df['prior_volume2'] = df.Volume.shift(2)
df.loc[df['WEEKDAY'] == 0, 'Volume'] = df.loc[df['WEEKDAY'] == 0, 'prior_volume'] + \
    df.loc[df['WEEKDAY'] == 0, 'prior_volume2'] + \
    df.loc[df['WEEKDAY'] == 0, 'Volume']
df = df[df['WEEKDAY'].isin(range(5))]
df = df[['Volume', 'WEEKDAY']]
df.head(10)

which yields: enter image description here

0
On

If you consider that weeks start from, for example, Tuesday, the problem becomes simpler. You just need to get the values of the weekend and sum it to the Monday of that week (which will be the Monday after the weekend). This will automatically handle cases in which you data might start/end on a weekend or not.

import numpy as np
import pandas as pd
np.random.seed(1)

# Sample data
dates = pd.date_range('2018-02-05', '2018-07-22', freq='D')
volume = np.random.randint(1, 50, len(dates))
df = pd.DataFrame(dict(Datas=dates, Volume=volume))
df = df.set_index('Datas')

# Week starting from Tuesday
week = ((df.index - pd.DateOffset(days=1)).isocalendar().week).values

def add_weekend_to_monday(week): 
    monday = week.index.weekday == 0
    weekend = week.index.weekday >= 5
    week[monday] += week[weekend].sum() 
    return week

df['Volume'] = df.groupby(week)['Volume'].apply(add_weekend_to_monday)
0
On

Input:

!wget https://raw.githubusercontent.com/brunodifranco/TCC/main/volume_por_dia.csv

import pandas as pd
import numpy as np

df = pd.read_csv('volume_por_dia.csv').sort_values('Datas',ascending=True)
df['Datas'] = pd.to_datetime(df['Datas'])
df.set_index('Datas', inplace=True)
df['WEEKDAY'] = df.index.dayofweek

I assume that the index dates are sorted, the Datas index is unique and that there are no missing days. Some assumptions I can't make are instead:

  • for each monday I have volumes from the complete previous weekend, that could be wrong because the dataframe may start on sunday and I would have an incomplete weekend volume;
  • for each weekend, I would have a following monday, that could be wrong because the dataframe may finish on saturday or sunday.

For these reasons, before computing weekend volumes, I first extract the dates of the first Saturday and last Monday:

first_saturday = df.index[df.WEEKDAY==5][0]
last_monday = df.index[df.WEEKDAY==0][-1]

Now I can extract weekend volumes being sure that I always have the saturday-sunday couple and that for each of these couple, a following monday exists in the dataframe:

df_weekend = df.loc[
    (df.WEEKDAY.isin([5,6]))&
    (df.index<=last_monday)&
    (df.index>=first_saturday)
]
df_weekend

Now, since I have couples of saturday-sunday volumes, I can compute the sums in the following way:

weekend_volumes = pd.Series(
    df_weekend.Volume.values.reshape(-1,2).sum(axis=1), #sum of volume couples
    index = df_weekend.index[1::2]+pd.Timedelta("1d"), #date of the following monday
    name="weekend_volume"
).reindex(df.index).fillna(0) #zero weekend-volume for days that are not mondays 
weekend_volumes

Finally add the weekend-volumes to the starting volumes:

df["Volume"] = df.Volume+weekend_volumes

I am attaching the last 25 lines of the df below:

# 2022-02-18    16.0    4
# 2022-02-19    2.0     5
# 2022-02-20    1.0     6
# 2022-02-21    10.0    0
# 2022-02-22    43.0    1
# 2022-02-23    36.0    2
# 2022-02-24    38.0    3
# 2022-02-25    28.0    4
# 2022-02-26    5.0     5
# 2022-02-27    3.0     6
# 2022-02-28    14.0    0
# 2022-03-01    10.0    1
# 2022-03-02    16.0    2
# 2022-03-03    18.0    3
# 2022-03-04    11.0    4
# 2022-03-05    8.0     5
# 2022-03-06    2.0     6
# 2022-03-07    32.0    0
# 2022-03-08    18.0    1
# 2022-03-09    32.0    2
# 2022-03-10    24.0    3
# 2022-03-11    18.0    4
# 2022-03-12    4.0     5
# 2022-03-13    1.0     6
# 2022-03-14    10.0    0
0
On

Based on the code you provided, you have loaded a CSV file into a DataFrame df, converted the 'Datas' column to a datetime, sorted the DataFrame by date in ascending order, and set the 'Datas' column as the index.

You have also created a new column 'WEEKDAY' that contains the day of the week (0-6, where 0 is Monday and 6 is Sunday) for each date in the index.

import pandas as pd

# Load the data and convert the 'Datas' column to a datetime
df = pd.read_csv('https://raw.githubusercontent.com/brunodifranco/TCC/main/volume_por_dia.csv')
df['Datas'] = pd.to_datetime(df['Datas'])

# Set the 'Datas' column as the index
df.set_index('Datas', inplace=True)

# Compute the sum of weekend days (Saturday and Sunday) and assign it to the next following Monday
weekend_sum = df.loc[df.index.weekday.isin([5,6])]['Volume'].resample('W-MON').sum()
df.loc[weekend_sum.index, 'Volume'] += weekend_sum.values

Explanation:

  1. df.loc[df.index.weekday.isin([5,6])] selects rows where the index (i.e., the dates) falls on a Saturday or Sunday (weekday 5 or 6, respectively).

  2. ['Volume'].resample('W-MON').sum() computes the sum of the 'Volume' column for each week starting on Monday that contains at least one weekend day. The result is a Series where the index contains the start date of each week and the values contain the corresponding sums.

  3. df.loc[weekend_sum.index, 'Volume'] += weekend_sum.values assigns the computed sums to the next following Monday. It selects the rows corresponding to the start dates of the weeks with weekend sums (weekend_sum.index) and adds the corresponding sums (weekend_sum.values) to the 'Volume' column. Note that the += operator is used to modify the original DataFrame df.

0
On

You can simply loop over the rows and keep accumulating volumes from Friday, and update the value in the volume of Sunday. Then, just drop the rows of Friday and Saturday.

values = df.values

volume_accumulated = 0
for idx, row in enumerate(values):
  if row[1] in (5, 6):
    volume_accumulated += row[0]
  elif row[1] == 0:
    volume_accumulated += row[0]
    df["Volume"][idx] = volume_accumulated
  else:
    volume_accumulated = 0

df = df[~df["WEEKDAY"].isin([5, 6])]
0
On

Adding 2 solutions here:

  1. Using pd.shift (pointed out earlier by Lukas Hestermeyer; I've added a simplified version)

  2. Using rolling window (this is literally a one-liner)

Both solutions assume that;

  1. Dates are sorted in ascending order (if not, sorting should be done before proceeding)
  2. Every weekend(saturday and sunday) record is succeeded by a monday record. Additional checks would need to be added in case of mising data

Part 1 | Data Prep:

import pandas as pd
import numpy as np

# STEP 1: Create DF
Datas = [
    '2019-07-02',
    '2019-07-03',
    '2019-07-04',
    '2019-07-05',
    '2019-07-06',
    '2019-07-07',
    '2019-07-08',
    '2022-03-10',
    '2022-03-11',
    '2022-03-12',
    '2022-03-13',
    '2022-03-14'
]

Volume = [17, 30, 20, 21, 5, 10, 12, 24, 18, 4, 1, 5]
WEEKDAY = [1, 2, 3, 4, 5, 6, 0, 3, 4, 5, 6, 0]

dic = {'Datas': Datas, 'Volume': Volume, 'WEEKDAY': WEEKDAY}

df['Datas'] = pd.to_datetime(df['Datas'])
df = df.set_index('Datas')

df = pd.DataFrame(dic)

Part 2 | Solutions:

Solution 1 [pd.shift] :

# STEP 1: add shifts
df['shift_1'] = df['Volume'].shift(1)
df['shift_2'] = df['shift_1'].shift(1)

# STEP 2: sum Volume with shifts where weekday==0
cols_to_sum = ['Volume', 'shift_1', 'shift_2']
df['Volume'] = df[['WEEKDAY'] + cols_to_sum].apply(lambda x: int(x[1]) if x[0] else int(x[1] + x[2] + x[3]), axis=1)
df = df.drop(['shift_1', 'shift_2'], axis=1)
df

Solution 2 [rolling window] :

# use rolling window of size 3 to sum where weekday == 0
df['Volume'] = np.where(
    df['WEEKDAY'] == 0,  
    df['Volume'].rolling(window=3, center=False).sum(), 
    df['Volume']
)
df

Part 3 | Removing weekend records :

df = df.loc[~df['WEEKDAY'].isin([5, 6])]
df