Pandas: Calculate total timedelta of intermediate time entries


I have a dataframe that looks like the one below:

Date        Name   In/Out  Time
2024-01-01  Homer  IN      07:10
2024-01-01  Homer  OUT     09:30
2024-01-01  Homer  IN      10:00
2024-01-01  Homer  OUT     16:00
2024-01-01  Marge  In      07:15
2024-01-01  Marge  Out     16:10
2024-01-01  Bart   In      07:14
2024-01-01  Bart   Out     10:00
2024-01-01  Bart   In      10:15
2024-01-01  Bart   Out     12:00
2024-01-01  Bart   In      12:30
2024-01-01  Bart   Out     17:00

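My end goal is to calculate the total timedelta (total_seconds()) of the intermediate IN/OUT times only, i.e. the gaps between an OUT and the next IN for the same person on the same day. For Homer that is 09:30 to 10:00 (1800 seconds); for Bart it is 10:00 to 10:15 plus 12:00 to 12:30 (900 + 1800 = 2700 seconds). Marge has no intermediate entries, so she does not appear. Expected output below.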

Date        Name   TimeDelta
2024-01-01  Homer  1800
2024-01-01  Bart   2700

I'm not sure where to start with this one and cannot find any examples elsewhere.

Example dataframe:

import pandas as pd

example_df = pd.DataFrame([
    ['2024-01-01', 'Homer', 'in', '07:30'],
    ['2024-01-01', 'Homer', 'out' ,'09:00'],
    ['2024-01-01', 'Homer', 'in' ,'09:30'],
    ['2024-01-01', 'Homer', 'out' ,'16:00'],
    ['2024-01-01', 'Marge', 'in' , '06:20'],
    ['2024-01-01', 'Marge', 'out' ,'16:00'],
    ['2024-01-01', 'Bart', 'in' ,'07:10'],
    ['2024-01-01', 'Bart', 'out' ,'08:00'],
    ['2024-01-01', 'Bart', 'in' ,'08:20'],
    ['2024-01-01', 'Bart', 'out' ,'17:00'],
    ['2024-01-01', 'Barney', 'in' ,'08:10'],
    ['2024-01-01', 'Lisa', 'in' ,'08:05'],
    ['2024-01-01', 'Lisa', 'out' ,'14:00'],
    ['2024-01-01', 'Lisa', 'in' ,'14:15'],
    ['2024-01-01', 'Lisa', 'out' ,'18:10'],
    ['2024-01-01', 'Millhouse', 'out' ,'19:10'],
    ['2024-02-01', 'Homer', 'in', '07:30'],
    ['2024-02-01', 'Homer', 'out' ,'09:00'],
    ['2024-02-01', 'Marge', 'in' , '06:30'],
    ['2024-02-01', 'Marge', 'out' ,'09:10'],
    ['2024-02-01', 'Marge', 'in' ,'10:10'],
    ['2024-02-01', 'Marge', 'out' ,'16:10'],
    ['2024-02-01', 'Bart', 'in' ,'07:10'],
    ['2024-02-01', 'Bart', 'out' ,'15:00'],
    ['2024-02-01', 'Barney', 'in' ,'08:10'],
    ['2024-02-01', 'Lisa', 'in' ,'08:05'],
    ['2024-02-01', 'Lisa', 'out' ,'16:00'],
    ['2024-02-01', 'Millhouse', 'in' ,'08:10'],
    ['2024-02-01', 'Millhouse', 'in' ,'08:10'],
    ['2024-02-01', 'Millhouse', 'in' ,'16:15']],
    columns=['Date', 'Name', 'In/Out', 'Time'])

There are 2 best solutions below

4
mozway On BEST ANSWER

This assumes that Time is sorted within each group, that the first In/Out entry is always In, and that In and Out always alternate.

You could convert the times with to_datetime, then use groupby.apply to compute the diff between consecutive rows, ignore the first and last values (with iloc), and sum the timedeltas that fall on "IN" rows before converting to total_seconds. Here df is the dataframe from the question:

# cleanup IN/OUT format
df['In/Out'] = df['In/Out'].str.upper()

out = (df
  .assign(dt=pd.to_datetime(df['Time'], format='%H:%M'))
  .groupby(['Date', 'Name'])
  .apply(lambda g: 
         g['dt'].diff().iloc[1:-1]
         [g['In/Out'].eq('IN')]
         .sum().total_seconds())
  .reset_index(name='TimeDelta')
  .query('TimeDelta>0')  # optional: remove rows where TimeDelta is 0
)

Output:

         Date       Name  TimeDelta
1  2024-01-01   J Bloggs     1800.0
2  2024-01-01  M Simpson     2700.0
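
To see why keeping only the "IN" rows isolates the breaks, here is a minimal sketch that walks one group through the same steps by hand. It uses Homer's four rows from the question's table; the variable names (homer, steps, inner, is_in, breaks, total) are made up for illustration and are not part of the answer's code.

```python
import pandas as pd

# Homer's rows from the question, already sorted and alternating IN/OUT
homer = pd.DataFrame({
    "In/Out": ["IN", "OUT", "IN", "OUT"],
    "Time": ["07:10", "09:30", "10:00", "16:00"],
})
dt = pd.to_datetime(homer["Time"], format="%H:%M")

# Row i holds the time elapsed since row i-1; row 0 has no predecessor (NaT)
steps = dt.diff()

# Drop the first row (NaT) and the last row (the final OUT)
inner = steps.iloc[1:-1]

# A diff that lands on an IN row measures OUT -> IN, i.e. time spent away
is_in = homer["In/Out"].eq("IN")
breaks = inner[is_in.loc[inner.index]]

total = breaks.sum().total_seconds()
print(total)
```

The diffs that land on OUT rows measure time spent inside, which is why they are filtered out before summing.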

NB. If any of the initial assumptions is incorrect, you just need to pre-process the data to sort it and remove invalid rows.

Example:

# cleanup IN/OUT format
df['In/Out'] = df['In/Out'].str.upper()

m1 = df.sort_values(by='Time').groupby(['Date', 'Name'])['In/Out'].shift(-1).ne(df['In/Out'])

out = (df[m1]
  .assign(dt=pd.to_datetime(df.loc[m1, 'Time'], format='%H:%M'))
  .groupby(['Date', 'Name'])
  .apply(lambda g: 
         g['dt'].diff().iloc[1:-1]
         [g['In/Out'].eq('IN')]
         .sum().total_seconds())
  .reset_index(name='TimeDelta')
  .query('TimeDelta>0')  # optional: remove rows where TimeDelta is 0
)

Output:

          Date   Name  TimeDelta
1   2024-01-01   Bart     1200.0
2   2024-01-01  Homer     1800.0
3   2024-01-01   Lisa      900.0
10  2024-02-01  Marge     3600.0
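
Before choosing between the two variants, it may help to check which groups actually violate the three assumptions stated at the top of this answer. The following is only a sketch: the function name check_punch_assumptions, the flag column names, and the small demo frame are hypothetical and not part of the original answer.

```python
import pandas as pd

def check_punch_assumptions(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per (Date, Name) group with three boolean flags."""
    tmp = df.assign(
        state=df["In/Out"].str.upper(),
        dt=pd.to_datetime(df["Time"], format="%H:%M"),
    )
    grouped = tmp.groupby(["Date", "Name"], sort=False)

    def alternates(s: pd.Series) -> bool:
        # True when no two consecutive entries share the same state
        values = s.to_numpy()
        return bool((values[1:] != values[:-1]).all())

    return pd.DataFrame({
        # times never go backwards within the group
        "time_sorted": grouped["dt"].apply(lambda s: s.is_monotonic_increasing),
        # the first entry of the group is an IN
        "starts_in": grouped["state"].first().eq("IN"),
        # IN and OUT strictly alternate
        "alternating": grouped["state"].apply(alternates),
    }).reset_index()

# Hypothetical demo data: one clean group and two problematic ones
demo = pd.DataFrame(
    [
        ["2024-01-01", "Homer", "in", "07:30"],
        ["2024-01-01", "Homer", "out", "09:00"],
        ["2024-01-01", "Homer", "in", "09:30"],
        ["2024-01-01", "Homer", "out", "16:00"],
        ["2024-01-01", "Millhouse", "out", "19:10"],
        ["2024-02-01", "Millhouse", "in", "08:10"],
        ["2024-02-01", "Millhouse", "in", "08:10"],
        ["2024-02-01", "Millhouse", "in", "16:15"],
    ],
    columns=["Date", "Name", "In/Out", "Time"],
)
report = check_punch_assumptions(demo)
print(report)
```

Groups where any flag is False are the ones that need the sorting and filtering step shown in the second variant.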
0
Jamie On

I would convert your dataframe so that each record is one person on one day, with an in time and an out time. Then you can just calculate the delta from those two times. Here is how I would do it.

import random as rd

import pandas as pd

# Build a small example dataset: one row per person per day
names = ['John', 'Jeff']
dates = ['01/01/2024', '01/02/2024']
intimes = ['08:00', '09:00']
outtimes = ['17:00', '17:30']

namelist = []
datelist = []
inlist = []
outlist = []

for name in names:
    for date in dates:
        namelist.append(name)
        datelist.append(date)
        inlist.append(rd.choice(intimes))    # random clock-in time
        outlist.append(rd.choice(outtimes))  # random clock-out time

df = pd.DataFrame({'Name': namelist, 'Date': datelist, 'In': inlist, 'Out': outlist})

# Combine the date and time strings, then parse them as datetimes
df['InDateandTime'] = pd.to_datetime(df['Date'] + ' ' + df['In'])
df['OutDateandTime'] = pd.to_datetime(df['Date'] + ' ' + df['Out'])

# Time between clocking in and clocking out
df['TimeDelta'] = df['OutDateandTime'] - df['InDateandTime']

print(df)  # display(df) also works inside a Jupyter notebook

And here is the output:

(image of the resulting dataframe, not included)

You just need to make a new dataframe that converts your records to a single record per person per day and then apply this method to that dataframe. You will need to combine the dates and times as strings first and then use pd.to_datetime to convert them to datetimes. I have shown how to do this in the code.

You can also drop the original 'Date', 'In' and 'Out' columns, but I left them in to show you exactly what I did. Much of the code just creates a dataset that I could use as an example.
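
The reshaping step itself is not shown in this answer. One way it could look for the question's long format is sketched below, assuming entries are sorted by time and alternate IN/OUT within each person and day. Because a person can have several IN/OUT pairs per day, the sketch produces one row per visit rather than one per day; the names punches, state, stamp, visit, visits, gap_seconds and breaks are all hypothetical.

```python
import pandas as pd

# A subset of the question's table (assumed sorted and alternating)
punches = pd.DataFrame(
    [
        ["2024-01-01", "Homer", "IN", "07:10"],
        ["2024-01-01", "Homer", "OUT", "09:30"],
        ["2024-01-01", "Homer", "IN", "10:00"],
        ["2024-01-01", "Homer", "OUT", "16:00"],
        ["2024-01-01", "Marge", "In", "07:15"],
        ["2024-01-01", "Marge", "Out", "16:10"],
    ],
    columns=["Date", "Name", "In/Out", "Time"],
)

punches["state"] = punches["In/Out"].str.upper()
punches["stamp"] = pd.to_datetime(
    punches["Date"] + " " + punches["Time"], format="%Y-%m-%d %H:%M"
)

# Number the visits within a person-day: the n-th IN pairs with the n-th OUT
punches["visit"] = punches.groupby(["Date", "Name", "state"]).cumcount()

# Wide format: one row per visit, with an IN column and an OUT column
visits = (
    punches.pivot(index=["Date", "Name", "visit"], columns="state", values="stamp")
    .reset_index()
    .sort_values(["Date", "Name", "visit"])
)

# Time away = this visit's IN minus the previous visit's OUT (NaN for the first visit)
prev_out = visits.groupby(["Date", "Name"])["OUT"].shift()
visits["gap_seconds"] = (visits["IN"] - prev_out).dt.total_seconds()

# Total break time per person per day; people with a single visit get 0
breaks = (
    visits.groupby(["Date", "Name"], as_index=False)["gap_seconds"].sum()
    .rename(columns={"gap_seconds": "TimeDelta"})
)
print(breaks)
```

In the wide table, OUT minus IN on the same row gives the time spent inside, which is the delta computed in the answer above, while the shifted difference gives the intermediate gaps the question asks for.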