Pandas DataFrames: Forward Fill Multiple Rows

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Timestamp': [
        pd.Timestamp('19/01/2022  10:00:00'),
        pd.Timestamp('19/01/2022  10:00:00'),
        pd.Timestamp('19/01/2022  15:00:00'),
        pd.Timestamp('19/01/2022  15:30:00'),
        pd.Timestamp('19/01/2022  16:00:00'),
        pd.Timestamp('19/01/2022  19:30:00'),
        pd.Timestamp('19/01/2022  20:00:00'),
        pd.Timestamp('19/01/2022  20:30:00'),
        pd.Timestamp('20/01/2022  13:00:00'),
        pd.Timestamp('20/01/2022  13:30:00'),
        pd.Timestamp('20/01/2022  14:00:00'),
        pd.Timestamp('20/01/2022  14:50:00'),
        pd.Timestamp('20/01/2022  15:00:00')],
    'Name': [
        'A', 'B', np.nan, np.nan, np.nan,
        'C', np.nan, np.nan, np.nan, np.nan, np.nan,
        'D', np.nan]})

I am trying to forward fill multiple rows at once between timestamps, but I have not found a fast way to do it. Could you please share your solution?

Each row corresponds to a name entry at a given timestamp, and there can be multiple names for the same timestamp. I want to propagate each group of names forward until the next non-NaN value.

I have tried a simple for loop, but it is relatively slow (the real frame will have about 100,000 rows).

Desired output is:

desired_output = pd.DataFrame({
    'Timestamp': [
        pd.Timestamp('19/01/2022  10:00:00'),
        pd.Timestamp('19/01/2022  10:00:00'),
        pd.Timestamp('19/01/2022  15:00:00'),
        pd.Timestamp('19/01/2022  15:00:00'),
        pd.Timestamp('19/01/2022  15:30:00'),
        pd.Timestamp('19/01/2022  15:30:00'),
        pd.Timestamp('19/01/2022  16:00:00'),
        pd.Timestamp('19/01/2022  16:00:00'),
        pd.Timestamp('19/01/2022  19:30:00'),
        pd.Timestamp('19/01/2022  20:00:00'),
        pd.Timestamp('19/01/2022  20:30:00'),
        pd.Timestamp('20/01/2022  13:00:00'),
        pd.Timestamp('20/01/2022  13:30:00'),
        pd.Timestamp('20/01/2022  14:00:00'),
        pd.Timestamp('20/01/2022  14:50:00'),
        pd.Timestamp('20/01/2022  15:00:00')],
    'Name': [
        'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B',
        'C', 'C', 'C', 'C', 'C', 'C',
        'D', 'D']})

Please find my attempt below:

import time

t0 = time.time()

unique_timestamps = df.Timestamp.unique()
new_entries = []
last_valid = None
for ut in unique_timestamps:
    names = df.loc[df.Timestamp == ut, 'Name']
    if pd.isna(names.iat[0]) and last_valid is not None:
        # re-emit the last seen group of names at this timestamp
        new_entries.append(pd.DataFrame({'Timestamp': ut,
                                         'Name': last_valid}))
    else:
        last_valid = names
output = (pd.concat([df, pd.concat(new_entries)])
          .dropna()
          .sort_values('Timestamp'))

t1 = time.time()
print(f'{t1 - t0}s')

There are 2 answers below.


You can groupby Timestamp and agg Name into a collection, then ffill and explode.

I'm not sure how to do this most efficiently, but the simplest way I've found to write it is like this:

(
    df
    .groupby('Timestamp')
    # keep the group's names if its first value is not NaN, else None
    .agg({'Name': lambda s: s if pd.notna(s.iat[0]) else None})
    .ffill()            # propagate the last non-empty group forward
    .explode('Name')    # back to one row per name
)

This follows your logic for detecting NaNs: pd.notna(s.iat[0]) plays the same role as your np.isnan(val[0]) check. There may be simpler ways to write the aggregation, e.g. even lambda s: s, which I'm surprised works the way it does (single-element groups become scalars while multi-element groups become arrays), but I'm not sure they would hold up.

Result:

                    Name
Timestamp               
2022-01-19 10:00:00    A
2022-01-19 10:00:00    B
2022-01-19 15:00:00    A
2022-01-19 15:00:00    B
2022-01-19 15:30:00    A
2022-01-19 15:30:00    B
2022-01-19 16:00:00    A
2022-01-19 16:00:00    B
2022-01-19 19:30:00    C
2022-01-19 20:00:00    C
2022-01-19 20:30:00    C
2022-01-20 13:00:00    C
2022-01-20 13:30:00    C
2022-01-20 14:00:00    C
2022-01-20 14:50:00    D
2022-01-20 15:00:00    D
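
If the Series-returning lambda feels too implicit, the same idea can be spelled out by aggregating into lists. A minimal sketch, assuming the df from the question (the agg(list) / where / explode spelling is my substitution, not the answer's code); the final reset_index() also moves Timestamp back out of the index, restoring the flat two-column layout of desired_output:

# Collect each timestamp's names into a list, blank out the all-NaN groups,
# forward-fill the lists, then explode back to one row per name.
names = df.groupby('Timestamp')['Name'].agg(list)
has_names = names.map(lambda xs: pd.notna(xs[0]))  # NaN groups hold a single NaN
result = (names.where(has_names)  # all-NaN groups become NaN so ffill fills them
               .ffill()
               .explode()
               .rename('Name')
               .reset_index())    # Timestamp back to a column, like desired_output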

Merge

You could de-duplicate the key column (Timestamp), forward-fill it according to the values (Name), then merge with the original, which takes care of the multiple values per timestamp. Here I assign the forward-filled timestamps to a separate column that is used only as a merge key.

(
    df[['Timestamp']].drop_duplicates()
    # blank out timestamps of NaN rows, ffill them; used only as a merge key
    .assign(_ts_merge=lambda d: d.where(df['Name'].notna()).ffill())
    .merge(df.set_index('Timestamp'), left_on='_ts_merge', right_index=True)
    .drop(columns='_ts_merge')
)
             Timestamp Name
0  2022-01-19 10:00:00    A
0  2022-01-19 10:00:00    B
2  2022-01-19 15:00:00    A
2  2022-01-19 15:00:00    B
3  2022-01-19 15:30:00    A
3  2022-01-19 15:30:00    B
4  2022-01-19 16:00:00    A
4  2022-01-19 16:00:00    B
5  2022-01-19 19:30:00    C
6  2022-01-19 20:00:00    C
7  2022-01-19 20:30:00    C
8  2022-01-20 13:00:00    C
9  2022-01-20 13:30:00    C
10 2022-01-20 14:00:00    C
11 2022-01-20 14:50:00    D
12 2022-01-20 15:00:00    D

This handles NaNs a bit differently from how your code does, but I'm not sure there's any practical difference.
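
To see what the merge key does, it can help to build the intermediate frame on its own. A minimal sketch, assuming the df from the question (keys is an illustrative name):

keys = df[['Timestamp']].drop_duplicates()
# Blank out the timestamps of NaN rows, then forward-fill them:
keys['_ts_merge'] = keys['Timestamp'].where(df['Name'].notna()).ffill()
print(keys)
# Rows whose Name was NaN now point at the last timestamp that carried
# names, so merging on _ts_merge pulls in that whole group of names.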