How to impute based on previous known departure and next known arrival?

56 Views Asked by At

Example

Basically, I have a list of arrival and departure timings. In the picture below, i = 3 and i = 4 has departure 0. What I am looking to achieve is to take i=2's departure (as my last known), and i=5's departure (as my next known), take the difference (14:44:39 - 14:41:03) and divide it equally across the missing values I have, which is i=3 and i=4. This is not a fixed pattern and it should be dynamic (i.e, sometimes I can have 2, 3 or n number of missing values)

So my question is, how can I dynamically (or recursively) check what is my next known number and divide the differences equally among the missing values that I have?

In this example, the difference between my last known and next known is 216 seconds. Since there are 3 intervals between i=2 to i=3, i=3 to i=4 and i=4 to i=5 my timestamp for my i=3 is 14:42:15 and i=4 is 14:43:27 (72s).

1

There are 1 best solutions below

0
yht On

To solve this problem, I converted each Departure data item into a timestamps_in_seconds in order to calculate the linear interpolation between them timestamps_interpolated before converting the calculated data into the desired format departure_interpolated.

# Imports
import pandas as pd
from datetime import datetime

# Convert a timestamp to a formatted date "%Y-%m-%d %H:%M:%S"
def convert_to_datetime(timestamp):
    date_time = datetime.utcfromtimestamp(timestamp)
    formatted_date = date_time.strftime("%Y-%m-%d %H:%M:%S")
    return formatted_date

# Create a pandas Series with a mix of timestamps and zeros (inspired from the example provided)
departure_data = pd.Series([
    "2023-10-09 14:36:26",
    "2023-10-09 14:38:03",
    "2023-10-09 14:41:03",
    0,
    0,
    "2023-10-09 14:44:39",
    "2023-10-09 14:45:39",
    "2023-10-09 14:48:39",
    "2023-10-09 14:51:10",
    "2023-10-09 14:54:14",
    "2023-10-09 14:55:28",
    "2023-10-09 14:56:40",
    "2023-10-09 14:57:40",
    "2023-10-09 14:58:31",
    "2023-10-09 15:01:11",
    "2023-10-09 15:02:11",
    "2023-10-09 15:03:11",
    "2023-10-09 15:04:11",
    "2023-10-09 15:05:41",
    "2023-10-09 15:06:59",
    0,
    0,
    0,
    0,
    "2023-10-09 15:08:41"
])

# Replace zeros with None in the data Series
departure_data = departure_data.replace(0, None)

# Convert the data to DateTime objects with the specified format
departure_data = pd.to_datetime(departure_data, format="%Y-%m-%d %H:%M:%S")

# Convert DateTime objects to seconds since epoch, with None for NaT values
timestamps_in_seconds = departure_data.apply(lambda x: int(x.timestamp()) if x is not pd.NaT else None)

# Create a DataFrame with two columns: 'data' and 'timestamps_in_seconds'
df = pd.DataFrame({
    'departure': departure_data,
    'timestamps_in_seconds': timestamps_in_seconds
})

# Interpolate the 'timestamps_in_seconds' column using linear interpolation
df['timestamps_interpolated'] = df['timestamps_in_seconds'].interpolate(method='linear')

# Calculate the difference in seconds between consecutive timestamps (to a better understanding) -- not necessary
df['diff_timestamps_in_seconds'] = df['timestamps_interpolated'].diff()

# Convert the 'timestamps_interpolated' back to formatted dates
df['departure_interpolated'] = df['timestamps_interpolated'].apply(convert_to_datetime)

Here's the output (consider the departure_interpolated column):

departure timestamps_in_seconds timestamps_interpolated diff_timestamps_in_seconds departure_interpolated
0 2023-10-09 14:36:26 1.69686e+09 1.69686e+09 nan 2023-10-09 14:36:26
1 2023-10-09 14:38:03 1.69686e+09 1.69686e+09 97 2023-10-09 14:38:03
2 2023-10-09 14:41:03 1.69686e+09 1.69686e+09 180 2023-10-09 14:41:03
3 NaT nan 1.69686e+09 72 2023-10-09 14:42:15
4 NaT nan 1.69686e+09 72 2023-10-09 14:43:27
5 2023-10-09 14:44:39 1.69686e+09 1.69686e+09 72 2023-10-09 14:44:39
6 2023-10-09 14:45:39 1.69686e+09 1.69686e+09 60 2023-10-09 14:45:39
7 2023-10-09 14:48:39 1.69686e+09 1.69686e+09 180 2023-10-09 14:48:39
8 2023-10-09 14:51:10 1.69686e+09 1.69686e+09 151 2023-10-09 14:51:10
9 2023-10-09 14:54:14 1.69686e+09 1.69686e+09 184 2023-10-09 14:54:14
10 2023-10-09 14:55:28 1.69686e+09 1.69686e+09 74 2023-10-09 14:55:28
11 2023-10-09 14:56:40 1.69686e+09 1.69686e+09 72 2023-10-09 14:56:40
12 2023-10-09 14:57:40 1.69686e+09 1.69686e+09 60 2023-10-09 14:57:40
13 2023-10-09 14:58:31 1.69686e+09 1.69686e+09 51 2023-10-09 14:58:31
14 2023-10-09 15:01:11 1.69686e+09 1.69686e+09 160 2023-10-09 15:01:11
15 2023-10-09 15:02:11 1.69686e+09 1.69686e+09 60 2023-10-09 15:02:11
16 2023-10-09 15:03:11 1.69686e+09 1.69686e+09 60 2023-10-09 15:03:11
17 2023-10-09 15:04:11 1.69686e+09 1.69686e+09 60 2023-10-09 15:04:11
18 2023-10-09 15:05:41 1.69686e+09 1.69686e+09 90 2023-10-09 15:05:41
19 2023-10-09 15:06:59 1.69686e+09 1.69686e+09 78 2023-10-09 15:06:59
20 NaT nan 1.69686e+09 20.4 2023-10-09 15:07:19
21 NaT nan 1.69686e+09 20.4 2023-10-09 15:07:39
22 NaT nan 1.69686e+09 20.4 2023-10-09 15:08:00
23 NaT nan 1.69686e+09 20.4 2023-10-09 15:08:20
24 2023-10-09 15:08:41 1.69686e+09 1.69686e+09 20.4 2023-10-09 15:08:41