An illustration of my question
I am dealing with hourly time series data containing datetimes from 2016-01-01 00:00 to 2016-01-07 23:00, plus some features corresponding to each timestamp.
Ideally, there would be 7 x 24 rows of data covering the whole period. But sometimes the hourly data is interrupted for some reason (e.g. when capturing hourly weather information, the website may be down during certain hours).
My dataframe currently contains all the valid data, with no rows for the missing hours. Its length is 7 * 24 - 5, which means 5 hourly records are missing.
The datetimes are saved in the format 201601010100, representing 2016-01-01 01:00:00.
My attempt
I have tried to generate a new dataframe of length 7 * 24 containing a continuous hourly datetime column:
import pandas as pd

data = {"datetime": [], "feature1": [], "feature2": []}
ff = pd.DataFrame(data)
# 7 * 24 = 168 hourly timestamps covering the whole week
rng = pd.date_range('2016-01-01', periods=7 * 24, freq='H')
# the format must match df's 12-digit datetime strings (YYYYMMDDHHMM)
new_date = list(rng.strftime('%Y%m%d%H%M'))
ff['datetime'] = new_date
The original dataframe is df, which contains the columns datetime, feature1 and feature2.
I tried f3 = ff.merge(df, on='datetime'), but f3 still did not line the features up against the full 7 * 24 hours.
My target
Because of the large temporal coverage and feature dimension in practice, I don't want to loop over the two dataframes and compare their datetime values one by one.
I want to fill the feature values in ff from the rows of df with identical datetimes, and keep NaN for the missing hours.
Create an empty dataframe with rng as index:
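For example (a sketch; the names rng and ff follow the question):

```python
import pandas as pd

# complete hourly index for the week: 7 * 24 = 168 timestamps
rng = pd.date_range('2016-01-01', periods=7 * 24, freq='H')

# empty frame whose index is the full hourly range
ff = pd.DataFrame(index=rng)
```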
Convert the original dates to numpy.datetime64 to make them compatible with the date range:
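A sketch of the conversion, using a toy stand-in for the question's df with its 12-digit string format (YYYYMMDDHHMM); the column name datetime_ns is an assumption carried into the next step:

```python
import pandas as pd

# toy stand-in for the question's df (strings like '201601010100')
df = pd.DataFrame({
    'datetime': ['201601010000', '201601010100', '201601010300'],
    'feature1': [1.0, 2.0, 3.0],
    'feature2': [4.0, 5.0, 6.0],
})

# parse the strings into datetime64[ns], comparable with date_range values
df['datetime_ns'] = pd.to_datetime(df['datetime'], format='%Y%m%d%H%M')
```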
Perform an outer join of the frames on index and datetime_ns:
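One way to sketch this step is DataFrame.join against df re-indexed by datetime_ns (equivalent to pd.merge with left_index=True, right_on='datetime_ns'); the example uses a shortened four-hour range, with df deliberately missing two hours:

```python
import pandas as pd

# full hourly index (shortened to 4 hours for the example)
rng = pd.date_range('2016-01-01', periods=4, freq='H')
ff = pd.DataFrame(index=rng)

# incomplete data: hours 01:00 and 02:00 are missing
df = pd.DataFrame({
    'datetime_ns': pd.to_datetime(['201601010000', '201601010300'],
                                  format='%Y%m%d%H%M'),
    'feature1': [1.0, 2.0],
})

# outer join of ff's index with df's datetime_ns: every hour is kept,
# and hours absent from df get NaN in the feature columns
f3 = ff.join(df.set_index('datetime_ns'), how='outer')
```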
Sort the new dataframe, if necessary:
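For example, sorting by the datetime index (a sketch with a deliberately out-of-order toy frame):

```python
import pandas as pd
import numpy as np

# toy joined result whose rows are out of chronological order
f3 = pd.DataFrame(
    {'feature1': [2.0, np.nan, 1.0]},
    index=pd.to_datetime(['2016-01-01 02:00', '2016-01-01 01:00',
                          '2016-01-01 00:00']),
)

# sort chronologically; the join output order is not always guaranteed
f3 = f3.sort_index()
```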