I'm working on a time series analysis project using PySpark and TsFresh for feature engineering. After extracting features from my time series data using tsfresh.convenience.bindings.spark_feature_extraction_on_chunk(), I encountered a challenge in integrating the extracted features back into my original time series data.
Here's the process I followed:
Data Preparation: I started with a PySpark DataFrame df_melted. It is my original data (columns id, first_event_date, and capacity_tonnes) melted into the long format that tsfresh expects, with columns id_index, kind, first_event_date, and value.

Feature Extraction with TsFresh:

```python
from tsfresh.convenience.bindings import spark_feature_extraction_on_chunk
from tsfresh.feature_extraction import ComprehensiveFCParameters

df_grouped = df_melted.groupby(["id_index", "kind"])
features = spark_feature_extraction_on_chunk(
    df_grouped,
    column_id="id_index",
    column_kind="kind",
    column_sort="first_event_date",
    column_value="value",
    default_fc_parameters=ComprehensiveFCParameters(),
)
```
Current Output: The features DataFrame contains the extracted features, but with one row per extracted feature per id rather than one row per timestamp, so it is no longer in a time series format.
My Goal: I want to use these extracted features in a time series model. Specifically, I need to reintegrate these features with my original time series data in a way that preserves the temporal structure for subsequent modeling (e.g., forecasting with XGBoost).
Question: How can I effectively merge the extracted features back into my original time series data while maintaining the temporal sequence? Are there best practices or specific methods in PySpark that I should use for this kind of operation?