How to Integrate TsFresh Feature Extraction Output with Original Time Series in PySpark


I'm working on a time series analysis project using PySpark, with tsfresh for feature engineering. I extracted features from my time series data with tsfresh.convenience.bindings.spark_feature_extraction_on_chunk(), but I'm struggling to integrate the extracted features back into my original time series data.

Here's the process I followed:

Data Preparation: I started with a PySpark DataFrame df_melted containing a time series with columns id, first_event_date, and capacity_tonnes.

Feature Extraction with TsFresh:

from tsfresh.convenience.bindings import spark_feature_extraction_on_chunk
from tsfresh.feature_extraction import ComprehensiveFCParameters

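# Group by series id and kind so that each group is passed to tsfresh as one chunk
df_grouped = df_melted.groupby(["id_index", "kind"])
# Extract the comprehensive default feature set for every (id_index, kind) group
features = spark_feature_extraction_on_chunk(df_grouped, column_id="id_index", column_kind="kind",
                                             column_sort="first_event_date",
                                             column_value="value",
                                             default_fc_parameters=ComprehensiveFCParameters())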

Current Output: The features DataFrame contains several extracted features for each id, but it's no longer in a time series format.
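To make that shape concrete, here is a minimal sketch in plain Python rather than Spark. It assumes the extraction output is in long format, with one row per id and feature in columns named id_index, variable, and value. The sample rows and feature names are hypothetical, and pivot_features is an illustrative helper, not part of tsfresh or PySpark.

```python
from collections import defaultdict

# Hypothetical sample of the long-format output: one row per (id, feature).
# The column layout (id_index, variable, value) is an assumption.
long_rows = [
    {"id_index": 1, "variable": "capacity_tonnes__mean", "value": 10.0},
    {"id_index": 1, "variable": "capacity_tonnes__maximum", "value": 15.0},
    {"id_index": 2, "variable": "capacity_tonnes__mean", "value": 4.0},
    {"id_index": 2, "variable": "capacity_tonnes__maximum", "value": 6.0},
]


def pivot_features(rows):
    """Turn long (id, variable, value) rows into one wide dict per id."""
    wide = defaultdict(dict)
    for row in rows:
        wide[row["id_index"]][row["variable"]] = row["value"]
    return dict(wide)


# One entry per id, with no time dimension left.
wide_features = pivot_features(long_rows)
```

In PySpark, the analogous step would probably be something like features.groupby("id_index").pivot("variable").agg(F.first("value")), with F being pyspark.sql.functions. Either way the result has one row per id, which is why it no longer looks like a time series.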

My Goal: I want to use these extracted features in a time series model. Specifically, I need to reintegrate these features with my original time series data in a way that preserves the temporal structure for subsequent modeling (e.g., forecasting with XGBoost).

Question: How can I effectively merge the extracted features back into my original time series data while maintaining the temporal sequence? Are there best practices or specific methods in PySpark that I should use for this kind of operation?
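One possible shape for such a merge, sketched in plain Python so the logic is visible independently of Spark: left-join the per-id features onto every original row, then sort by id and date. The sample rows, the feature name, and the attach_features helper are all hypothetical, and the features are assumed to have been pivoted to one record per id already.

```python
from datetime import date

# Hypothetical original series rows (column names taken from the question).
series_rows = [
    {"id_index": 2, "first_event_date": date(2023, 1, 2), "capacity_tonnes": 6.0},
    {"id_index": 1, "first_event_date": date(2023, 1, 2), "capacity_tonnes": 15.0},
    {"id_index": 1, "first_event_date": date(2023, 1, 1), "capacity_tonnes": 5.0},
    {"id_index": 2, "first_event_date": date(2023, 1, 1), "capacity_tonnes": 2.0},
]

# Hypothetical wide features, one dict per id (assumed already pivoted).
wide_features = {
    1: {"capacity_tonnes__mean": 10.0},
    2: {"capacity_tonnes__mean": 4.0},
}


def attach_features(rows, features_by_id):
    """Left-join per-id features onto every row, then sort by id and date."""
    joined = [{**row, **features_by_id.get(row["id_index"], {})} for row in rows]
    return sorted(joined, key=lambda r: (r["id_index"], r["first_event_date"]))


# Each original row is kept and carries the features of its series.
model_rows = attach_features(series_rows, wide_features)
```

In PySpark this would correspond to a left join of df_melted with the pivoted features on id_index, followed by an orderBy on id_index and first_event_date. Note that features computed over a whole series are constant within each id and include information from later dates, which matters for forecasting. Extracting features over rolling windows (tsfresh provides roll_time_series for this) and joining on both id and date may be the better fit.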
