I want to select columns based on their datetime data types. My DataFrame has, for example, columns with the types np.dtype('datetime64[ns]'), np.datetime64 and 'datetime64[ns, UTC]'.
Is there a generic way to select all columns with a datetime datatype?
For instance, this works:
import numpy as np
from sklearn.compose import make_column_selector
selector = make_column_selector(dtype_include=(np.dtype('datetime64[ns]'), np.datetime64))
selected_columns = selector(df)
But this doesn't (datatype from a pandas df with 'UTC'):
from sklearn.compose import make_column_selector
selector = make_column_selector(dtype_include=np.dtype('datetime64[ns, UTC]'))
selected_columns = selector(df)
By comparison, for numeric data types you can simply use np.number instead of np.int64 etc.
API-Reference to make_column_selector: LINK
As far as I know, there is no parent class to catch all the datetime columns in a generic way, as you pointed out with np.number.
My guess is that you want to use it in a sklearn pipeline for preprocessing? In that case you can use a custom selector:
Test
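A custom selector of the kind mentioned above could look like the following minimal sketch. It is one possible implementation and makes some assumptions: it relies on pandas' `is_datetime64_any_dtype`, which returns true for both timezone-naive and timezone-aware datetime dtypes, and the function name `datetime_selector` as well as the dummy data are hypothetical.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype


def datetime_selector(df: pd.DataFrame) -> list:
    """Return the names of all datetime columns (tz-naive or tz-aware).

    Hypothetical helper: it mimics the call signature of the object
    returned by make_column_selector (DataFrame in, column names out).
    """
    return [col for col in df.columns if is_datetime64_any_dtype(df[col])]


# Dummy DataFrame with mixed dtypes, including naive and UTC-aware dates.
dates = ["2021-01-01", "2021-01-02", "2021-01-03"]
df = pd.DataFrame(
    {
        "number": [1.0, 2.0, 3.0],
        "text": ["a", "b", "c"],
        "naive": pd.to_datetime(dates),
        "aware": pd.to_datetime(dates, utc=True),
    }
)

selected_columns = datetime_selector(df)
print(selected_columns)  # ['naive', 'aware']
```

Because the selector is a plain callable that takes a DataFrame and returns column names, it should be usable wherever sklearn accepts a callable column specification.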
This is a simple test of the custom selector on a dummy DataFrame with different data types, and for the dates also with timezone-aware and timezone-naive dates.
After preprocessing:
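One way such a selector might be plugged into a preprocessing step is sketched below. It assumes that `ColumnTransformer` is given the callable directly as its column specification. The names `datetime_selector` and `extract_year` are hypothetical, and extracting the year is only a placeholder for whatever date preprocessing is actually needed.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer


def datetime_selector(df):
    # Hypothetical selector: names of all tz-naive and tz-aware datetime columns.
    return [col for col in df.columns if is_datetime64_any_dtype(df[col])]


def extract_year(frame):
    # Placeholder transformation: replace every datetime column by its year.
    return frame.apply(lambda col: col.dt.year)


dates = ["2021-01-01", "2021-01-02", "2021-01-03"]
df = pd.DataFrame(
    {
        "number": [1.0, 2.0, 3.0],
        "naive": pd.to_datetime(dates),
        "aware": pd.to_datetime(dates, utc=True),
    }
)

# The callable is evaluated on the input DataFrame at fit time.
preprocessor = ColumnTransformer(
    transformers=[("dates", FunctionTransformer(extract_year), datetime_selector)],
    remainder="drop",
)
result = preprocessor.fit_transform(df)
print(result.shape)  # (3, 2): one year column per datetime column
```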