I'm designing a database table to store machine learning features (i.e. a feature store) and was thinking to use this tall/narrow schema: event_date, feature_name, feature_value, creation_date
.
This works for simple features, but seems to fall short on more complex scenarios.
Let's consider an example scenario where we want to store features related to a shop (item purchases, orders cancelled...), we might have features that look like
event_date | store_name | feature_name | feature_value | creation_date |
---|---|---|---|---|
2021-01-01 | food_corp | items_sold | 10 | 2021-05-01 |
2021-01-02 | food_corp | items_sold | 5 | 2021-05-01 |
2021-01-01 | supplements_corp | items_sold | 8 | 2021-05-01 |
2021-01-02 | supplements_corp | items_sold | 3 | 2021-05-01 |
2021-01-01 | food_corp | orders_cancelled | 2 | 2021-05-01 |
but what if i want to track more granular data, for example product names or customer types? Is the above schema flexible enough? How would i store that additional metadata?
Schema proposal 1
This incorporates metadata in the feature name, for example cheese_item_purchases_french_customer
indicates the purchase of cheese
item from a french_customer
, storing these two pieces of metadata in the feature_name
event_date | store_name | feature_name | feature_value | creation_date |
---|---|---|---|---|
2021-01-01 | food_corp | cheese_item_purchases_french_customer | 9 | 2021-05-01 |
2021-01-01 | food_corp | cheese_item_purchases_german_customer | 4 | 2021-05-01 |
2021-01-01 | supplements_corp | vitamin_d_item_purchases_french_customer | 7 | 2021-05-01 |
2021-01-01 | supplements_corp | vitamin_d_item_purchases_german_customer | 2 | 2021-05-01 |
2021-01-01 | food_corp | orders_cancelled_french_customer | 2 | 2021-05-01 |
Seems hard to query and one would need to know the exact structure of the feature_name column
Schema proposal 2
Add a metadata column
event_date | store_name | feature_name | metadata | feature_value | creation_date |
---|---|---|---|---|---|
2021-01-01 | food_corp | items_sold | {product_name: cheese, customer_class: french} | 9 | 2021-05-01 |
2021-01-01 | food_corp | items_sold | {product_name: cheese, customer_class: german} | 4 | 2021-05-01 |
2021-01-01 | supplements_corp | items_sold | {product_name: vitamin_d, customer_class: french} | 7 | 2021-05-01 |
2021-01-01 | supplements_corp | items_sold | {product_name: vitamin_d, customer_class: german} | 2 | 2021-05-01 |
2021-01-01 | food_corp | orders_cancelled | {customer_class: french} | 2 | 2021-05-01 |
Also seems hard (and inefficient?) to query
Schema proposal 3
Uses two tables to store feature values and feature metadata
feature_store_table
event_date | store_name | feature_name_hash | feature_value | creation_date |
---|---|---|---|---|
2021-01-01 | food_corp | feature_1 | 10 | 2021-05-01 |
2021-01-01 | food_corp | feature_2 | 5 | 2021-05-01 |
2021-01-01 | supplements_corp | feature_1 | 8 | 2021-05-01 |
2021-01-01 | supplements_corp | feature_2 | 3 | 2021-05-01 |
2021-01-01 | food_corp | feature_3 | 2 | 2021-05-01 |
feature_metadata_table
feature_name_hash | feature_name | metadata_name | metadata_value |
---|---|---|---|
feature_1 | items_sold | product_name | cheese |
feature_1 | items_sold | customer_class | french |
feature_2 | items_sold | product_name | cheese |
feature_2 | items_sold | customer_class | german |
feature_3 | orders_cancelled | customer_class | french |
Seems the most flexible and clean, but will make queries likely more complex. For example, how do i retrieve all entries in feature_store_table
having {'product_name': 'cheese', 'customer_class': 'french'}
?
Of course the alternative to all of this would be to use multiple short/wide tables for each feature but for my use case I'd prefer to stick to a tall/narrow format.
Do you have any recommendation on the proposed approaches or any better ones that i've missed? Regardless of the above premise, should i definitely think to move to multiple short/wide tables?
Thanks