How to design a tall and narrow database table that can deal with unstructured/diverse metadata?

287 Views Asked by At

I'm designing a database table to store machine learning features (i.e. a feature store) and was thinking to use this tall/narrow schema: event_date, feature_name, feature_value, creation_date.

This works for simple features, but seems to fall short on more complex scenarios.

Let's consider an example scenario where we want to store features related to a shop (item purchases, orders cancelled...), we might have features that look like

event_date store_name feature_name feature_value creation_date
2021-01-01 food_corp items_sold 10 2021-05-01
2021-01-02 food_corp items_sold 5 2021-05-01
2021-01-01 supplements_corp items_sold 8 2021-05-01
2021-01-02 supplements_corp items_sold 3 2021-05-01
2021-01-01 food_corp orders_cancelled 2 2021-05-01

but what if i want to track more granular data, for example product names or customer types? Is the above schema flexible enough? How would i store that additional metadata?

Schema proposal 1

This incorporates metadata in the feature name, for example cheese_item_purchases_french_customer indicates the purchase of cheese item from a french_customer, storing these two pieces of metadata in the feature_name

event_date store_name feature_name feature_value creation_date
2021-01-01 food_corp cheese_item_purchases_french_customer 9 2021-05-01
2021-01-01 food_corp cheese_item_purchases_german_customer 4 2021-05-01
2021-01-01 supplements_corp vitamin_d_item_purchases_french_customer 7 2021-05-01
2021-01-01 supplements_corp vitamin_d_item_purchases_german_customer 2 2021-05-01
2021-01-01 food_corp orders_cancelled_french_customer 2 2021-05-01

Seems hard to query and one would need to know the exact structure of the feature_name column

Schema proposal 2

Add a metadata column

event_date store_name feature_name metadata feature_value creation_date
2021-01-01 food_corp items_sold {product_name: cheese, customer_class: french} 9 2021-05-01
2021-01-01 food_corp items_sold {product_name: cheese, customer_class: german} 4 2021-05-01
2021-01-01 supplements_corp items_sold {product_name: vitamin_d, customer_class: french} 7 2021-05-01
2021-01-01 supplements_corp items_sold {product_name: vitamin_d, customer_class: german} 2 2021-05-01
2021-01-01 food_corp orders_cancelled {customer_class: french} 2 2021-05-01

Also seems hard (and inefficient?) to query

Schema proposal 3

Uses two tables to store feature values and feature metadata

feature_store_table

event_date store_name feature_name_hash feature_value creation_date
2021-01-01 food_corp feature_1 10 2021-05-01
2021-01-01 food_corp feature_2 5 2021-05-01
2021-01-01 supplements_corp feature_1 8 2021-05-01
2021-01-01 supplements_corp feature_2 3 2021-05-01
2021-01-01 food_corp feature_3 2 2021-05-01

feature_metadata_table

feature_name_hash feature_name metadata_name metadata_value
feature_1 items_sold product_name cheese
feature_1 items_sold customer_class french
feature_2 items_sold product_name cheese
feature_2 items_sold customer_class german
feature_3 orders_cancelled customer_class french

Seems the most flexible and clean, but will make queries likely more complex. For example, how do i retrieve all entries in feature_store_table having {'product_name': 'cheese', 'customer_class': 'french'} ?


Of course the alternative to all of this would be to use multiple short/wide tables for each feature but for my use case I'd prefer to stick to a tall/narrow format.

Do you have any recommendation on the proposed approaches or any better ones that i've missed? Regardless of the above premise, should i definitely think to move to multiple short/wide tables?

Thanks

0

There are 0 best solutions below