I'm building a recommender system. I have the following table that stores the features of each item:
class Item_feature(db.Model):
    __tablename__ = "Item_feature"

    item_id = db.Column(db.Integer, primary_key=True)
    feature_id = db.Column(db.Integer, primary_key=True)
    weight = db.Column(db.Integer)
I didn't need to create an Item table, since the only important information about an item is its ID (besides its features). That's why item_id is not a foreign key.
Each item could have, say, 40 features, so with 200k items there would be millions of rows. I need to keep the items in memory to avoid fetching them every time a suggestion has to be made. I haven't found a better representation for these features, so ideas in that regard would be very useful as well.
When initializing the system I need to get all features of all items. If the rows were not sorted (or at least grouped) by item_id, I would have to look up the item with the matching item_id on every iteration over the list of features.
I'm using this query with SQLAlchemy to get the features:
features = model.Item_feature.query.order_by(model.Item_feature.item_id).all()
But this might be really slow. Using the following:
features = model.Item_feature.query.all()
doesn't guarantee that the results will be sorted; the order seems to depend on the order in which rows were added to the DB.
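I guess something like this might work better, ordering by both primary-key columns (so the composite primary-key index can be used) and streaming rows in batches with yield_per instead of building one huge list, though I'm not sure it's the right approach:

features = (
    model.Item_feature.query
    .order_by(model.Item_feature.item_id, model.Item_feature.feature_id)
    .yield_per(1000)  # fetch rows in chunks of 1000 instead of loading all at once
)
for f in features:
    # rows arrive grouped by item_id
    ...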
If features were grouped or sorted I could do something more efficient, using a single loop, like:
item = None
item_id = None
for f in features:
    # Add features to the current item until item_id changes
    if item_id != f.item_id:
        item_id = f.item_id
        item = Item(item_id)
        self.items[item_id] = item  # add the new item to the dictionary of items
    item.new_feature()
As shown in the code, I'm using a dictionary to store the items. I'm not really sure if this is a good choice.
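For example, I guess something like this with itertools.groupby would also work, assuming the rows come back sorted (or at least grouped) by item_id (Item and new_feature are my own class/method):

from itertools import groupby
from operator import attrgetter

# Assumes `features` is ordered (or at least grouped) by item_id
for item_id, rows in groupby(features, key=attrgetter("item_id")):
    item = Item(item_id)
    self.items[item_id] = item  # same dictionary of items as above
    for f in rows:
        item.new_feature()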
So, how can I get the rows grouped (or sorted) by the first primary key, item_id, in an efficient way, considering the potentially huge number of rows?
To store machine-learning weights, I'd try a "NoSQL" approach (in PostgreSQL, or even a plain file), because it's impractical to deal with millions of weights on a row-by-row basis.
The idea is: when the ML is done training/retraining, store your weights in exactly the format (structure, ordering, grouping, etc.) in which the app will need to read them back to make predictions/recommendations.
This amortizes the cost of restructuring the data into the training phase, where it's paid once at write time, and avoids restructuring the data when the "system initializes", as you put it.
PostgreSQL's JSONB type can hold up to 1 GB per value, and it can store dictionaries and lists, so you can preserve the structure you're computing in your Python snippet. Or, if you also want to avoid deserializing the weights into the format used by your model, you can store them as a binary blob.
Otherwise, storing the weights in the same format in a file works too.
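A rough sketch of the JSONB idea, with made-up table/column names, using SQLAlchemy's PostgreSQL dialect:

from sqlalchemy.dialects.postgresql import JSONB

class ItemWeights(db.Model):
    # Hypothetical table: one row per item, all of its feature weights in one JSONB value
    __tablename__ = "item_weights"
    item_id = db.Column(db.Integer, primary_key=True)
    features = db.Column(JSONB)  # e.g. {"12": 3, "57": 1, ...} mapping feature_id -> weight

# After training, write each item's weights already grouped the way the app reads them:
db.session.add(ItemWeights(item_id=42, features={"12": 3, "57": 1}))
db.session.commit()

# At startup, each row comes back pre-grouped, one per item:
items = {row.item_id: row.features for row in ItemWeights.query.all()}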