AWS Personalize items attributes

I'm trying to implement personalization and am having problems with the Items schema.

Imagine I'm Amazon: I have products, their brands, and their categories. What kind of Items schema should I use to include this information?

Should I include the brand name as a categorical string field? Should I instead include the brand ID, as a string or as a number? Or should I include both?

What about categories? I have the same questions.

Metadata Fields

Metadata includes string or non-string fields that aren't required or don't use a reserved keyword. Metadata schemas have the following restrictions:

Users and Items schemas require at least one metadata field.

Users and Interactions datasets can contain up to five metadata fields. An Items dataset can contain up to 50 metadata fields.

If you add your own metadata field of type string, it must include the categorical attribute. Otherwise, Amazon Personalize won't use the field when training a model.

https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html

There is 1 best solution below

There are essentially two ways to include your metadata in the Items/Users datasets:

  1. If it can be represented as a numeric value, provide the actual value, as long as that makes sense.
  2. If it can be represented as a string, provide the string value and make sure that categorical is set to true.
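
To make the two options concrete, here is a minimal sketch of an Items schema built in Python and serialized to the JSON that Personalize expects. The field names AVERAGE_RATING and BRAND_ID are my own assumptions for illustration; only ITEM_ID is a reserved field.

```python
import json

# Sketch of an Items schema. AVERAGE_RATING and BRAND_ID are
# hypothetical metadata fields chosen for this example.
items_schema = {
    "type": "record",
    "name": "Items",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "ITEM_ID", "type": "string"},
        # Option 1: a numeric value that is meaningful as a number.
        {"name": "AVERAGE_RATING", "type": "float"},
        # Option 2: a string value, flagged as categorical.
        {"name": "BRAND_ID", "type": "string", "categorical": True},
    ],
    "version": "1.0",
}

# json.dumps turns Python's True into the JSON literal true.
schema_json = json.dumps(items_schema, indent=2)
print(schema_json)
```

The resulting string is what you would pass as the schema when creating it, for example through the `schema` argument of boto3's `create_schema` call.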

But let's look at the question "Why do they need me to categorize my string metadata?". The answer is pretty simple.

Let's start with an example. If your Items were Amazon.com products and you wanted to provide a ratings metadata field, then:

  1. You could take all of the ratings, including the full review text sent by customers, and simply put them in a metadata field.
  2. You could take just the star ratings, calculate the average, and put that in a metadata field.

The second one probably makes more sense in general. Having random, long product reviews as metadata changes pretty much nothing. Personalize doesn't understand whether the review itself is good or bad, or whether the author also recommends another product, so it doesn't really add anything to the recommendations.

However, if you simply "cut" your dataset and calculate the average rating, as in the second point, then it makes a lot more sense. Maybe some of our customers like crappy products? Maybe they want to buy them because they are famous YouTubers who create videos about them? Based on their previous interactions and much more, Personalize will be able to perform slightly better, because now it knows that this product has a rating of 5/5 or 3/5.
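
As a rough sketch of that "cut", assuming your reviews are available as (item ID, stars) pairs (the function name and data shape are made up for this example):

```python
from collections import defaultdict


def average_ratings(reviews):
    """Reduce (item_id, stars) pairs to one mean star rating per item."""
    totals = defaultdict(lambda: [0, 0])  # item_id -> [sum of stars, count]
    for item_id, stars in reviews:
        totals[item_id][0] += stars
        totals[item_id][1] += 1
    return {item_id: round(s / n, 2) for item_id, (s, n) in totals.items()}


# Hypothetical sample data: the review text is dropped, only stars are kept.
reviews = [("item-1", 5), ("item-1", 4), ("item-2", 3)]
print(average_ratings(reviews))  # {'item-1': 4.5, 'item-2': 3.0}
```

The resulting number per item is what would go into a numeric metadata column of the Items dataset.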

I wanted to show you that in some cases providing Items metadata as a string makes no sense. That's why your string metadata must be categorical. It means that it should come from a finite set of values, so that it adds some knowledge for Personalize about a given Item and about why some people might want to interact with it.

Going back to your question:

Should I include the brand name as a categorical string field? Should I instead include the brand ID, as a string or as a number? Or should I include both?

I would simply go with the brand ID as a string. You could also go with the brand name, but a brand can be renamed while still being the same brand, so the ID is more stable. Also, two different brands could share the same name because they operate in different markets, and using the ID solves that.
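
A small sketch of what that looks like when preparing the Items CSV. The product records and the BRAND_ID column name are hypothetical. Note that the CSV itself carries no types, so it is the schema that has to declare BRAND_ID as a categorical string.

```python
import csv
import io

# Hypothetical catalog: two different brands that share a display name.
products = [
    {"item_id": "p1", "brand_id": 101, "brand_name": "Acme"},
    {"item_id": "p2", "brand_id": 202, "brand_name": "Acme"},
]


def items_csv(products):
    """Write ITEM_ID and BRAND_ID columns; the brand name is left out."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["ITEM_ID", "BRAND_ID"])
    for p in products:
        # The ID keeps the two "Acme" brands apart where the name would not.
        writer.writerow([p["item_id"], str(p["brand_id"])])
    return buf.getvalue()


print(items_csv(products))
```

With the brand name as the value, both rows would have fallen into the same category.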

The "categorical": true switch in your schema just tells Personalize:

Hey, do you see that string field? It's a categorized, finite set of values. If you train a model for me, please include this one during training, it's important!

And as the documentation says, if you provide a string metadata field that is not marked as categorical, then Personalize will "think":

Hmm... this field is a string, it has pretty random values, and it's not marked as categorical. It's probably just a leftover from an Items export job. Let's ignore it.
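
If you want to catch such fields before training, a quick check over your schema could look like the sketch below. The helper name, the DESCRIPTION field, and the set of reserved names are my own assumptions; the check only encodes the rule quoted from the documentation above.

```python
def ignored_string_fields(schema, reserved=("ITEM_ID",)):
    """Return names of string metadata fields not marked as categorical."""
    ignored = []
    for field in schema["fields"]:
        # A field type may be a single name or a union such as ["null", "string"].
        types = field["type"] if isinstance(field["type"], list) else [field["type"]]
        if field["name"] in reserved or "string" not in types:
            continue
        if not field.get("categorical", False):
            ignored.append(field["name"])
    return ignored


# Hypothetical schema: DESCRIPTION is a string without the categorical flag.
schema = {
    "type": "record",
    "name": "Items",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "ITEM_ID", "type": "string"},
        {"name": "BRAND_ID", "type": "string", "categorical": True},
        {"name": "DESCRIPTION", "type": "string"},
    ],
    "version": "1.0",
}

print(ignored_string_fields(schema))  # ['DESCRIPTION']
```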