Python based multi-label Classification

I have a data set like the one shown below. In the real scenario it will have between 10,000 and 1,000,000 rows. There will be more columns, but the core problem revolves around these two fields.

Known Labels

The known categories are 'Apple', 'Blueberry', 'Orange', and 'Lettuce'.

Dataset

import pandas as pd

# Sample data: rows 1-4 carry known categories, rows 5-10 do not.
df = pd.DataFrame({
    'ROWID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Category': ['Apple', 'Blueberry', 'Orange', 'Lettuce', 'Fruit',
                 'Salad', 'xyz', 'Fruit', 'Leaf', 'Avocado'],
    'Details': ['Eat one a day ,doctors keep away', 'Like it in a  muffin',
                'Tastes yummy', 'Like it with salmon', 'Glass of a juice',
                'Ceser dressing  on  lettuce', 'Nothing in my basket',
                'Like it in a muffin', 'I like it  it with  salami',
                'Comes from Mexico']
})

Problem:

I have to create one or more metrics using groupby on Category.

When the Category column holds an unknown value, I need to read the text in 'Details' and predict the best-suited label for the category. For example:

  • Salad -> Lettuce
  • Fruit (Row #5) -> Orange
  • Fruit (Row #8) -> Blueberry
  • Leaf (Row #9) -> Lettuce

It is understood that some of the rows cannot be categorized.

Help Needed:

I am new to data science algorithms and am looking for guidance on identifying the right model to solve this problem.

Best Answer

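Use a Naive Bayes classifier on the Details column. Before that, do a simple filter on the Category column to separate out the rows that already have a known category value. Those rows can serve as the labeled training data, and only the remaining rows need a predicted label.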
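To make the suggestion concrete, here is a minimal sketch of the idea using only the Python standard library: a hand-written multinomial Naive Bayes with add-one smoothing, trained on the rows whose Category is already known. The helper names (tokenize, train, predict) and the min_confidence cutoff of 0.5 are assumptions made for illustration and are not part of the original answer. In practice, scikit-learn's CountVectorizer combined with MultinomialNB would be the more usual way to do this on a pandas DataFrame.

```python
import math
import re
from collections import Counter, defaultdict

KNOWN = {'Apple', 'Blueberry', 'Orange', 'Lettuce'}

# (ROWID, Category, Details) taken from the sample data in the question.
rows = [
    (1, 'Apple', 'Eat one a day ,doctors keep away'),
    (2, 'Blueberry', 'Like it in a  muffin'),
    (3, 'Orange', 'Tastes yummy'),
    (4, 'Lettuce', 'Like it with salmon'),
    (5, 'Fruit', 'Glass of a juice'),
    (6, 'Salad', 'Ceser dressing  on  lettuce'),
    (7, 'xyz', 'Nothing in my basket'),
    (8, 'Fruit', 'Like it in a muffin'),
    (9, 'Leaf', 'I like it  it with  salami'),
    (10, 'Avocado', 'Comes from Mexico'),
]


def tokenize(text):
    """Lowercase the text and keep runs of letters as tokens."""
    return re.findall(r'[a-z]+', text.lower())


def train(labeled):
    """Count documents and words per class from (label, text) pairs."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in labeled:
        tokens = tokenize(text)
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab


def predict(model, text, min_confidence=0.5):
    """Return the most probable label, or None if the model is unsure.

    min_confidence is an illustrative cutoff on the posterior probability.
    """
    class_docs, word_counts, vocab = model
    tokens = [t for t in tokenize(text) if t in vocab]
    if not tokens:
        return None  # no overlap with the training vocabulary
    total_docs = sum(class_docs.values())
    log_scores = {}
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)  # class prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for t in tokens:
            # Add-one (Laplace) smoothed word likelihood.
            score += math.log((word_counts[label][t] + 1) / denom)
        log_scores[label] = score
    best = max(log_scores, key=log_scores.get)
    top = log_scores[best]
    # Normalize in log space to get the posterior of the best label.
    posterior = 1.0 / sum(math.exp(s - top) for s in log_scores.values())
    return best if posterior >= min_confidence else None


# Filter step: rows with a known category become the training set.
labeled = [(cat, text) for _, cat, text in rows if cat in KNOWN]
model = train(labeled)

# Keep known categories as they are; predict the rest from Details.
resolved = {
    rowid: (cat if cat in KNOWN else predict(model, text))
    for rowid, cat, text in rows
}

# A simple groupby-style metric: row count per resolved category.
counts = Counter(resolved.values())
print(resolved)
print(counts)
```

With only the four labeled sample rows, this sketch assigns Blueberry to row 8 and Lettuce to row 9 and leaves rows 5, 6, 7, and 10 uncategorized (None). Rows such as 'Ceser dressing  on  lettuce' share no words with the training text, so a realistic run would need many more labeled rows per category before the predictions become useful.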