IBM Watson NLC - Training with more than 20,000 text examples?


We're currently developing a system that returns an ICD10-CM code (ICD10-CM is a medical diagnosis coding system) for a text input. Example:

  • input 'Black Eye'
  • return 'H44 - Disorders of the globe'

The problem is that ICD10-CM has 70,000 to 100,000 codes, and Watson NLC won't let me train the model after I upload all of those text examples from .csv files.

Is using multiple models a solution or should I switch to Google's AutoML?

Best answer:

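If you have 70-100k codes (classes), you will not be able to train a useful model with only 20k examples: that is fewer than one example per class, so most codes would never be seen during training. For comparison, the ImageNet dataset has about 20k categories but also 14 million examples, roughly 700 per category.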

I do not know if ICD10-CM has broader categories, but if it does you could train a model to predict those.
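As a sketch of that idea: in ICD-10-CM the first three characters of a code name its category (the `H44` in the question is one), and the characters after the dot add detail. Assuming the labels are plain strings such as `H44.811`, they could be collapsed to categories before training. The function name `to_category` and the sample rows are hypothetical, and the text/code pairings are only illustrative, not checked against the official code set.

```python
from collections import Counter


def to_category(code: str) -> str:
    """Return the three-character category of an ICD-10-CM style code."""
    # Normalise case and whitespace, drop the dot, keep the first three characters.
    return code.strip().upper().replace(".", "")[:3]


# Hypothetical (text, full code) training rows, for illustration only.
rows = [
    ("hemorrhage inside the eyeball", "H44.811"),
    ("degenerative myopia", "H44.20"),
    ("bruise around the eye", "S00.10XA"),
]

# Relabel every example with its broader category.
coarse_rows = [(text, to_category(code)) for text, code in rows]

# Count examples per category to see how much data each class now has.
category_counts = Counter(label for _, label in coarse_rows)
```

Training on `coarse_rows` instead of `rows` reduces the number of classes considerably, at the cost of returning a category rather than a full code.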

Another option is to limit yourself to codes that occur at least 100 times in your examples and put all others into a single catch-all class. The trade-off is that many inputs will fall into that class, and for those you will not be able to return a code.
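A minimal sketch of that relabeling step, assuming the labels are available as a plain Python list. The helper name `collapse_rare_labels`, the `"OTHER"` label, and the toy label counts are assumptions made for illustration.

```python
from collections import Counter


def collapse_rare_labels(labels, min_count=100, other="OTHER"):
    """Replace every label seen fewer than min_count times with a catch-all label."""
    counts = Counter(labels)
    return [label if counts[label] >= min_count else other for label in labels]


# Toy label column: two frequent codes and two rare ones.
labels = ["H44"] * 150 + ["S00"] * 120 + ["A01"] * 3 + ["B02"] * 1

collapsed = collapse_rare_labels(labels)

# Frequent codes keep their label; the four rare examples share "OTHER".
kept = Counter(collapsed)
```

With 20k examples and a threshold of 100, at most 200 distinct codes can survive this filter, which shows how small the covered part of ICD10-CM would be.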

In any case, I think using a model trained on only 20k examples for actual medical purposes would be dangerous.