Using Turi to create a simple text classifier

To get familiar with Turi, I'm trying to create a model that can distinguish between strings consisting of letters and strings consisting of digits. I have a CSV file with training data. Each line consists of two entries: a string and an indicator of whether this string is a number or a plain string.

String, isNumber
bvmuuflo , 0
71047015 , 1

My Python script to generate the model looks like this:

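import graphlab as gl

# Load the training data (columns: String, isNumber)
data = gl.SFrame('data.csv')
# Train a classifier that predicts isNumber from the String column
model = gl.classifier.create(data, target="isNumber", features=["String"])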

This works fine, but I have no idea how to use the model to check, for example, whether "qwerty" is a string or a number. I'm trying to use the model.classify(...) API call, but the two calls

model.classify(gl.SFrame(["qwertzui"]))

and

model.classify(gl.SFrame(["98765432"]))

return the same result:

Columns:
    class   int
    probability float

Rows: 1

Data:
+-------+----------------+
| class |  probability   |
+-------+----------------+
|   1   | 0.509227594584 |
+-------+----------------+
[1 rows x 2 columns]

Obviously there is a mistake in my program, but I'm not able to find it. Any help is welcome!

There is 1 solution below

Since the model has only one column to train on, and that column holds the whole string, it can identify strings it has already seen but not ones it has never seen. My guess is that 0.509 is the fraction of your training rows labeled 1 (numbers), so the model simply returns that for anything it has not seen before.
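To make that guess concrete, here is a toy stand-in written in plain Python. It is not how GraphLab works internally; the three training rows, the classify helper, and the fallback to the majority class are all assumptions made up for illustration. It only shows why a model keyed on whole strings gives the same answer for every unseen input.

```python
from collections import Counter

# Toy stand-in for a classifier whose only feature is the whole string.
# NOT GraphLab's actual algorithm; the data below is made up for illustration.
training = [("bvmuuflo", 0), ("71047015", 1), ("98001234", 1)]

seen = dict(training)  # exact string -> label it was trained with
label_counts = Counter(label for _, label in training)
majority = label_counts.most_common(1)[0][0]
# Share of training rows that carry the majority label (the class prior).
prior = float(label_counts[majority]) / len(training)

def classify(text):
    """Return (label, confidence): exact match if seen, else the majority class."""
    if text in seen:
        return seen[text], 1.0
    return majority, prior

# Neither string was in the training data, so both get the same fallback answer.
print(classify("qwertzui"))
print(classify("98765432"))
```

Both calls return the majority label with the same confidence, which mirrors the identical output in the question.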

This is obviously a toy example, but if you want to get it to work I would use something like a bag of words, but for characters. Make 36 columns named a, b, c, ..., z, 0, 1, ..., 9 and, for each row, store how many times each character occurs in the string. This way the model can treat each individual character as evidence for a class instead of looking at the string as a whole.
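A minimal sketch of that character-count idea, using only the standard library. The helper names char_counts and build_columns are hypothetical, and the sketch assumes lowercase letters and digits are the only characters of interest (everything else, including the trailing spaces in the CSV sample, is ignored).

```python
from collections import Counter
import string

# The 36 feature columns suggested above: a..z followed by 0..9.
ALPHABET = string.ascii_lowercase + string.digits

def char_counts(text):
    """Return a dict mapping each of the 36 characters to its count in text."""
    counts = Counter(text.strip().lower())
    return {ch: counts.get(ch, 0) for ch in ALPHABET}

def build_columns(strings):
    """Build a column-oriented dict: one list of counts per character."""
    rows = [char_counts(s) for s in strings]
    return {ch: [row[ch] for row in rows] for ch in ALPHABET}

# The two sample rows from the question, trailing spaces included.
columns = build_columns(["bvmuuflo ", "71047015 "])
print(columns["u"])  # count of 'u' in each string
print(columns["7"])  # count of '7' in each string
```

The resulting dict, together with the isNumber column, could presumably be passed to gl.SFrame and the 36 column names listed as features in gl.classifier.create. That last step is an assumption and is not verified here. A new string such as "qwertzui" would need the same char_counts transformation before being handed to model.classify.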