How can I create my own model in the Stanford POS Tagger?


I want to add new tagged words (local words that are used in our region) and create a new model. I created a .props file from the command line, but how can I create a .tagger file?

When I tried to create such a file as described on the Stanford website, it showed an error like:

"No model specified"

What is the -model argument? Is it the corpus? How can I add my new tagged words to it?

How do I train a tagger, then?

The Stanford site says that:

You need to start with a .props file which contains options for the tagger to use. The .props files we used to create the sample taggers are included in the models directory; you can start from whichever one seems closest to the language you want to tag.

For example, to train a new English tagger, start with the left3words tagger props file. To train a tagger for a western language other than English, you can consider the props files for the German or the French taggers, which are included in the full distribution. For languages using a different character set, you can start from the Chinese or Arabic props files. Or you can use the -genprops option to MaxentTagger, and it will write a sample properties file, with documentation, for you to modify. It writes it to stdout, so you'll want to save it to some file by redirecting output (usually with >). The # at the start of the line makes things a comment, so you'll want to delete the # before properties you wish to specify.


There are 2 best solutions below


The model property specifies the file to which the built model will be saved. You can provide any valid path, e.g. mymodel.tagger.

You can use this same properties file at test time, and MaxentTagger will then load from the specified model file rather than saving to it.

To be clear: your training corpus should be provided with the property trainFile. See the tagger properties files included with the Stanford Tagger for examples.
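
To make the steps concrete, here is a minimal sketch in Python that only assembles the command lines; it does not run Java itself. The class name edu.stanford.nlp.tagger.maxent.MaxentTagger and the -genprops and -props options are the ones described in the Stanford tagger documentation. The jar name stanford-postagger.jar, the file name mytagger.props, and the helper function names are assumptions for illustration, so adjust them to your installation.

```python
import shlex

# Main class of the Stanford POS Tagger.
TAGGER_CLASS = "edu.stanford.nlp.tagger.maxent.MaxentTagger"


def genprops_command(jar="stanford-postagger.jar"):
    """Command that prints a documented sample properties file to stdout."""
    return ["java", "-classpath", jar, TAGGER_CLASS, "-genprops"]


def train_command(props_path, jar="stanford-postagger.jar"):
    """Command that trains a tagger from a properties file.

    The properties file is expected to set `model` (where the trained
    tagger is saved) and `trainFile` (where the tagged corpus is read from).
    """
    return ["java", "-classpath", jar, TAGGER_CLASS, "-props", props_path]


if __name__ == "__main__":
    # Step 1: redirect the sample properties to a file, then edit that file.
    print(shlex.join(genprops_command()) + " > mytagger.props")
    # Step 2: train; the result is written to the path given by `model`.
    print(shlex.join(train_command("mytagger.props")))
```

With model = mymodel.tagger and a trainFile entry in mytagger.props, the second command should write mymodel.tagger instead of stopping with "No model specified".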


Here are two links that can help you, describing step-by-step instructions on how to create (train) your tagger:

  1. https://medium.com/@klintcho/training-a-swedish-pos-tagger-for-stanford-corenlp-546e954a8ee7
  2. http://www.florianboudin.org/wiki/doku.php?id=nlp_tools_related&DokuWiki=9d6b70b2ee818e600edc0359e3d7d1e8

Please note that inside the config file, the trainFile property should point to your treebank (that is, real-world sentences parsed in a dependency tree format with POS tags and dependency relations). On that same line you should also specify the file format:

  1. TEXT // represents a tokenized file separated by text
  2. TSV // represents a tsv file such as a conll file
  3. TREES // represents a file in PTB format

In my case, I used a CoNLL file, which is a tab-separated values (TSV) format. I must confess that I couldn't find clear documentation and had to resort to reading the source code.
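
If you are starting from your own list of locally tagged words rather than an existing CoNLL file, a small script can produce a TSV corpus. The sketch below is only an illustration under some assumptions: one token per line, word and tag separated by a tab, a blank line after each sentence, and zero-based column indices (which is consistent with wordColumn=1, tagColumn=4 for a CoNLL file). The function names, the sample sentences, and the file name local-train.tsv are hypothetical.

```python
from pathlib import Path


def write_tsv_corpus(sentences, path):
    """Write tagged sentences as word<TAB>tag lines, one token per line.

    A blank line is written after each sentence.
    """
    lines = []
    for sentence in sentences:
        for word, tag in sentence:
            lines.append(f"{word}\t{tag}")
        lines.append("")  # blank line marks the end of a sentence
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")


def train_file_value(path, word_column=0, tag_column=1):
    """Build the value of the `trainFile` property for a TSV corpus."""
    return f"format=TSV,wordColumn={word_column},tagColumn={tag_column},{path}"


if __name__ == "__main__":
    # Hypothetical sample data; replace with your own tagged sentences.
    corpus = [
        [("The", "DT"), ("dog", "NN"), ("barks", "VBZ")],
        [("Cats", "NNS"), ("sleep", "VBP")],
    ]
    write_tsv_corpus(corpus, "local-train.tsv")
    print("trainFile = " + train_file_value("local-train.tsv"))
```

The printed line can be pasted into the config in place of the trainFile entry.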

My config:

model = portuguese.tagger
arch = left3words,naacl2003unknowns,allwordshapes(-1,1)
trainFile = format=TSV,wordColumn=1,tagColumn=4,C:\\path\\universal-dev.conll
closedClassTagThreshold = 40
curWordMinFeatureThresh = 2
tagSeparator = _
# I based my config on the Spanish one, which explains some of these values
encoding = utf-8
iterations = 100
lang = spanish
learnClosedClassTags = false
minFeatureThresh = 2
openClassTags = 
rareWordMinFeatureThresh = 10
rareWordThresh = 5
search = qn
sgml = false
sigmaSquared = 0.0
regL1 = 0.75
tokenize = true
tokenizerOptions = asciiQuotes
verbose = false
verboseResults = false
veryCommonWordThresh = 250
xmlInput = null
outputFormat = slashTags
nthreads = 16
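
One caveat when annotating a config like this: in standard Java .properties syntax, # starts a comment only when it is the first non-blank character on a line, so a # written after a value becomes part of that value. Assuming the tagger reads the file with those rules (an assumption, not verified against the tagger source), the following Python sketch flags such lines. It is a simplified check that only handles key = value lines and ignores line continuations; the function name is hypothetical.

```python
def find_inline_hash(props_text):
    """Return (line_number, key) pairs whose value contains a '#'.

    In Java .properties syntax, '#' (or '!') starts a comment only as the
    first non-blank character of a line, so a '#' after a value is kept
    as part of the value.
    """
    hits = []
    for number, raw in enumerate(props_text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue  # blank line or whole-line comment
        key, _, value = line.partition("=")
        if "#" in value:
            hits.append((number, key.strip()))
    return hits
```

Any line it reports is safer with the comment moved to a line of its own.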