For the problem of named entity recognition:
After tokenizing the sentences, how do you set up the columns? It looks like one column in the documentation is the POS tag, but where do these tags come from? Am I supposed to tag the POS myself, or is there a tool to generate them?
What does the next column represent? A class like PERSON, LOCATION, etc.? And does it have to be in any particular format?
Is there any example of a completed training file and template for NER?
You can find example training and test data in the CRF++ repo here. The training data for noun phrase chunking looks like this:
The columns are arbitrary in that they can be anything. CRF++ requires that every line have the same number of columns (or be blank, to separate sentences); not all CRF packages require that. You will have to provide the data values yourself; they are the data the classifier learns from.
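To make the column rule concrete, here is a minimal sketch that reads column-formatted text and checks that every non-blank line has the same number of columns. The sample rows are made up in the style of chunking data (word, POS tag, chunk label) and are not copied from the CRF++ repository; read_sentences and check_columns are hypothetical helper names. In CRF++ the last column is the label to be learned.

```python
# Illustrative sample: word, POS tag, chunk label (last column = answer tag).
# A blank line separates sentences.
SAMPLE = """\
He PRP B-NP
reckons VBZ B-VP
the DT B-NP
deficit NN I-NP
. . O

Rockwell NNP B-NP
said VBD B-VP
so RB B-ADVP
. . O
"""


def read_sentences(text):
    """Split column-formatted text into sentences (lists of rows)."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():              # blank line ends the sentence
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(line.split())      # columns are whitespace separated
    if current:                           # last sentence without trailing blank
        sentences.append(current)
    return sentences


def check_columns(sentences):
    """Return the shared column count, or raise if rows disagree."""
    widths = {len(row) for sent in sentences for row in sent}
    if len(widths) != 1:
        raise ValueError(f"inconsistent column counts: {sorted(widths)}")
    return widths.pop()


sentences = read_sentences(SAMPLE)
print(len(sentences), "sentences,", check_columns(sentences), "columns each")
```

Running a check like this before training is cheap and catches stray rows early.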
While anything can go in the various columns, one convention you should know is IOB format. To deal with potentially multi-token entities, you mark each token as Inside, Outside, or Beginning. It may be useful to give an example. Pretend we are training a classifier to detect names; for compactness I'll write this on one line:
In columnar format it would look like this:
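As a stand-in illustration, this sketch takes a one-line tagged sentence, prints it in columnar form, and then groups the B and I tokens back into entities. The sentence is made up, the word/TAG notation for the one-line form is an assumption, and to_columns and entities are hypothetical helper names.

```python
# Made-up sentence for a name detector; each token is written as word/TAG.
tagged = "Alice/B Smith/I met/O Bob/B in/O Paris/O ./O"


def to_columns(line):
    """Turn 'word/TAG word/TAG ...' into a list of (word, tag) rows."""
    rows = []
    for item in line.split():
        word, _, tag = item.rpartition("/")   # split on the last slash
        rows.append((word, tag))
    return rows


def entities(rows):
    """Group B/I runs into multi-token entity strings."""
    spans, current = [], []
    for word, tag in rows:
        if tag == "B":                        # a new entity starts here
            if current:
                spans.append(" ".join(current))
            current = [word]
        elif tag == "I" and current:          # continue the open entity
            current.append(word)
        else:                                 # O (or stray I) closes it
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


rows = to_columns(tagged)
for word, tag in rows:
    print(f"{word}\t{tag}")                   # one token per line, tab separated

print(entities(rows))
```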
With these tags, B (beginning) means the word is the first in an entity, I means a word is inside an entity (it comes after a B tag), and O means the word is not an entity. If you have more than one type of entity it's typical to use labels like B-PERSON or I-PLACE.
The reason for using IOB tags is so that the classifier can learn different transition probabilities for starting, continuing, and ending entities. So if you're learning company names it'll learn that Inc./I-COMPANY usually transitions to an O label because Inc. is usually the last part of a company name.
Templates are another problem and CRF++ uses its own special format, but again, there are examples in the source distribution you can look at. Also see this question.
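To give a rough idea of what a template line means: in CRF++ a macro of the form %x[row,col] picks a value from the training data, where row is relative to the current token and col is an absolute column index. The sketch below mimics that expansion in Python. It is not CRF++ code; expand, the sample rows, and the _PAD_ placeholder for out-of-range rows are all made up for illustration.

```python
import re

# Matches macros such as %x[-1,0] or %x[0,1].
MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")


def expand(template, rows, i, pad="_PAD_"):
    """Expand every %x[row,col] in template for the token at index i."""
    def substitute(match):
        row = i + int(match.group(1))     # row offset is relative to token i
        col = int(match.group(2))         # column index is absolute
        if 0 <= row < len(rows):
            return rows[row][col]
        return pad                        # placeholder chosen for this sketch
    return MACRO.sub(substitute, template)


# Feature columns only (word, POS); the label column is left out here.
rows = [["Acme", "NNP"], ["Inc.", "NNP"], ["hired", "VBD"], ["Bob", "NNP"]]

templates = [
    "U00:%x[-1,0]",            # previous word
    "U01:%x[0,0]",             # current word
    "U02:%x[1,0]",             # next word
    "U03:%x[0,1]",             # current POS tag
    "U04:%x[-1,0]/%x[0,0]",    # previous word combined with current word
]

features = [expand(t, rows, 1) for t in templates]
for feature in features:
    print(feature)
```

For the token Inc. this yields features such as U00:Acme and U04:Acme/Inc., which is the kind of context the model conditions on when choosing a label.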
To answer the comment on my answer, you can generate POS tags using any POS tagger. You don't even have to provide POS tags at all, though they're usually helpful. The other labels can be added by hand or automatically; for example, you can use a list of known nouns as a starting point. Here's an example using spaCy for a simple name detector:
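A minimal sketch of that idea follows. It assumes spaCy and its en_core_web_sm model are installed for the POS tagging step; KNOWN_NAMES, iob_rows, and tag_with_spacy are hypothetical names, and the labelling rule (a listed word starts or continues a name) is a deliberately crude starting point, not a recommended detector. The labelling itself is plain Python, so it is shown here with hand-written tokens and tags.

```python
KNOWN_NAMES = {"Alice", "Bob", "Smith"}   # hypothetical list of known names


def iob_rows(tokens, pos_tags, names=KNOWN_NAMES):
    """Label each token B, I, or O by looking it up in a name list."""
    rows, in_name = [], False
    for word, pos in zip(tokens, pos_tags):
        if word in names:
            label = "I" if in_name else "B"   # first hit starts an entity
            in_name = True
        else:
            label, in_name = "O", False
        rows.append((word, pos, label))
    return rows


def tag_with_spacy(text, model="en_core_web_sm"):
    """Tokenize and POS-tag text with spaCy (assumes the model is installed)."""
    import spacy                              # imported lazily on purpose
    nlp = spacy.load(model)
    doc = nlp(text)
    return [tok.text for tok in doc], [tok.tag_ for tok in doc]


# Hand-written tokens and tags so the labelling step runs without spaCy.
tokens = ["Alice", "Smith", "met", "Bob", "."]
pos_tags = ["NNP", "NNP", "VBD", "NNP", "."]

rows = iob_rows(tokens, pos_tags)
for row in rows:
    print(" ".join(row))                      # word, POS tag, IOB label
```

With spaCy available, calling tag_with_spacy("Alice Smith met Bob.") would supply the tokens and POS tags in place of the hand-written lists. Note that this rule merges two different names that happen to be adjacent, so the output is best treated as a first pass to correct by hand.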