Free Tagged Corpus for Named Entity Recognition

10.7k Views Asked by DantheMan At 25 July 2010 at 17:27

I am looking for a free tagged corpus to train a Named Entity Recognition system on. Most of the ones I find (like the New York Times one) are expensive and not open. Can anyone help?

There are 3 best solutions below

ankitjaininfo On 25 July 2010 at 17:35

DBpedia is open and free.

DBpedia is built from Wikipedia and is a very large corpus. Build a Lucene index on the triples involving rdfs:label from the DBpedia titles dump.
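
As a rough illustration of the indexing idea, the sketch below swaps Lucene for a plain in-memory Python dictionary. It assumes the labels dump is in N-Triples form with one rdfs:label triple per line. The sample lines, the regular expression, and the names `build_label_index` and `sample_lines` are hypothetical, made up for this example.

```python
import re

RDFS_LABEL = "<http://www.w3.org/2000/01/rdf-schema#label>"

# Matches one N-Triples line of the form: <subject> <rdfs:label> "literal"@lang .
LABEL_LINE = re.compile(
    r'^<([^>]+)>\s+' + re.escape(RDFS_LABEL) +
    r'\s+"((?:[^"\\]|\\.)*)"(?:@([A-Za-z-]+))?\s*\.$'
)

def build_label_index(lines, lang="en"):
    """Map each lowercased label to the set of resource URIs that carry it."""
    index = {}
    for line in lines:
        match = LABEL_LINE.match(line.strip())
        if not match:
            continue  # skip comments, blank lines, and non-label triples
        subject, label, label_lang = match.groups()
        if label_lang != lang:
            continue  # keep only labels in the requested language
        index.setdefault(label.lower(), set()).add(subject)
    return index

# Hand-written sample in the assumed dump format (not real dump output).
sample_lines = [
    '<http://dbpedia.org/resource/Berlin> <http://www.w3.org/2000/01/rdf-schema#label> "Berlin"@en .',
    '<http://dbpedia.org/resource/Berlin> <http://www.w3.org/2000/01/rdf-schema#label> "Berlin"@de .',
    '<http://dbpedia.org/resource/Alan_Turing> <http://www.w3.org/2000/01/rdf-schema#label> "Alan Turing"@en .',
]

label_index = build_label_index(sample_lines)
print(label_index.get("alan turing"))
```

A dictionary like this only supports exact lookups. A real Lucene index would add tokenization and fuzzy matching, which matters when entity mentions in text do not match the titles exactly.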

AndreiM On 20 March 2011 at 23:00

NLTK (Python) provides the nltk.corpus.conll2000 corpus. Calling conll2000.iob_words() returns a list of (word, part-of-speech, IOB) triples, where IOB is a tag in the Inside/Outside/Beginning format. Note that CoNLL-2000 is a chunking corpus, so its IOB tags mark phrase chunks (such as NP and VP) rather than named entities.

The corpus contains about 250k words of newswire-style text.
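
To show how such (word, part-of-speech, IOB) triples can be consumed, here is a minimal sketch that groups consecutive B-/I- tags into spans. The sample triples are hand-written in the same shape, using chunk labels (NP, VP) like those in the CoNLL-2000 data. The function name `iob_to_chunks` is hypothetical, and the NLTK calls appear only in comments so the block runs without the corpus installed.

```python
# With NLTK installed, the real triples could be loaded like this:
#   import nltk
#   nltk.download("conll2000")
#   from nltk.corpus import conll2000
#   triples = conll2000.iob_words()

def iob_to_chunks(triples):
    """Group (word, pos, iob_tag) triples into (chunk_type, text) spans."""
    chunks = []
    current_type, current_words = None, []
    for word, _pos, tag in triples:
        prefix, _, chunk_type = tag.partition("-")
        # A span ends on "O", on "B-", or on an "I-" tag of a different type.
        starts_new = prefix == "B" or (prefix == "I" and chunk_type != current_type)
        if prefix == "O" or starts_new:
            if current_words:
                chunks.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
        if prefix in ("B", "I"):
            current_type = chunk_type
            current_words.append(word)
    if current_words:
        chunks.append((current_type, " ".join(current_words)))
    return chunks

# Hand-written sample triples (assumed shape, not loaded from the corpus).
sample_triples = [
    ("He", "PRP", "B-NP"),
    ("reckons", "VBZ", "B-VP"),
    ("the", "DT", "B-NP"),
    ("current", "JJ", "I-NP"),
    ("account", "NN", "I-NP"),
    ("deficit", "NN", "I-NP"),
    (".", ".", "O"),
]

chunk_spans = iob_to_chunks(sample_triples)
print(chunk_spans)
```

The same grouping logic applies unchanged to NER-style tags such as B-PER or I-ORG, since only the label after the hyphen differs.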

Tom Morris On 12 July 2012 at 20:42

There's a list of corpora at http://www.cs.technion.ac.il/~gabr/resources/data/ne_datasets.html

The CoNLL 2003 corpus, which is on that list, is free and is available from http://www.cnts.ua.ac.be/conll2003/ner/ (annotations) and NIST (text).
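
Once the annotations and text are combined, the data is usually a plain column file. The sketch below reads such a file under the assumption of the common four-column layout (word, part-of-speech tag, chunk tag, named-entity tag), with blank lines between sentences and `-DOCSTART-` lines between documents. The function name `read_conll_sentences` and the sample lines are illustrative, so check them against the files you actually obtain.

```python
def read_conll_sentences(lines):
    """Return a list of sentences, each a list of (word, ne_tag) pairs."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        # Blank lines end a sentence; -DOCSTART- lines mark document boundaries.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        # Assumed columns: word, POS tag, chunk tag, named-entity tag.
        word, _pos, _chunk, ne_tag = line.split()
        current.append((word, ne_tag))
    if current:
        sentences.append(current)
    return sentences

# Hand-written sample in the assumed layout.
sample_conll = [
    "-DOCSTART- -X- O O",
    "",
    "EU NNP I-NP I-ORG",
    "rejects VBZ I-VP O",
    "German JJ I-NP I-MISC",
    "call NN I-NP O",
    "",
    "Peter NNP I-NP I-PER",
    "Blackburn NNP I-NP I-PER",
]

conll_sentences = read_conll_sentences(sample_conll)
print(len(conll_sentences), conll_sentences[0][0])
```

In practice the lines would come from an open file object, which can be passed to the function directly since it is iterated line by line.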
