How Can I Access the Brown Corpus in Java (aka outside of NLTK)

1.1k Views Asked by Nate Cook3 At 06 June 2015 at 17:03

I'm trying to write a Java program that works with natural-language parts of speech. I've been searching on Google and haven't found the entire Brown Corpus (or another corpus of tagged words). I keep finding NLTK information, which I'm not interested in. I want to be able to load the data into a Java program and count the occurrences of each word, along with the percentage likelihood that the word is each part of speech.

I do not want to use a Java library like the Stanford one; I want to play with the corpus data myself.


There are 3 best solutions below

markspace On 06 June 2015 at 17:18 BEST ANSWER

Here's a link to the download page for the Brown Corpus: http://www.nltk.org/nltk_data/

All the files are zip files. The data format is described in the Wikipedia article on the Brown Corpus. I dunno what else to say. From there things should be obvious.
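As a rough illustration of that format: a minimal sketch, assuming the plain-text layout used in the NLTK distribution, where each token is written as `word/tag` and tokens are separated by whitespace. The class name `BrownLineParser`, the `TaggedWord` holder, and the sample line in `main` are made up for this example; check a real file from the download before relying on it.

```java
import java.util.ArrayList;
import java.util.List;

public class BrownLineParser {

    /** One token from the corpus: the surface word and its part-of-speech tag. */
    public static final class TaggedWord {
        public final String word;
        public final String tag;

        TaggedWord(String word, String tag) {
            this.word = word;
            this.tag = tag;
        }
    }

    /** Splits one line of "word/tag word/tag ..." text into tagged words. */
    public static List<TaggedWord> parseLine(String line) {
        List<TaggedWord> result = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            // The tag follows the LAST slash, so a word that itself
            // contains a slash still parses.
            int slash = token.lastIndexOf('/');
            if (slash <= 0 || slash == token.length() - 1) {
                continue; // skip empty or malformed tokens
            }
            result.add(new TaggedWord(token.substring(0, slash),
                                      token.substring(slash + 1)));
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative line only (assumed format, not copied from the corpus).
        for (TaggedWord tw : parseLine("The/at jury/nn said/vbd ./.")) {
            System.out.println(tw.word + " -> " + tw.tag);
        }
    }
}
```

Splitting on the last slash rather than the first is a deliberate choice here, so that punctuation tokens such as `./.` come out as word `.` with tag `.`.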

EDIT: if you want the original source data, I think there are some corpora out there that include theirs. However, usually the point is to let someone else do the sampling. Also, note this from the Wikipedia entry: "Each sample began at a random sentence-boundary in the article or other unit chosen, and continued up to the first sentence boundary after 2,000 words." So the data for the Brown Corpus is essentially randomized. Even if you had the original texts you might not be able to guess where they sampled.

bmargulies On 06 June 2015 at 17:08

Data is data. The NLTK data is not in an obscure, encrypted, or difficult format. Just write Java code to read it. You might find a shortcut in WEKA, or you might not.
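To make that concrete, here is one way such reading code might look: a sketch that tallies how often each word carries each tag and turns the tallies into percentages. It assumes the NLTK download has been unzipped into a directory of plain-text files with whitespace-separated `word/tag` tokens, and that the data files have names like `ca01` (the filter regex encodes that assumption). The class name `BrownTagCounter`, the default directory `brown`, and the default query word are all hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class BrownTagCounter {

    // word (lower-cased) -> tag -> number of occurrences
    private final Map<String, Map<String, Integer>> counts = new TreeMap<>();

    /** Adds every word/tag token found on one line of corpus text. */
    public void addLine(String line) {
        for (String token : line.trim().split("\\s+")) {
            int slash = token.lastIndexOf('/'); // tag follows the last slash
            if (slash <= 0 || slash == token.length() - 1) {
                continue; // skip empty or malformed tokens
            }
            String word = token.substring(0, slash).toLowerCase(Locale.ROOT);
            String tag = token.substring(slash + 1);
            counts.computeIfAbsent(word, w -> new TreeMap<>())
                  .merge(tag, 1, Integer::sum);
        }
    }

    /** Returns tag -> percentage of the word's occurrences carrying that tag. */
    public Map<String, Double> tagPercentages(String word) {
        Map<String, Integer> tags =
                counts.getOrDefault(word.toLowerCase(Locale.ROOT), new TreeMap<>());
        int total = tags.values().stream().mapToInt(Integer::intValue).sum();
        Map<String, Double> result = new TreeMap<>();
        for (Map.Entry<String, Integer> e : tags.entrySet()) {
            result.put(e.getKey(), 100.0 * e.getValue() / total);
        }
        return result; // empty if the word was never seen
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical location of the unzipped corpus; pass the real one as args[0].
        Path dir = Paths.get(args.length > 0 ? args[0] : "brown");
        BrownTagCounter counter = new BrownTagCounter();

        List<Path> files;
        try (Stream<Path> paths = Files.list(dir)) {
            files = paths.filter(Files::isRegularFile)
                         // Assumed naming scheme for data files (e.g. ca01);
                         // this skips README-style files in the same directory.
                         .filter(p -> p.getFileName().toString().matches("c[a-r]\\d\\d"))
                         .collect(Collectors.toList());
        }
        for (Path file : files) {
            // ISO-8859-1 decodes any byte sequence, so unexpected bytes cannot abort the read.
            for (String line : Files.readAllLines(file, StandardCharsets.ISO_8859_1)) {
                counter.addLine(line);
            }
        }
        System.out.println(counter.tagPercentages(args.length > 1 ? args[1] : "run"));
    }
}
```

Words are lower-cased before counting so that sentence-initial capitals do not split the tallies; drop that step if case matters for your analysis.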

alexis On 13 June 2015 at 20:10

If you don't want to mess with the NLTK interface: The Brown corpus has been deposited at the Internet Archive (archive.org). On https://archive.org/details/BrownCorpus you'll find a link to a zip archive containing the entire corpus. (Also a torrent link, but it doesn't seem worth the trouble for 3.2 MB.)
