How Can I Access the Brown Corpus in Java (aka outside of NLTK)


I'm trying to write a Java program that makes use of natural-language parts of speech. I've been searching on Google and haven't found the entire Brown Corpus (or another corpus of tagged words). I keep finding NLTK information, which I'm not interested in. I want to be able to load the data into a Java program and sum up occurrences of words (and work out, as a percentage, how likely each word is to be each part of speech).

I do not want to use a Java library like the Stanford one; I want to play with the corpus data myself.


There are 3 best solutions below

1
On BEST ANSWER

Here's a link to the download page for the Brown Corpus: http://www.nltk.org/nltk_data/

All the files are zip archives. The data format is described in the Wikipedia article on the Brown Corpus. There isn't much more to it; from there things should be straightforward.
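As a rough illustration of the counting the question asks about, here is a minimal Java sketch. It assumes the tagged files store whitespace-separated `word/tag` tokens (for example `The/at`), which is how the NLTK copy of the corpus appears to be laid out; check your download before relying on it. The class and method names (`BrownTagCounter`, `addLine`, `tagDistribution`) are made up for this example.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class BrownTagCounter {
    // lower-cased word -> (tag -> number of times the word carried that tag)
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    /** Adds every word/tag token found on one line of corpus text. */
    void addLine(String line) {
        for (String token : line.trim().split("\\s+")) {
            // Split on the LAST slash, since a word may itself contain '/'.
            int slash = token.lastIndexOf('/');
            if (slash <= 0 || slash == token.length() - 1) {
                continue; // empty or malformed token (assumed format not met)
            }
            String word = token.substring(0, slash).toLowerCase(Locale.ROOT);
            String tag = token.substring(slash + 1).toLowerCase(Locale.ROOT);
            counts.computeIfAbsent(word, k -> new TreeMap<>())
                  .merge(tag, 1, Integer::sum);
        }
    }

    /** Returns tag -> relative frequency for a word; empty if the word is unseen. */
    Map<String, Double> tagDistribution(String word) {
        Map<String, Integer> tags =
                counts.getOrDefault(word.toLowerCase(Locale.ROOT), Map.of());
        int total = tags.values().stream().mapToInt(Integer::intValue).sum();
        Map<String, Double> dist = new TreeMap<>();
        tags.forEach((tag, n) -> dist.put(tag, (double) n / total));
        return dist;
    }

    public static void main(String[] args) {
        BrownTagCounter counter = new BrownTagCounter();
        // Hand-written sample lines in the assumed format, not real corpus text.
        counter.addLine("The/at run/nn was/bedz long/jj ./.");
        counter.addLine("They/ppss run/vb daily/rb ./.");
        System.out.println(counter.tagDistribution("run")); // prints {nn=0.5, vb=0.5}
    }
}
```

Multiply the fractions by 100 if you want percentages. Lower-casing the words is a design choice; drop it if you want "The" and "the" counted separately.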

EDIT: if you want the original source texts, I think some corpora out there make theirs available. Usually, though, the point of a corpus is to let someone else do the sampling. Also, note this from the Wikipedia entry: "Each sample began at a random sentence-boundary in the article or other unit chosen, and continued up to the first sentence boundary after 2,000 words." So the samples in the Brown Corpus are essentially random excerpts. Even if you had the original texts, you might not be able to tell where each sample was taken from.

3
On

If you don't want to mess with the NLTK interface: The Brown corpus has been deposited at the Internet Archive (archive.org). On https://archive.org/details/BrownCorpus you'll find a link to a zip archive containing the entire corpus. (Also a torrent link, but it doesn't seem worth the trouble for 3.2 MB.)

4
On

Data is data. The NLTK data is not in an obscure, encrypted, or difficult format. Just write Java code to read it. You might find a shortcut in WEKA, or you might not.
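For example, a sketch along these lines walks an unzipped corpus folder and counts words. It assumes the corpus files have four-character names such as `ca01` sitting next to a few non-corpus files like a README, and that tokens look like `word/tag`; both are assumptions about the NLTK download, so adjust the glob and the parsing to what you actually find. `BrownWordCount` and `countWords` are hypothetical names.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class BrownWordCount {
    /** Counts lower-cased words across all files in dir whose names match the glob. */
    static Map<String, Integer> countWords(Path dir) throws IOException {
        Map<String, Integer> counts = new TreeMap<>();
        // Assumed naming scheme: 'c', a category letter, then two digits.
        try (DirectoryStream<Path> files =
                     Files.newDirectoryStream(dir, "c[a-r][0-9][0-9]")) {
            for (Path file : files) {
                // ISO-8859-1 maps every byte to a character, so decoding cannot fail.
                for (String line : Files.readAllLines(file, StandardCharsets.ISO_8859_1)) {
                    for (String token : line.trim().split("\\s+")) {
                        int slash = token.lastIndexOf('/');
                        if (slash > 0) { // skip blanks and tokens without a word part
                            String word = token.substring(0, slash).toLowerCase(Locale.ROOT);
                            counts.merge(word, 1, Integer::sum);
                        }
                    }
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // Pass the path of the unzipped corpus folder as the first argument.
        Map<String, Integer> counts = countWords(Paths.get(args[0]));
        System.out.println("the: " + counts.getOrDefault("the", 0));
    }
}
```

Nothing here depends on NLTK or any other library; it only uses `java.nio.file` from the standard library.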