Sample data or corpora for testing text processing functions?


I'm wondering if there are online sample texts that can be used for testing algorithms. For example, I'm writing a simple tokenization function and want to make sure it handles special cases such as mid-word punctuation ("don't", "O'Brien"), hyphens (for my purposes, "Sackville-Bagginses" should be a single token), international characters, etc.
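To make those special cases concrete, here is a minimal sketch of a regex tokenizer with a small hand-written edge-case table. The names `TOKEN_RE`, `tokenize`, `EDGE_CASES` and `failing_cases` are made up for illustration, and the rule "letters and digits joined by single apostrophes or hyphens" is only an assumption about what I want, not an established standard.

```python
import re
import unicodedata

# Assumed token rule: one or more letters/digits, optionally followed by
# groups of (a single apostrophe or hyphen + more letters/digits).
# [^\W_] means "any Unicode word character except underscore".
TOKEN_RE = re.compile(r"[^\W_]+(?:['’\-][^\W_]+)*")


def tokenize(text):
    """Split text into tokens, keeping mid-word apostrophes and hyphens."""
    # Normalize to NFC so accented letters are single code points;
    # combining marks on their own would not match the pattern.
    text = unicodedata.normalize("NFC", text)
    return TOKEN_RE.findall(text)


# Hand-written inputs and the tokens expected under the assumed rule.
EDGE_CASES = {
    "don't stop": ["don't", "stop"],
    "O'Brien arrived.": ["O'Brien", "arrived"],
    "the Sackville-Bagginses left": ["the", "Sackville-Bagginses", "left"],
    "naïve café, Straße": ["naïve", "café", "Straße"],
    "well - maybe": ["well", "maybe"],  # a free-standing dash is not a token
}


def failing_cases():
    """Return (text, expected, actual) for every edge case that fails."""
    return [
        (text, expected, tokenize(text))
        for text, expected in EDGE_CASES.items()
        if tokenize(text) != expected
    ]


if __name__ == "__main__":
    for text, expected, actual in failing_cases():
        print(f"FAIL {text!r}: expected {expected}, got {actual}")
```

A table like this only covers the cases I already thought of, which is why a ready-made corpus with real-world oddities would be more useful.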

Similarly, when writing other algorithms it would be nice to have documents at hand that are ideal for testing them, instead of having to either write my own or search Gutenberg for good sample texts.

Also useful would be text that could be used for testing things like spelling & grammar tools, etc.


There is 1 best solution below


There are a bunch of text corpora listed in this Wikipedia entry. Also, there are some good pointers in the NLTK corpora list. And you might want to check out the Google ngram datasets.