How to add a custom corpus to the local machine in nltk


I have created a custom corpus from data that I need to classify. The dataset is in the same format as the movie_reviews corpus. Following the nltk documentation, I use the code below to access movie_reviews. Is there any way to add a custom corpus to the nltk_data/corpora directory and access it the same way we access the existing corpora?

    import nltk
    from nltk.corpus import movie_reviews

    documents = [(list(movie_reviews.words(fileid)), category)
         for category in movie_reviews.categories()
         for fileid in movie_reviews.fileids(category)]

There are 3 best solutions below

0
On BEST ANSWER

While you could hack the nltk to make your corpus look like a built-in nltk corpus, that's the wrong way to go about it. The nltk provides a rich collection of "corpus readers" that you can use to read your corpora from wherever you keep them, without moving them to the nltk_data directory or hacking the nltk source. The nltk's own corpora use the same corpus readers behind the scenes, so your reader will have all the methods and behavior of equivalent built-in corpora.

Let's see how the movie_reviews corpus is defined in nltk/corpus/__init__.py:

movie_reviews = LazyCorpusLoader(
    'movie_reviews', CategorizedPlaintextCorpusReader,
    r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*',
    encoding='ascii')

You can ignore the LazyCorpusLoader part; it only defers loading a corpus until it is first used, which matters because the nltk defines many corpora that your program will most likely never touch. The rest shows that the movie review corpus is read with a CategorizedPlaintextCorpusReader, that its files all end in .txt, and that the reviews are sorted into categories by the subdirectory they sit in, pos or neg. Finally, the corpus encoding is ascii. So define your own corpus like this (changing values as needed):

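import nltk

mycorpus = nltk.corpus.reader.CategorizedPlaintextCorpusReader(
    r"/home/user/path/to/my_corpus",   # root directory of your corpus
    r'(?!\.).*\.txt',                  # fileids: .txt files whose path does not start with a dot
    cat_pattern=r'(neg|pos)/.*',       # category is the top-level subdirectory name
    encoding="ascii")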

That's it; you can now call mycorpus.words(), mycorpus.sents(categories="neg"), etc., just as if this was a corpus provided by the nltk.
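To try this without touching real data, here is a minimal sketch that builds a throwaway two-file corpus in a temporary directory and reads it back. The file names and review texts are made up for illustration, and the list comprehension is the one from the question with movie_reviews swapped for mycorpus. It sticks to words(), which uses a regex tokenizer by default; sents() would additionally need the punkt sentence tokenizer data to be installed.

```python
import tempfile
from pathlib import Path

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# Build a tiny corpus laid out like movie_reviews: one subdirectory per category.
# File names and texts are hypothetical.
root = Path(tempfile.mkdtemp())
samples = {
    "pos/review1.txt": "A wonderful film.",
    "neg/review2.txt": "A dull film.",
}
for relpath, text in samples.items():
    path = root / relpath
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="ascii")

mycorpus = CategorizedPlaintextCorpusReader(
    str(root),
    r"(?!\.).*\.txt",
    cat_pattern=r"(neg|pos)/.*",
    encoding="ascii",
)

# Same shape as the question's code, but reading from the custom corpus.
documents = [
    (list(mycorpus.words(fileid)), category)
    for category in mycorpus.categories()
    for fileid in mycorpus.fileids(category)
]

print(mycorpus.categories())
print(documents)
```

Each entry in documents is a (word list, category) pair, which is the input the classification code in the question expects.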

3
On

First put the actual data from your new corpus into your nltk_data/corpora/ directory. Then you have to edit the __init__.py file for nltk.corpus. You can find the path to this file by doing:

import nltk
print(nltk.corpus.__file__)

Open this file in a text editor and you will see that most of the file is creating LazyCorpusLoader objects and assigning them to global variables.

So for example, a section may look like:

....
verbnet = LazyCorpusLoader(
    'verbnet', VerbnetCorpusReader, r'(?!\.).*\.xml')
webtext = LazyCorpusLoader(
    'webtext', PlaintextCorpusReader, r'(?!README|\.).*\.txt', encoding='ISO-8859-2')
wordnet = LazyCorpusLoader(
    'wordnet', WordNetCorpusReader,
    LazyCorpusLoader('omw', CorpusReader, r'.*/wn-data-.*\.tab', encoding='utf8'))  
....

In order to add a new corpus you just have to add a new line to this file in the same format as the examples above. So if you have a corpus named movie_reviews and you have the data saved in nltk_data/corpora/movie_reviews then you would want to add a line like:

movie_reviews = LazyCorpusLoader('movie_reviews', .... )

Additional arguments for LazyCorpusLoader can be found in the docs here.

Then you just save this file and you should then be able to do:

from nltk.corpus import movie_reviews
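
A possible variation, sketched under some assumptions: the same LazyCorpusLoader line can live in your own module instead of the installed __init__.py, so an nltk upgrade does not wipe it out. The loader looks for corpora/<name> in every directory on nltk.data.path. In the sketch below a temporary directory stands in for nltk_data, and the corpus name my_reviews and its two sample files are hypothetical.

```python
import tempfile
from pathlib import Path

import nltk
from nltk.corpus.reader import CategorizedPlaintextCorpusReader
from nltk.corpus.util import LazyCorpusLoader

# Stand-in for an nltk_data directory (a real one is already on nltk.data.path).
data_dir = Path(tempfile.mkdtemp())
corpus_dir = data_dir / "corpora" / "my_reviews"  # hypothetical corpus name
for category, text in [("pos", "Loved it."), ("neg", "Hated it.")]:
    (corpus_dir / category).mkdir(parents=True)
    (corpus_dir / category / "sample.txt").write_text(text, encoding="ascii")

# Only needed here because the stand-in directory is not a standard location.
nltk.data.path.append(str(data_dir))

# Same format as the lines in nltk/corpus/__init__.py, but in your own code.
my_reviews = LazyCorpusLoader(
    "my_reviews", CategorizedPlaintextCorpusReader,
    r"(?!\.).*\.txt", cat_pattern=r"(neg|pos)/.*",
    encoding="ascii")

# The corpus is located and loaded on first attribute access.
print(my_reviews.categories())
print(my_reviews.fileids("pos"))
```

You would then import my_reviews from your own module rather than from nltk.corpus.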
0
On

OK, so I had a bit of a problem with the solution provided, and the easiest way that worked for me was to first create my folder and subfolders in the 'corpora' directory and then edit the __init__.py file.

In my case the corpus I created was vc and the subfolders were audio_them, audio_us, video_them, video_us:

vc = LazyCorpusLoader(
    'vc', CategorizedPlaintextCorpusReader,
    r'(?!\.).*\.txt',
    cat_pattern=r'(audio_them|audio_us|video_them|video_us)/.*',
    encoding="ascii")
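
If the categories do not come out as expected, it can help to check the two patterns on their own. The sketch below uses only the standard re module and assumes the category pattern is meant to end in /.* as in the movie_reviews definition. It approximates what the reader does (match each relative path from the start, take the first group as the category); the file names are made up.

```python
import re

fileid_pattern = r'(?!\.).*\.txt'
cat_pattern = r'(audio_them|audio_us|video_them|video_us)/.*'

# Hypothetical paths relative to the corpus root.
candidate_paths = [
    "audio_them/call01.txt",
    "video_us/clip07.txt",
    ".hidden.txt",        # rejected: path starts with a dot
    "audio_us/notes.md",  # rejected: does not contain .txt
]

# Keep only the paths accepted as fileids.
fileids = [p for p in candidate_paths if re.match(fileid_pattern, p)]

# The first capture group of cat_pattern names the category.
categories = {p: re.match(cat_pattern, p).group(1) for p in fileids}

print(fileids)
print(categories)
```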