How to train NLTK PunktSentenceTokenizer batchwise?


I am trying to split financial documents into sentences. I have ~50,000 documents containing plain English text; the total file size is ~2.6 GB.

I am using NLTK's PunktSentenceTokenizer with the standard English pickle file. I additionally tweaked it by providing extra abbreviations, but the results are still not accurate enough.
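
For reference, the tweak is roughly along these lines (a hedged sketch; the abbreviations are only examples, and _params is a private attribute of the tokenizer):

import nltk

# Load the stock English Punkt model shipped with NLTK.
tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")

# Punkt stores abbreviations lowercased and without the trailing period.
tokenizer._params.abbrev_types.update({"approx", "fig", "cf"})

sentences = tokenizer.tokenize("Revenue grew by approx. 5% in Q3. Margins fell.")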

Since NLTK's PunktSentenceTokenizer is based on the unsupervised algorithm by Kiss & Strunk (2006), I am trying to train the sentence tokenizer on my own documents, following the training data format for nltk punkt.

import nltk.tokenize.punkt
import pickle
import codecs

tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()
# Read the (concatenated) training corpus into memory in one go.
text = codecs.open("someplain.txt", "r", "utf8").read()
tokenizer.train(text)
# Persist the trained tokenizer for later use.
out = open("someplain.pk", "wb")
pickle.dump(tokenizer, out)
out.close()

Unfortunately, when running the code, I get an out-of-memory error (mainly because I first concatenated all the files into one big file).

Now my questions are:

  1. How can I train the algorithm batchwise, and would that lead to lower memory consumption?
  2. Can I use the standard English pickle file and do further training with that already trained object?

I am using Python 3.6 (Anaconda 5.2) on Windows 10, on a machine with a Core i7 2600K and 16 GB of RAM.

There are 2 answers below

BEST ANSWER

I found this question after running into this problem myself. I figured out how to train the tokenizer batchwise and am leaving this answer for anyone else looking to do the same. I was able to train a PunktSentenceTokenizer on roughly 200 GB of biomedical text in around 12 hours with a memory footprint that never exceeded 20 GB. Nevertheless, I'd like to second @colidyre's recommendation to prefer other tools over the PunktSentenceTokenizer in most situations.

There is a class PunktTrainer you can use to train the PunktSentenceTokenizer in a batchwise fashion.

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

Suppose we have a generator that yields a stream of training texts

texts = text_stream()

In my case, each iteration of the generator queries a database for 100,000 texts at a time, then yields all of these texts concatenated together.
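
If your data lives in files rather than a database (as in the question), a hypothetical text_stream() could be sketched like this (the folder name and batch size are placeholders):

from pathlib import Path

def text_stream(folder="documents", batch_size=1000):
    # Yield the plain-text files in batches, each batch concatenated into
    # one string, so only one batch is held in memory at a time.
    paths = sorted(Path(folder).glob("*.txt"))
    for i in range(0, len(paths), batch_size):
        batch = paths[i:i + batch_size]
        yield "\n\n".join(p.read_text(encoding="utf8") for p in batch)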

We can instantiate a PunktTrainer and then begin training

trainer = PunktTrainer()
for text in texts:
    trainer.train(text)
    trainer.freq_threshold()

Notice the call to the freq_threshold method after processing each text. This reduces the memory footprint by cleaning up information about rare tokens that are unlikely to influence future training.

Once this is complete, call the finalize_training method. Then you can instantiate a new tokenizer using the parameters found during training.

trainer.finalize_training()
tokenizer = PunktSentenceTokenizer(trainer.get_params())
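
You can then pickle the resulting tokenizer for later use, just as in the question (the file name is arbitrary):

import pickle

with open("financial_punkt.pk", "wb") as out:
    pickle.dump(tokenizer, out)

sentences = tokenizer.tokenize("Revenues rose by approx. 4%. Costs were flat.")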

@colidyre recommended using spaCy with added abbreviations. However, it can be difficult to know in advance which abbreviations will appear in your text domain. To get the best of both worlds, you can add the abbreviations found by Punkt. You can get a set of these abbreviations in the following way:

params = trainer.get_params()
abbreviations = params.abbrev_types
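
One way to carry these over to spaCy (an untested sketch, assuming a standard pipeline such as en_core_web_sm) is to register each abbreviation, with its trailing period, as a tokenizer special case, so that spaCy keeps it as a single token and is less likely to end a sentence there:

import spacy

nlp = spacy.load("en_core_web_sm")
for abbrev in abbreviations:
    # Punkt stores abbreviations lowercased without the final period, so add
    # it back; special cases are matched exactly as written.
    with_period = abbrev + "."
    nlp.tokenizer.add_special_case(with_period, [{"ORTH": with_period}])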
ANSWER

As described in the source code:

Punkt Sentence Tokenizer

This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. It must be trained on a large collection of plaintext in the target language before it can be used.

It is not very clear what a large collection really means. The paper gives no information about learning curves (i.e. when it is sufficient to stop the learning process because enough data has been seen). The Wall Street Journal corpus is mentioned there (it has approximately 30 million words). So it is unclear whether you can simply trim your training corpus and get a smaller memory footprint.

There is also an open issue on your topic that mentions needing 200 GB of RAM and more. As you can see there, NLTK probably does not have a good implementation of the algorithm presented by Kiss & Strunk (2006).

I cannot see how to batch it, as you can see in the signature of the train() method (NLTK version 3.3):

def train(self, train_text, verbose=False):
    """
    Derives parameters from a given training text, or uses the parameters
    given. Repeated calls to this method destroy previous parameters. For
    incremental training, instantiate a separate PunktTrainer instance.
    """

But there are probably more issues; e.g., if you compare the signature of the released 3.3 version with the git-tagged version 3.3, there is a new parameter, finalize, which might be helpful and indicates a possible batch process or a possible merge with an already trained model:

def train(self, text, verbose=False, finalize=True):
    """
    Collects training data from a given text. If finalize is True, it
    will determine all the parameters for sentence boundary detection. If
    not, this will be delayed until get_params() or finalize_training() is
    called. If verbose is True, abbreviations found will be listed.
    """

Anyway, I would strongly recommend not using NLTK's Punkt Sentence Tokenizer if you want to do sentence tokenization beyond playground level. Nevertheless, if you want to stick with that tokenizer, I would simply recommend also using the provided models and not training new models unless you have a server with a huge amount of RAM.
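
For completeness, using the given model is as simple as this (the language parameter is shown explicitly):

from nltk import sent_tokenize

sentences = sent_tokenize("The company reported a loss. It still paid a dividend.", language="english")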