I need to add begin-of-sentence and end-of-sentence markers to some texts that I analyze using Quanteda.
I would like to add these markers using Quanteda itself, but I do not see an explicit way to do that "out of the box".
While searching for an answer I found a different question about quanteda and these markers here. Another question about markers here strengthens my guess that this task is usually done "manually".

My question is: what is currently the best way to add such markers using Quanteda, and what advantages ("NLP intelligence"?) and disadvantages (lower speed, higher memory use) would it have compared to doing this in custom code?

I am mostly interested in the general answer, but any additional advice about the specifics of my case is most welcome. They are:

  • Text size: very large. For instance, when I tried to segment the texts into sentences, Quanteda was still running after 2-3 hours and I had to kill the session every time.

  • I would like to use Quanteda, but not at all costs. I am comfortable coding in R, Python, and Java and with regexes, and if other reasonably small packages bring relevant advantages I have no problem learning and using them for this task (text2vec?).


    Sample of input and desired output.
    Using "sss" and "eee" as begin and end sentence markers:
    input:
    CENTERS FOR DISEASE CONTROL AND PREVENTION (CDC). Outbreak of influenza A in a nursing home - New York, Dec. 1991-Jan. 1992. MMWR Morb Mortal Wkly Rep 1992; 18: 129-31.
    desired output:
    sss CENTERS FOR DISEASE CONTROL AND PREVENTION (CDC) eee sss Outbreak of influenza A in a nursing home - New York, Dec. 1991-Jan. 1992 eee sss MMWR Morb Mortal Wkly Rep 1992; 18: 129-31 eee
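
For reference, this is roughly the custom-code baseline I would compare against: a minimal Python sketch, not a real sentence segmenter. It assumes a sentence boundary is a run of `.`, `!` or `?` followed by whitespace and an uppercase ASCII letter, which happens to work for the sample above ("Dec. 1991" and "Jan. 1992" are not split because a digit follows the period). The function name `add_sentence_markers` and its parameters are just names I made up for illustration.

```python
import re

# Heuristic boundary (an assumption, not a linguistic rule): sentence-final
# punctuation followed by whitespace and then an uppercase ASCII letter.
# The lookahead keeps the uppercase letter as part of the next sentence.
_BOUNDARY = re.compile(r"[.!?]+\s+(?=[A-Z])")


def add_sentence_markers(text, begin="sss", end="eee"):
    """Wrap each heuristically detected sentence in begin/end markers."""
    parts = _BOUNDARY.split(text.strip())
    # Drop the trailing punctuation left on the last sentence.
    sentences = [p.rstrip(".!?").strip() for p in parts]
    return " ".join(f"{begin} {s} {end}" for s in sentences if s)


if __name__ == "__main__":
    sample = (
        "CENTERS FOR DISEASE CONTROL AND PREVENTION (CDC). "
        "Outbreak of influenza A in a nursing home - New York, "
        "Dec. 1991-Jan. 1992. MMWR Morb Mortal Wkly Rep 1992; 18: 129-31."
    )
    print(add_sentence_markers(sample))
```

This is fast and can be applied line by line or chunk by chunk, so memory should not be an issue, but it has no "NLP intelligence": an abbreviation followed by a capitalized word (for example "Dr. Smith") would be split incorrectly, which is exactly the kind of case where I would hope a library does better.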
