Split multi-paragraph documents into paragraph-numbered sentences


I have a list of well-parsed, multi-paragraph documents (all paragraphs separated by \n\n and sentences separated by ".") that I'd like to split into sentences, together with a number indicating the paragraph number within the document. For example, the (two paragraph) input is:

First sentence of the 1st paragraph. Second sentence of the 1st paragraph. \n\n 

First sentence of the 2nd paragraph. Second sentence of the 2nd paragraph. \n\n

Ideally the output should be:

1 First sentence of the 1st paragraph. 

1 Second sentence of the 1st paragraph. 

2 First sentence of the 2nd paragraph.

2 Second sentence of the 2nd paragraph.

I'm familiar with the Lingua::Sentence module in Perl, which can split documents into sentences. However, it does not number paragraphs. I'm therefore wondering whether there is an alternative way to achieve the above (the documents contain no abbreviations). Any help is greatly appreciated. Thanks!


There are 2 best solutions below

On BEST ANSWER

Since you mentioned Lingua::Sentence, one option is to post-process the output of that module a little to get what you need:

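use strict;
use warnings;
use Lingua::Sentence;

# Assumption: the whole document arrives on STDIN; slurp it into $text.
my $text = do { local $/; <STDIN> };
my $splitter = Lingua::Sentence->new("en");

# split() returns one sentence per line; this relies on blank lines between paragraphs surviving.
my @paragraphs = split /\n{2,}/, $splitter->split($text);

foreach my $index (0..$#paragraphs) {
    # Prefix every sentence with its 1-based paragraph number.
    my $paragraph = join "\n\n", map { ($index + 1) . " $_" }
        split /\n/, $paragraphs[$index];
    print "$paragraph\n\n";
}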
On

If you can rely on period . being the delimiter, you can do this:

perl -00 -nlwe 'print qq($. $_) for split /(?<=\.)/' yourfile.txt

Explanation:

  • -00 sets the input record separator to the empty string, which is paragraph mode.
  • -l chomps each input record and sets the output record separator to the input record separator, which in paragraph mode translates to two newlines.

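Then we simply split after each period with a lookbehind assertion and print the sentences, each preceded by $., the current record number. In paragraph mode every record is one paragraph, so $. is the paragraph number.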
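For use inside a larger script, the same idea can be written out longhand. This is only a sketch under a few assumptions: number_sentences is a hypothetical helper name, the sample text is the example from the question, and the split pattern additionally swallows the whitespace after each period (the one-liner does not), so sentences come out without a leading space.

```perl
use strict;
use warnings;

# Hypothetical helper: reads paragraphs from a filehandle and returns
# "N sentence" strings, where N is the 1-based paragraph number.
sub number_sentences {
    my ($fh) = @_;
    local $/ = "";    # paragraph mode, the in-script equivalent of -00
    my @numbered;
    while (my $paragraph = <$fh>) {
        chomp $paragraph;    # in paragraph mode this strips all trailing newlines
        my $n = $.;          # record number of the paragraph just read
        # Split after each period and drop the whitespace that follows it.
        for my $sentence (split /(?<=\.)\s*/, $paragraph) {
            $sentence =~ s/^\s+//;
            push @numbered, "$n $sentence" if length $sentence;
        }
    }
    return @numbered;
}

# Sample input from the question, read through an in-memory filehandle.
my $text = "First sentence of the 1st paragraph. Second sentence of the 1st paragraph.\n\n"
         . "First sentence of the 2nd paragraph. Second sentence of the 2nd paragraph.\n\n";
open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
print "$_\n\n" for number_sentences($fh);
close $fh;
```

As with the one-liner, this only holds up while a period always ends a sentence; a decimal such as 3.14 or an abbreviation would be split as well.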