How to process text from Project Gutenberg?

I'm using C#. I was given a task that requires processing .txt files of books from Project Gutenberg. Here is an excerpt from the task:

"Each file should be parsed into: sentences; words; punctuation. For each file you should generate a new file. The name of that file is the name of the book. Each of those files should contain: the longest sentence by number of characters; the shortest sentence by number of words; the longest word; the most common letter; the words sorted by number of uses in descending order."

How do I omit tables of contents, chapter titles, and other non-sentence elements? The program uses Stanford NLP to separate sentences into words.
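Setting the filtering question aside for a moment, the statistics in the task can be sketched with plain LINQ once sentences and words are available as strings. This is only a sketch under assumptions: the class name `BookStats` and its method names are made up for illustration, a "word" inside a sentence is taken to be anything separated by spaces, and word counting is case-insensitive. None of this comes from the task text.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper; names are illustrative, not part of the task or of Stanford NLP.
public static class BookStats
{
    // Longest sentence measured in characters (first one wins on ties).
    public static string LongestSentenceByChars(IEnumerable<string> sentences) =>
        sentences.OrderByDescending(s => s.Length).First();

    // Shortest sentence measured in space-separated words (assumption: spaces delimit words).
    public static string ShortestSentenceByWords(IEnumerable<string> sentences) =>
        sentences.OrderBy(s => WordCount(s)).First();

    private static int WordCount(string sentence) =>
        sentence.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries).Length;

    // Longest word by character count.
    public static string LongestWord(IEnumerable<string> words) =>
        words.OrderByDescending(w => w.Length).First();

    // Most common letter, ignoring case and skipping non-letters.
    public static char MostCommonLetter(string text) =>
        text.Where(c => char.IsLetter(c))
            .Select(c => char.ToLowerInvariant(c))
            .GroupBy(c => c)
            .OrderByDescending(g => g.Count())
            .First()
            .Key;

    // Words with their counts, most frequent first; ties are ordered alphabetically.
    public static List<KeyValuePair<string, int>> WordsByFrequency(IEnumerable<string> words) =>
        words.GroupBy(w => w.ToLowerInvariant())
             .Select(g => new KeyValuePair<string, int>(g.Key, g.Count()))
             .OrderByDescending(p => p.Value)
             .ThenBy(p => p.Key, StringComparer.Ordinal)
             .ToList();
}
```

All of these throw on empty input because of `First()`, so a real program would need to handle a book that yields no sentences.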

I installed Stanford NLP, but it often treats tables of contents, chapter titles, and other phrases that are not actual sentences as sentences.
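One possible direction is to filter the text before it reaches the sentence splitter. The sketch below is a heuristic, not a Stanford NLP feature. It assumes the files contain the usual `*** START OF ...` and `*** END OF ...` marker lines, and that paragraphs are separated by blank lines. The class `GutenbergFilter`, its method names, and the specific rules are all made up for illustration. A paragraph is dropped if it does not end in sentence punctuation, is entirely upper-case, or is a short paragraph that opens with a heading keyword followed by a number.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical pre-filter; every rule here is a heuristic and will misclassify some text.
public static class GutenbergFilter
{
    // Marker lines such as "*** START OF THE PROJECT GUTENBERG EBOOK ... ***" (assumed present).
    private static readonly Regex StartMarker =
        new Regex(@"^\*\*\*\s*START OF.*$", RegexOptions.Multiline | RegexOptions.IgnoreCase);
    private static readonly Regex EndMarker =
        new Regex(@"^\*\*\*\s*END OF.*$", RegexOptions.Multiline | RegexOptions.IgnoreCase);

    // "Chapter 3", "BOOK IV", "Part 2" at the start of a paragraph.
    private static readonly Regex Heading =
        new Regex(@"^(chapter|book|part|volume|section)\s+([ivxlcdm]+|\d+)\b", RegexOptions.IgnoreCase);

    // Keeps only the text between the start and end markers, if they are found.
    public static string StripBoilerplate(string text)
    {
        text = text.Replace("\r\n", "\n");
        Match start = StartMarker.Match(text);
        if (start.Success) text = text.Substring(start.Index + start.Length);
        Match end = EndMarker.Match(text);
        if (end.Success) text = text.Substring(0, end.Index);
        return text;
    }

    // Splits on blank lines, joins wrapped lines, and yields paragraphs that look like prose.
    public static IEnumerable<string> ProseParagraphs(string body)
    {
        foreach (string block in Regex.Split(body, @"\n\s*\n"))
        {
            string paragraph = Regex.Replace(block, @"\s+", " ").Trim();
            if (paragraph.Length > 0 && LooksLikeProse(paragraph))
                yield return paragraph;
        }
    }

    public static bool LooksLikeProse(string paragraph)
    {
        // Ignore closing quotes and brackets when checking the final character.
        string t = paragraph.TrimEnd('"', '\'', '\u201D', '\u2019', ')', ']', '_', ' ');
        if (t.Length == 0) return false;

        // Titles and table-of-contents blocks rarely end in sentence punctuation.
        if (".!?".IndexOf(t[t.Length - 1]) < 0) return false;

        // All-caps paragraphs such as "CHAPTER I." are treated as headings.
        if (!paragraph.Any(char.IsLower)) return false;

        // Short paragraphs that open like "Chapter 3" are treated as headings.
        int wordCount = paragraph.Split(' ').Length;
        if (wordCount <= 6 && Heading.IsMatch(paragraph)) return false;

        return true;
    }
}
```

The surviving paragraphs would then be passed to the sentence splitter. The rules will sometimes be wrong in both directions: a short real sentence such as "Part I was dull." would be dropped, and a chapter title that ends with a period and contains lower-case letters would be kept. The thresholds would likely need tuning per book.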
