I am using both R and Python and am trying to learn text-based analytics and NLP to some extent.
Question: How do I split a string that is several sentences run together without delimiters, like the one below?
Sentence = I like the application i like the system i do not like the process being followed.
I want to split this sentence into
- I like the application
- i like the system
- i do not like the process being followed
Note: I am able to split a sentence like the one below, as it has a period (.) to indicate the end of each sentence.
Sentence = I like the application. I like the system. I do not like the process being followed.
Vj
I can propose an approach that may help. Since you don't have sentence delimiters, you can proceed as follows:
Apply syntactic analysis (part-of-speech tagging) to extract the syntactic structure of the paragraph.
Example: I like the application i like the system i do not like the process being followed
will produce: PP VB DT NN...
To perform the syntactic analysis, I recommend using the Stanford Parser. In the tag sequence above:
PP: Personal Pronoun
VB: VerB
DT: DeTerminer
NN: NouN
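As a rough illustration of the tagging step, here is a minimal Python sketch. It does not call the Stanford Parser: TOY_LEXICON and tag() are hypothetical stand-ins made up so the example runs without installing anything, and a real tagger would replace the dictionary lookup. The tags follow the simplified labels listed above; RB for "not" and VB for "being" and "followed" are my own simplifying assumptions.

```python
# Hypothetical toy lexicon standing in for a real tagger (e.g. the Stanford Parser).
# Tags use the simplified labels from the answer; RB (adverb) is an added assumption.
TOY_LEXICON = {
    "i": "PP",
    "like": "VB",
    "do": "VB",
    "not": "RB",
    "the": "DT",
    "application": "NN",
    "system": "NN",
    "process": "NN",
    "being": "VB",      # simplification: a real tagger would be more specific
    "followed": "VB",   # simplification: a real tagger would be more specific
}

def tag(text):
    """Return (word, tag) pairs; words missing from the lexicon get 'UNK'."""
    words = text.rstrip(".").split()
    return [(w, TOY_LEXICON.get(w.lower(), "UNK")) for w in words]

tagged = tag("I like the application i like the system")
print(" ".join(t for _, t in tagged))
# prints: PP VB DT NN PP VB DT NN
```

Notice that the same four-tag pattern shows up twice in a row, which is the regularity the rest of the approach relies on.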
You can see that a sentence has a syntactic pattern that can be used to split a paragraph into sentences.
Build a model of the possible syntactic patterns of a sentence. By a model I mean a file or database that contains the syntactic structure of known sentences.
Example: a model can contain the following lines:
PP VB DT NN --> (I eat an apple)
VB ADJ NN --> (create new methods)
To construct your model you can analyze many sentences (the larger your set of sentences, the more accurate your system will be). You can use a corpus that you build yourself.
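One possible way to hold such a model in memory is sketched below, assuming the sentences have already been tagged. The build_model() function and the tiny corpus are hypothetical; the model is simply a count of how often each tag sequence occurs, which could later be saved to a file or database.

```python
from collections import Counter

def build_model(tagged_sentences):
    """Count how often each tag sequence occurs in a corpus of tagged sentences."""
    model = Counter()
    for sentence in tagged_sentences:
        pattern = tuple(tag for _, tag in sentence)
        model[pattern] += 1
    return model

# Hypothetical hand-tagged corpus; a real one would be much larger.
corpus = [
    [("I", "PP"), ("eat", "VB"), ("an", "DT"), ("apple", "NN")],
    [("I", "PP"), ("like", "VB"), ("the", "DT"), ("system", "NN")],
    [("create", "VB"), ("new", "ADJ"), ("methods", "NN")],
]

model = build_model(corpus)
for pattern, count in model.most_common():
    print(" ".join(pattern), "-->", count)
```

Keeping the counts lets you prefer frequent patterns later if two candidates fit equally well.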
Once you have built your model, you can start writing your program. The main steps of your algorithm will be:
1- Receive the input paragraph (as an input or file).
2- Apply Stanford Parser to produce the syntactic tree of the paragraph.
3- Split the paragraph by comparing parts of it with the previously constructed syntactic patterns (your sentence model).
To do this, you will need to measure the similarity between a part of the paragraph and each pattern in the sentence model.
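The steps above could be sketched roughly as follows. This is only one possible reading: the greedy left-to-right matching, the 0.8 threshold, and the use of difflib.SequenceMatcher as the similarity measure are all my assumptions, and the tagged input and the two patterns are written by hand instead of coming from a real parser or corpus.

```python
from difflib import SequenceMatcher

def similarity(tags, pattern):
    """Similarity in [0, 1] between two tag sequences."""
    return SequenceMatcher(None, tags, pattern).ratio()

def split_by_patterns(tagged, patterns, threshold=0.8):
    """Greedily cut the tagged text wherever a known pattern fits best.

    `patterns` must be a non-empty collection of tag tuples.
    """
    words = [w for w, _ in tagged]
    tags = [t for _, t in tagged]
    sentences, i = [], 0
    while i < len(tags):
        # Score each pattern against the window of the same length starting at i.
        scored = [(similarity(tags[i:i + len(p)], list(p)), len(p)) for p in patterns]
        score, length = max(scored)  # best score; ties go to the longer pattern
        if score >= threshold:
            sentences.append(" ".join(words[i:i + length]))
            i += length
        elif sentences:
            # No pattern fits: attach the word to the previous sentence.
            sentences[-1] += " " + words[i]
            i += 1
        else:
            sentences.append(words[i])
            i += 1
    return sentences

# Hand-written tags for the example sentence (a parser would produce these).
words = "I like the application i like the system i do not like the process being followed".split()
tags = "PP VB DT NN PP VB DT NN PP VB RB VB DT NN VB VB".split()
patterns = [
    ("PP", "VB", "DT", "NN"),
    ("PP", "VB", "RB", "VB", "DT", "NN", "VB", "VB"),
]

for sentence in split_by_patterns(list(zip(words, tags)), patterns):
    print("-", sentence)
# prints:
# - I like the application
# - i like the system
# - i do not like the process being followed
```

A greedy pass like this can cut in the wrong place when patterns overlap, so with a larger model you would probably want to compare several candidate splits before committing to one.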
This is only an outline of an approach to what you want to do.
You will probably need to work with NLTK (Natural Language Toolkit).