I am trying to split sentences into clauses: main clause + subordinate clause. I take a clause to be a grammatical unit that contains a verb. I used the Stanford parser to extract the parse trees of the sentences, then extracted all branches that contain a VP sub-branch. The branches are selected by their labels: ["S", "SBAR", "SBARQ", "SINV", "SQ"]. However, I do not know how to split the sentences after that, because the branches containing a VP are often nested, and the extracted branches only cover the subordinate clauses, not the main clause. I want to keep every component of the sentence unchanged and just split it into separate clauses, with no text duplicated across clauses. Here are some examples of the output I am after:
-raw sentence: "While it has been suggested that JC virus (JCV) migrates in B-lymphocytes from the kidney to the central nervous system where it initiates demyelination, this phase of JCV pathogenesis has not been systematically explored." -clauses: (1)"While it has been suggested that JC virus (JCV) migrates in B-lymphocytes from the kidney to the central nervous system where it initiates demyelination" (2)", this phase of JCV pathogenesis has not been systematically explored."
-raw sentence: "To investigate whether in situ T cell growth plays a relevant role in the pooling of CD8+ lymphocytes, we have analyzed the activity of two lymphokines involved in the mechanisms of T cell proliferation, i.e., interleukin-2 (IL-2) and interleukin-4." -clauses: (1)"To investigate whether in situ T cell growth plays a relevant role in the pooling of CD8+ lymphocytes" (2)", we have analyzed the activity of two lymphokines involved in the mechanisms of T cell proliferation, i.e., interleukin-2 (IL-2) and interleukin-4."
-raw sentence: "However, a subgroup of patients who potently suppressed viremia independently of STI had significantly higher pre-existing neutralization titers, suggesting a role of humoral immunity in conferring potent protection." -clauses: (1)"However, a subgroup of patients who potently suppressed viremia independently of STI had significantly higher pre-existing neutralization titers" (2)", suggesting a role of humoral immunity in conferring potent protection."
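The requirement behind these examples can be stated as a small check: joining the clauses in order should rebuild the raw sentence exactly, so nothing is dropped, changed, or duplicated. This is only a sketch of the goal, not of a solution; the helper name `is_lossless_split` is hypothetical, and comparing raw strings is an assumption (tokens joined with spaces from a parse tree would need a token-level comparison instead).

```python
def is_lossless_split(sentence, clauses):
    """Return True if the clauses, joined in order, rebuild the sentence exactly."""
    return "".join(clauses) == sentence


# Third example from above, copied verbatim.
sentence = (
    "However, a subgroup of patients who potently suppressed viremia "
    "independently of STI had significantly higher pre-existing "
    "neutralization titers, suggesting a role of humoral immunity in "
    "conferring potent protection."
)

clauses = [
    "However, a subgroup of patients who potently suppressed viremia "
    "independently of STI had significantly higher pre-existing "
    "neutralization titers",
    ", suggesting a role of humoral immunity in conferring potent protection.",
]

# The desired split passes the check.
print(is_lossless_split(sentence, clauses))

# A split that repeats a clause (as nested branches do) fails it.
print(is_lossless_split(sentence, [clauses[0], clauses[0]]))
```

Any candidate splitting method could be run against the three examples with this check before looking at the clause boundaries themselves.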
The following is my code:
from nltk.tree import ParentedTree
from pycorenlp import StanfordCoreNLP

# A CoreNLP server must already be running on this port.
nlp = StanfordCoreNLP("http://localhost:9011/")
clause_level_list = ["S", "SBAR", "SBARQ", "SINV", "SQ"]
sent = "While it has been suggested that JC virus (JCV) migrates in B-lymphocytes from the kidney to the central nervous system where it initiates demyelination, this phase of JCV pathogenesis has not been systematically explored."
parser = nlp.annotate(sent, properties={"annotators": "parse", "outputFormat": "json"})
# ParentedTree is needed so that each subtree can look up its parent.
sent_tree = ParentedTree.fromstring(parser["sentences"][0]["parse"])
sent_tree.pretty_print()
subtexts = []
for subtree in sent_tree.subtrees():
    parent = subtree.parent()
    # Skip ROOT and the top-level clause directly below it (the whole sentence).
    if parent is None or parent.label() == 'ROOT':
        continue
    # Keep clause-level branches that contain a VP somewhere below them.
    if subtree.label() in clause_level_list and any(child.label() == 'VP' for child in subtree.subtrees()):
        subtexts.append(' '.join(subtree.leaves()))
for s in subtexts:
    print(s)
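To reproduce the nesting problem without a running CoreNLP server, here is a minimal sketch that applies the same filter to a hand-written toy parse. The bracketed tree and the names `toy_parse` and `clause_subtrees` are assumptions made for illustration; the tree is not real parser output.

```python
from nltk.tree import ParentedTree

CLAUSE_LABELS = {"S", "SBAR", "SBARQ", "SINV", "SQ"}

# Hand-written toy parse (an assumption, not real CoreNLP output).
toy_parse = """
(ROOT
  (S
    (SBAR (IN While)
      (S (NP (PRP it)) (VP (VBZ rains))))
    (, ,)
    (NP (PRP we))
    (VP (VBP stay) (ADVP (RB inside)))
    (. .)))
"""


def clause_subtrees(tree):
    """Collect the text of clause-level branches that contain a VP,
    skipping ROOT and the top-level clause directly below it."""
    found = []
    for st in tree.subtrees():
        parent = st.parent()
        if parent is None or parent.label() == "ROOT":
            continue
        has_vp = any(t.label() == "VP" for t in st.subtrees())
        if st.label() in CLAUSE_LABELS and has_vp:
            found.append(" ".join(st.leaves()))
    return found


tree = ParentedTree.fromstring(toy_parse)
clauses = clause_subtrees(tree)
for c in clauses:
    print(c)
# Prints:
# While it rains
# it rains
```

The inner S ("it rains") is repeated inside the SBAR ("While it rains"), and the main clause ", we stay inside ." never appears on its own, which is exactly the situation I do not know how to resolve.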
Thank you in advance! I would greatly appreciate it if somebody could help me solve this problem, since it is very important for my doctoral thesis.