Conversion of Text sentences to CONLL Format

2.2k Views Asked by At

I want to convert the Normal english text into CONLL-U format for maltparser for finding dependency in the text in Python. I tried in java but was failed to do so, below is the format I'm looking for-

String[] tokens = new String[11];
tokens[0] = "1\thappiness\t_\tN\tNN\tDD|SS\t2\tSS";
tokens[1] = "2\tis\t_\tV\tVV\tPS|SM\t0\tROOT";
tokens[2] = "3\tthe\t_\tAB\tAB\tKS\t2\t+A";
tokens[3] = "4\tkey\t_\tPR\tPR\t_\t2\tAA";
tokens[4] = "5\tof\t_\tN\tEN\t_\t7\tDT";
tokens[5] = "6\tsuccess\t_\tP\tTP\tPA\t7\tAT";
tokens[6] = "7\tin\t_\tN\tNN\t_\t4\tPA";
tokens[7] = "8\tthis\t_\tPR\tPR\t_\t7\tET";
tokens[8] = "9\tlife\t_\tR\tRO\t_\t10\tDT";
tokens[9] = "10\tfor\t_\tN\tNN\t_\t8\tPA";
tokens[10] = "11\tsure\t_\tP\tIP\t_\t2\tIP";

I have tried in java but I can not use the standford APIs, I want the same in python.

//This is the example of java code but here the tokens which is created needs to be parsed via code not manually-

MaltParserService service =  new MaltParserService(true);
// in the CoNLL data format.
String[] tokens = new String[11];
tokens[0] = "1\thappiness\t_\tN\tNN\tDD|SS\t2\tSS";
tokens[1] = "2\tis\t_\tV\tVV\tPS|SM\t0\tROOT";
tokens[2] = "3\tthe\t_\tAB\tAB\tKS\t2\t+A";
tokens[3] = "4\tkey\t_\tPR\tPR\t_\t2\tAA";
tokens[4] = "5\tof\t_\tN\tEN\t_\t7\tDT";
tokens[5] = "6\tsuccess\t_\tP\tTP\tPA\t7\tAT";
tokens[6] = "7\tin\t_\tN\tNN\t_\t4\tPA";
tokens[7] = "8\tthis\t_\tPR\tPR\t_\t7\tET";
tokens[8] = "9\tlife\t_\tR\tRO\t_\t10\tDT";
tokens[9] = "10\tfor\t_\tN\tNN\t_\t8\tPA";
tokens[10] = "11\tsure\t_\tP\tIP\t_\t2\tIP";
// Print out the string array
for (int i = 0; i < tokens.length; i++) {
    System.out.println(tokens[i]);
}
// Reads the data format specification file
DataFormatSpecification dataFormatSpecification = service.readDataFormatSpecification(args[0]);
// Use the data format specification file to build a dependency structure based on the string array
DependencyStructure graph = service.toDependencyStructure(tokens, dataFormatSpecification);
// Print the dependency structure
System.out.println(graph);
1

There are 1 best solutions below

0
On

There is now a port of the Standford library to Python (with improvements) called Stanza. You can find it here: https://stanfordnlp.github.io/stanza/

Example of usage:

>>> import stanza
>>> stanza.download('en') # download English model
>>> nlp = stanza.Pipeline('en') # initialize English neural pipeline
>>> doc = nlp("Barack Obama was born in Hawaii.") # run annotation over a sentence