NLTK chunked parse tree: saving it to a file and loading it with a CorpusReader class


Let's say I have a chunked corpus like the one below, saved in a file called test.txt:

[Rapunzel/NNP] let/VBD down/RP [her/PP$ long/JJ golden/JJ hair/NN]

Then I can load it with ChunkedCorpusReader:

>>> from nltk.corpus.reader import ChunkedCorpusReader
>>> reader = ChunkedCorpusReader('.','test.txt')
>>> reader.chunked_sents()[0]
Tree('S', [Tree('NP', [('Rapunzel', 'NNP')]), ('let', 'VBD'), ('down', 'RP'), Tree('NP', [('her', 'PP$'), ('long', 'JJ'), ('golden', 'JJ'), ('hair', 'NN')])])
>>> print(reader.chunked_sents()[0])
(S
  (NP Rapunzel/NNP)
  let/VBD
  down/RP
  (NP her/PP$ long/JJ golden/JJ hair/NN))

Then I made a change to the Tree object, say, switched the chunk label from NP to NPP, and called the result new.

>>> print(new)
(S
  (NPP Rapunzel/NNP)
  let/VBD
  down/RP
  (NPP her/PP$ long/JJ golden/JJ hair/NN))

Now I want to save this new Tree to a file and load it with ChunkedCorpusReader or any other reader, as I did with test.txt. However, I couldn't find a way to save an NLTK Tree object to a file, or to read it back from one. Can anyone help?

Best answer

The default conversion to string, which print gave you, is not bad: it merges words with their POS tags and indents new lines properly. Since file.write() doesn't automatically convert its argument to a string, you must pass str(new) to the file's write method.
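As a minimal sketch of that round trip: the tree is written with str() and read back with Tree.fromstring(). The file name new.txt and the temporary directory are assumptions made for illustration, and using nltk.tag.str2tuple as the read_leaf callback is one possible way to turn each word/TAG leaf back into a (word, tag) tuple; it is not something the question or answer prescribes.

```python
import os
import tempfile

from nltk.tag import str2tuple
from nltk.tree import Tree

# Rebuild the modified tree from the question (chunk label NPP instead of NP).
new = Tree('S', [
    Tree('NPP', [('Rapunzel', 'NNP')]),
    ('let', 'VBD'),
    ('down', 'RP'),
    Tree('NPP', [('her', 'PP$'), ('long', 'JJ'), ('golden', 'JJ'), ('hair', 'NN')]),
])

# Hypothetical output location, chosen only for this example.
path = os.path.join(tempfile.mkdtemp(), 'new.txt')

# file.write() needs a string, so convert the tree explicitly.
with open(path, 'w', encoding='utf-8') as f:
    f.write(str(new))

# Parse the bracketed string back into a Tree. Without read_leaf the
# leaves would stay plain strings such as 'let/VBD'.
with open(path, encoding='utf-8') as f:
    loaded = Tree.fromstring(f.read(), read_leaf=str2tuple)

print(loaded == new)
```

Note that str2tuple upper-cases the tag, which is harmless here because the tags are already upper case.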

For more control over the appearance of the tree's string representation, use the tree method pformat(). Note that Tree.pformat() was called Tree.pprint() in earlier versions of NLTK; in the latest version, Tree.pformat() returns a string while Tree.pprint() writes to stdout.
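One use of that extra control, sketched here as an assumption rather than as part of the original answer: passing a large margin to pformat() keeps each tree on a single line, so several trees can be stored one per line and parsed back line by line. The list of two identical trees is made up purely for the demonstration.

```python
from nltk.tag import str2tuple
from nltk.tree import Tree

new = Tree('S', [
    Tree('NPP', [('Rapunzel', 'NNP')]),
    ('let', 'VBD'),
    ('down', 'RP'),
    Tree('NPP', [('her', 'PP$'), ('long', 'JJ'), ('golden', 'JJ'), ('hair', 'NN')]),
])

# A margin wider than the flat representation makes pformat() return
# the whole tree on one line instead of wrapping it.
one_line = new.pformat(margin=10**6)
print(one_line)

# Illustrative "corpus" of two trees, serialized one tree per line.
text = "\n".join(t.pformat(margin=10**6) for t in [new, new])

# Each line is a complete bracketed tree, so it can be parsed on its own.
trees = [Tree.fromstring(line, read_leaf=str2tuple) for line in text.splitlines()]
print(len(trees))
```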

If you want your tree to be delimited by square brackets, add the option parens="[]" to pformat().

>>> print(new.pformat(parens="[]"))
[S
  [NPP Rapunzel/NNP]
  let/VBD
  down/RP
  [NPP her/PP$ long/JJ golden/JJ hair/NN]]
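If the square-bracket form is what gets saved, it can be read back by telling Tree.fromstring() which brackets to expect. This is only a sketch: the string below stands in for the contents of a saved file, and str2tuple is again an assumed choice for restoring (word, tag) leaves.

```python
from nltk.tag import str2tuple
from nltk.tree import Tree

new = Tree('S', [
    Tree('NPP', [('Rapunzel', 'NNP')]),
    ('let', 'VBD'),
    ('down', 'RP'),
    Tree('NPP', [('her', 'PP$'), ('long', 'JJ'), ('golden', 'JJ'), ('hair', 'NN')]),
])

# Stand-in for the text that would be written to, and read from, a file.
bracketed = new.pformat(parens="[]")

# brackets="[]" must match the parens used when the tree was formatted.
restored = Tree.fromstring(bracketed, brackets="[]", read_leaf=str2tuple)
print(restored == new)
```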