Currently i've a bunch of .txtfiles. within each .txt files, each sentence is separated by newline. how do i change it to the IMS CWB format so that it's readable by CWB? and also to nltk format.
Can someone lead me to a howto page to do that? or is there a guide page to do that, i've tried reading through the manual but i dont really know. www.cwb.sourceforge.net/files/CWB_Encoding_Tutorial.pdf
Does it mean i create a data and registry directory and then i run the cwb-encode command and it will be all converted to vrt file? does it convert one file at a time? how do i script it to run through multiple file in a directory?
It's easy to produce cwb's "verticalized" format from an NLTK-readable corpus:
From there, you can follow the instructions on the CWB website.