How to prepare test data for textsum?


I have been able to successfully run the pre-trained model of TextSum (TensorFlow 1.2.1). The output consists of summaries of CNN & Dailymail articles (which are chunked into bin format prior to testing).

I have also been able to create the aforementioned bin-format test data for the CNN/Dailymail articles and the vocab file (per the instructions here). However, I am not able to create my own test data to check how good the summary is. I have tried modifying the make_datafiles.py code to remove the hard-coded values. I am able to create the tokenized files, but the next step seems to be failing. It would be great if someone could help me understand what the url_lists are being used for. Per the GitHub readme:

"For each of the url lists all_train.txt, all_val.txt and all_test.txt, the corresponding tokenized stories are read from file, lowercased and written to serialized binary files train.bin, val.bin and test.bin. These will be placed in the newly-created finished_files directory."

How is a URL such as http://web.archive.org/web/20150401100102id_/http://www.cnn.com/2015/04/01/europe/france-germanwings-plane-crash-main/ being mapped to the corresponding story in my data folder? If someone has had success with this, please do let me know how to go about this. Thanks in advance!

1 Answer

KRW4 (accepted answer):

Update: I was able to figure out how to use my own data to create bin files for testing (and avoid using url_lists altogether).
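To address the URL question: as far as I can tell from the original make_datafiles.py, the url_lists are only used to select and name the stories. Each URL is SHA1-hashed, and the script then expects a tokenized file named <sha1(url)>.story in the stories folder. Roughly like this (hashhex mirrors the helper in that script; the URL is just the example from the question):

```python
import hashlib

def hashhex(s):
    """Return the hex digest of the SHA1 hash of the string s."""
    h = hashlib.sha1()
    h.update(s.encode('utf-8'))
    return h.hexdigest()

url = ("http://web.archive.org/web/20150401100102id_/"
       "http://www.cnn.com/2015/04/01/europe/france-germanwings-plane-crash-main/")

# The tokenized story for this URL is expected at <stories_dir>/<sha1(url)>.story
story_filename = hashhex(url) + ".story"
print(story_filename)
```

For your own articles there is no such URL, which is why skipping the url_lists step entirely is simpler.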

This will be helpful - https://github.com/dondon2475848/make_datafiles_for_pgn
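For reference, the core of writing your own test data looks roughly like this (a minimal sketch assuming the tf.Example-based binary format that textsum / pointer-generator reads; the write_bin helper and the sample pairs are my own names, and the article/abstract strings should already be tokenized and lowercased):

```python
import struct
from tensorflow.core.example import example_pb2

def write_bin(article_abstract_pairs, out_file):
    """Write (article, abstract) string pairs to a textsum-style .bin file.

    Each record is a serialized tf.Example preceded by its length packed
    as an 8-byte integer, which is what the textsum batch reader expects.
    """
    with open(out_file, 'wb') as writer:
        for article, abstract in article_abstract_pairs:
            tf_example = example_pb2.Example()
            tf_example.features.feature['article'].bytes_list.value.extend(
                [article.encode('utf-8')])
            # Abstract sentences are wrapped in <s>...</s> tags, as in make_datafiles.py
            tf_example.features.feature['abstract'].bytes_list.value.extend(
                [abstract.encode('utf-8')])
            tf_example_str = tf_example.SerializeToString()
            writer.write(struct.pack('q', len(tf_example_str)))
            writer.write(struct.pack('%ds' % len(tf_example_str), tf_example_str))

# Example: one test article with its reference summary
pairs = [("my tokenized , lowercased article text .",
          "<s> my tokenized , lowercased summary . </s>")]
write_bin(pairs, 'finished_files/test.bin')
```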

I will update this answer once I figure out how to fix the ROUGE scoring for this.