I am trying to "recreate" music lyrics from term frequency counts. I have two source data files. The first is simply a list of the 5000 most-used terms in the corpus of lyrics I'm using, ranked in order from most used (1) to least used (5000). The second file is the lyrics corpus itself, composed of over 200,000 songs.
Each "song" is a comma-delimited string as follows:
SONGID1,SONGID2,1:13,2:10,4:6,7:15,....
where the first two entries are the ID tags of the song, followed by the terms (the numbers to the left of the colons) and the number of times that term is used in the song (the numbers to the right of the colons). In the example above, this would mean that "I" (the first entry "1" in the 5000 most-used terms) occurs 13 times in this given song, while "the" (the second-most used term) occurs 10 times, and so on.
What I want to do is go from this termID:termCount format to actually "recreating" the original (albeit scrambled) lyrics: replace each number to the left of a colon with the actual term, then repeat that term as many times as the count to the right of the colon indicates. Again, using the short example above, my preferred resulting output would be:
SONGID1, SONGID2, I I I I I I I I I I I I I the the the the the the the the the the and and and and and and and...
and so on. Thanks!
Perhaps the following (untested) will inspire you. You didn't say how you wanted the result output, so you may want to change the print() calls to file writes or something.
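A minimal sketch of that idea, assuming the term list is a plain text file with one term per line in rank order (the question doesn't say how that file is laid out). The function names `load_terms`, `expand_song`, and `recreate`, and the four-term demo list, are made up for illustration:

```python
def load_terms(lines):
    """Map 1-based rank to term. Assumes one term per line, in rank order."""
    return {rank: line.strip() for rank, line in enumerate(lines, start=1)}


def expand_song(line, terms):
    """Turn 'ID1,ID2,1:13,2:10,...' into 'ID1, ID2, I I I ... the the ...'."""
    fields = line.strip().split(",")
    song_ids, pairs = fields[:2], fields[2:]
    words = []
    for pair in pairs:
        if not pair:  # tolerate a trailing comma
            continue
        term_id, count = pair.split(":")
        # Repeat the term once per occurrence.
        # A term ID that is missing from the list raises KeyError.
        words.extend([terms[int(term_id)]] * int(count))
    return ", ".join(song_ids) + ", " + " ".join(words)


def recreate(terms_path, lyrics_path):
    """Yield one expanded line per song; both paths are placeholders."""
    with open(terms_path, encoding="utf-8") as f:
        terms = load_terms(f)
    with open(lyrics_path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield expand_song(line, terms)


# Tiny in-memory demo with a made-up four-term list.
demo_terms = load_terms(["I", "the", "you", "and"])
print(expand_song("SONGID1,SONGID2,1:3,2:2,4:1", demo_terms))
# prints: SONGID1, SONGID2, I I I the the and
```

Looping over `recreate(...)` handles one song at a time, so the 200,000+ songs never have to sit in memory at once; replace the print() with a file write if that suits you better.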