Recreating lyrics (words) from term frequency counts (numbers)


I am trying to "recreate" music lyrics from term frequency counts. I have two source data files. The first is simply a list of the 5000 most-used terms in the corpus of lyrics I'm using, ranked in order from most used (1) to least used (5000). The second file is the lyrics corpus itself, composed of over 200,000 songs.

Each "song" is a comma-delimited string as follows:

SONGID1,SONGID2,1:13,2:10,4:6,7:15,....

where the first two entries are the ID tags of the song, followed by the terms (the numbers to the left of the colons) and the number of times that term is used in the song (the numbers to the right of the colons). In the example above, this would mean that "I" (the first entry "1" in the 5000 most-used terms) occurs 13 times in this given song, while "the" (the second-most used term) occurs 10 times, and so on.

I want to go from this termID:termCount format to "recreating" the original (albeit scrambled) lyrics: replace each number to the left of a colon with the actual term, then repeat that term as many times as the count to the right of the colon. Using the short example above, my preferred output would be:

SONGID1, SONGID2, I I I I I I I I I I I I I the the the the the the the the the the and and and and and and ...

and so on. Thanks!


There is 1 solution below


Perhaps the following (untested) Processing sketch will inspire you. You didn't say how you want the output delivered, so you may want to change the print() calls to file writes or something similar.

// Assumes each word is on its own line, sorted from most to least common.
String[] words = loadStrings("words.txt");

// Two ways to read the songs:
//   loadStrings() again, which uses a lot of memory for big files, or
//   a BufferedReader, which is more involved but works well for large files.
BufferedReader reader = createReader("songs.txt");
try {
  String line = reader.readLine();
  while (line != null) {
    String[] data = line.split(",");
    print(data[0] + ", " + data[1] + ", "); // the two song IDs
    for (int i = 2; i < data.length; i++) {
      String[] pair = data[i].split(":");
      // The question's term IDs start at 1, but the array indexes from 0,
      // so subtract 1 (assumes words.txt has no header line).
      String word = words[int(pair[0]) - 1];
      // Inelegant, but clear: print the word once per occurrence.
      for (int j = 0; j < int(pair[1]); j++) {
        print(word + " ");
      }
    }
    println();
    line = reader.readLine();
  }
  reader.close();
} catch (IOException e) {
  // readLine() and close() can throw, so the loop has to sit in a try block.
  e.printStackTrace();
}
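
For anyone working outside Processing, the same expansion can be sketched in plain Java. This is only an illustration, not a tested solution for the real data: the class name LyricsExpander, the method expand, and the four-word list are made up for the example. It assumes term IDs are 1-based, as the question describes, and that every field after the two song IDs has the form termID:count.

```java
import java.util.ArrayList;
import java.util.List;

class LyricsExpander {
    // Expands one record such as "SONGID1,SONGID2,1:3,2:2" into
    // "SONGID1, SONGID2, I I I the the".
    // Assumption: term IDs are 1-based positions in the words array.
    static String expand(String record, String[] words) {
        String[] fields = record.trim().split(",");
        List<String> out = new ArrayList<>();
        for (int i = 2; i < fields.length; i++) {
            String[] pair = fields[i].split(":");
            int termId = Integer.parseInt(pair[0].trim());
            int count = Integer.parseInt(pair[1].trim());
            String word = words[termId - 1]; // 1-based ID to 0-based index
            for (int j = 0; j < count; j++) {
                out.add(word);
            }
        }
        return fields[0] + ", " + fields[1] + ", " + String.join(" ", out);
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the real 5000-term list.
        String[] words = {"I", "the", "you", "and"};
        System.out.println(expand("SONGID1,SONGID2,1:3,2:2,4:1", words));
    }
}
```

Returning a string instead of printing inside the loop makes it easy to send each expanded song to a file or a test rather than the console.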