GloVe algorithm: reading the cooccurrence.bin file contents in Python

GloVe implementation: I am trying to get the cooccurrence values for word pairs by reading the binary cooccurrence.bin file in Python. This file is produced in the third step, as a result of running the cooccur program.

Has anyone tried this? It seems like there are three values for each pair plus an index:

typedef struct cooccur_rec_id { int word1; int word2; real val; int id; } CRECID;

When the records are written out, though, I see only three values:

  • index of the first word (integer - 4 bytes)
  • index of the second word (integer - 4 bytes)
  • cooccurrence value (real - 4 bytes? 8 bytes?)

This is what I inferred from looking at the cooccur program. I can't seem to get the right number of bytes to read. Reading 16 bytes at a time, I get the integers correctly, but the cooccurrence value doesn't make sense.

Any help would be appreciated.
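
One layout that would be consistent with the 16-byte observation is two 4-byte integers followed by a single 8-byte double, with no trailing id field written to disk. The following is only a minimal sketch under that assumption: it has not been checked against the GloVe source here, it assumes the file was written on a little-endian machine with no padding between fields, and `read_cooccurrences` is a hypothetical helper name, not part of GloVe.

```python
import struct

# Assumed record layout: int word1, int word2, double val (4 + 4 + 8 = 16 bytes).
# "<" selects little-endian with no alignment padding; both are assumptions
# about the machine and compiler that produced the file.
RECORD = struct.Struct("<iid")


def read_cooccurrences(path):
    """Yield (word1, word2, value) tuples from a binary cooccurrence file."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(RECORD.size)
            if len(chunk) < RECORD.size:
                break  # end of file, or a truncated trailing record
            yield RECORD.unpack(chunk)
```

It could be used as `for w1, w2, val in read_cooccurrences("cooccurrence.bin"): ...`. If the values still look wrong, the other candidate from the list above is a 4-byte float, which would mean the format string `"<iif"` and 12-byte records instead.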
