LSH - Binary matrix representation from shingles


I have a large dataset of news articles, 48,000 to be precise. I have generated ngrams for each article with n = 3. My ngrams look like this:

[[(tikro, enters, into), (enter, into, research), (into, research, and),...]] 

Now I need to build a binary matrix of shingles against articles:

          article1 article2 article3
shingle1     1        0        0
shingle2     1        0        1
shingle3     0        1        0

First, I put all the shingles in a single list. Then I tried this to check whether it works:

for art in article:
    for sh in ngrams:
        if sh in art:
            print('found')

Since one is a set and the other is a string, it does not work. Any suggestions on how to make it work, or any other approach?

Thank you.

Best answer

Before searching for shingles in the articles, you could use join to concatenate the words of each shingle into a three-word phrase.

For example, suppose we have ngrams like:

ngrams = [('tikro', 'enters', 'into'),
          ('enter', 'into', 'research'),
          ('into', 'research', 'and')]

Then we concatenate the words of each shingle into a phrase:

shingles = [' '.join(x) for x in ngrams]

After the transformation, shingles looks something like:

['tikro enters into', 
 'enter into research', 
 'into research and']

These are strings that you can search for in your articles.
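
To go from the joined shingles to the binary matrix asked about in the question, a minimal sketch could look like the following. The two sample articles and the helper name binary_matrix are made up for illustration; the sketch assumes each article is a plain string whose words are separated by single spaces, in the same form the ngrams were built from.

```python
# Hypothetical sample data: the real articles come from the asker's dataset.
articles = [
    "tikro enters into research and development",
    "the company will enter into research and testing",
]

ngrams = [('tikro', 'enters', 'into'),
          ('enter', 'into', 'research'),
          ('into', 'research', 'and')]

# Join each 3-word tuple into a single phrase, as described above.
shingles = [' '.join(x) for x in ngrams]


def binary_matrix(shingles, articles):
    """Return a list of rows, one per shingle, with one 0/1 entry per article."""
    return [[1 if sh in art else 0 for art in articles] for sh in shingles]


matrix = binary_matrix(shingles, articles)

# Print each shingle next to its row of the matrix.
for sh, row in zip(shingles, matrix):
    print(sh, row)
```

Note that `sh in art` is a plain substring test, so a shingle such as 'enter into research' would also match inside 'center into research'. If that matters, one option is to compare tuples against a set of each article's own ngrams instead of searching strings. With 48,000 articles a dense list of lists may also become large, so storing only the positions of the 1s (for example, a set of article indices per shingle) may be more practical.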