Vector Space Model query - set of documents search


I'm trying to write code for a VSM (vector space model) search in C. Using a collection of documents, I built a hash table (an inverted index) in which each slot holds a word along with its df and a pointer to a list. Each node of that list holds the name of a document in which the word appears at least once, along with the tf (how many times the word appears in that document). The user types a query (and also chooses a qqq.ddd weighting scheme and a comparison method, but that doesn't matter for my question), and I have to print the documents that are relevant to it, from the most relevant to the least relevant.

The examples I've seen show the steps for only one document. For example: we have a collection of 1,000,000 documents (N = 1,000,000) and we want to compare

1 document: car insurance auto insurance
with the query: best car insurance

The example builds a table like this:

Term      | Query tf | Document tf
auto      |    0     |      1
best      |    1     |      0
car       |    1     |      1
insurance |    1     |      2

The example also gives the df for each term, so with these values and the chosen weighting and comparison methods it's easy to compare the two: each one is turned into a vector by computing its 4 coordinates (one for each term in the table). So although the collection has 1,000,000 documents, scoring this document against the query uses only the 4 distinct words that appear in the query or in the document. We compute the 4 coordinates of each vector and then compare them. In what I'm trying to do there are about 8000 documents, each with 3 to 50 words. So how am I supposed to measure how relevant a query is to each document? If I have

a query: ping pong 
document 1: this is ping kong
document 2: i am ping tongue

To compare the query with document 1, I would use the words: this, is, ping, kong, pong (5 coordinates), and to compare the query with document 2, I would use the words: i, am, ping, tongue, pong (5 coordinates). Then, since I use the same comparison method for both, is the document with the highest score the most relevant? Or do I have to use the same set of words for both comparisons: this, is, ping, kong, i, am, tongue, pong (8 coordinates)? So my question is: which is the right way to compare all these 8000 documents with the query? I hope I managed to make my question easy to understand. Thank you for your time!
