How to detect question similarity using Locality Sensitive Hashing?


We are trying to implement question similarity detection using Locality Sensitive Hashing (LSH). We are using the lshash Python package.

Our objective is to achieve something similar to how question suggestions work on Stack Overflow.

Following is our sample data text file:
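For "suggest similar questions", a commonly used LSH family is MinHash, which approximates the Jaccard similarity of word sets rather than hashing raw character codes. A minimal self-contained sketch of the idea (not using lshash; the word splitting and the 64-hash signature size are illustrative choices):

```python
import zlib

def word_set(text):
    """Lowercased word set of a question (a simple shingling choice)."""
    return set(text.lower().split())

def minhash_signature(words, num_hashes=64):
    """One minimum over a seeded hash per signature slot; two signatures agree
    in a slot with probability equal to the Jaccard similarity of the sets."""
    return [min(zlib.crc32(("%d %s" % (seed, w)).encode("utf8")) for w in words)
            for seed in range(num_hashes)]

def estimated_jaccard(sig1, sig2):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig1, sig2)) / float(len(sig1))

a = minhash_signature(word_set("what is photosynthesis"))
b = minhash_signature(word_set("what is mathematics"))
c = minhash_signature(word_set("where is the nearest bank"))

print(estimated_jaccard(a, b))  # true Jaccard is 2/4 = 0.5
print(estimated_jaccard(a, c))  # true Jaccard is 1/7 ~ 0.14
```

Banding the signature into hash tables then gives the candidate-lookup step that lshash's random projections are meant to provide.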

    The food didn't taste very good, and actually I don't feel very well now
    He can pull strings for you
    I saw him???
    The blue SUV in front of the Honda
    I gave my seat to the old lady
    Susan spent the summer vacation at her grandmother's
    Do you want anything to eat?
    A water molecule has two hydrogen atoms and one oxygen atom
    He's away on business
    Are you here for work?
    I had a strange dream last night
    The boy began to cry
    She pointed her finger at him
    No matter who says so, it's not true
    May I have a receipt?
    She loves him
    Where is the nearest bank?
    Tired from the hard work, he went to bed earlier than usual
    He has not written to them for a long time
    Do you have any brothers?
    I have to buy a new pair of skis
    Winter is my favorite season
    Why did this happen?
    Tom seems very happy
    It was cold, so we lit a fire
    I look forward to my birthday
    She attacked him with a baseball bat
    You're a really good cook
    That's too much
    I expect a subway station will be here in the future
    what is photosynthesis?
    what is mathematics?
    do you know about photosynthesis?

Following is the Python code (Python 2):

    from lshash import LSHash
    from nltk.corpus import stopwords

    # CONSTANTS
    HASH_SIZE = 16
    INPUT_DIMENSION = 50
    NUM_HASHTABLES = 20
    INPUT_FILE = 'test-cases.txt'

    lsh = LSHash(HASH_SIZE, INPUT_DIMENSION, NUM_HASHTABLES)
    cachedStopWords = stopwords.words("english")
    dict_questions = {}
    dict_no_stop_questions = {}
    dict_ascii_questions = {}

    def remove_stop(text):
        return ' '.join(word for word in text.split() if word not in cachedStopWords)

    def remove_special_chars(text):
        return ''.join(e for e in text if e.isalnum() or e.isspace())

    def append_dummy(arr):
        # Zero-pad the vector up to INPUT_DIMENSION entries.
        if len(arr) < INPUT_DIMENSION:
            arr.extend([0] * (INPUT_DIMENSION - len(arr)))

    def get_original_form(search_item):
        # Map a matched vector back to the original question text.
        f_key = -1
        for key, value in dict_ascii_questions.iteritems():
            if value[:INPUT_DIMENSION] == list(search_item[0]):
                f_key = key
                break
        if f_key != -1:
            return dict_questions[f_key] + " # " + dict_no_stop_questions[f_key]
        else:
            return ""

    with open(INPUT_FILE, 'r') as f:
        questions = f.readlines()

    for index, question in enumerate(questions):
        dict_questions[index] = question
        dict_no_stop_questions[index] = remove_stop(remove_special_chars(question.lower()))
        # Encode each remaining character as its ordinal value.
        value = [ord(c) for c in dict_no_stop_questions[index]]
        append_dummy(value)
        dict_ascii_questions[index] = value

    for key, value in dict_ascii_questions.iteritems():
        lsh.index(value[:INPUT_DIMENSION])

    query = raw_input("Type n to exit. Input Query? = ")
    while query != "n":
        aq = [ord(c) for c in remove_stop(remove_special_chars(query.lower()))]
        append_dummy(aq)
        results = lsh.query(aq[:INPUT_DIMENSION], 5)
        print "Found: " + str(len(results))
        for result in results:
            # result is ((vector), distance); a smaller distance is a closer match
            print "Distance: " + str(result[1]) + "  " + get_original_form(result)
        query = raw_input("Type n to exit. Input Query? = ")

But this implementation gives bad results. Can someone guide us on which type of Locality Sensitive Hashing algorithm to use in our context? I am also confused by the INPUT_DIMENSION parameter.
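On INPUT_DIMENSION: lshash hashes fixed-length numeric vectors with random projections, so INPUT_DIMENSION is simply the length of every vector passed to lsh.index() and lsh.query(); it knows nothing about text. Truncating or zero-padding character ordinals to 50 entries means two questions only land in the same bucket if they share characters at the same positions, which is why paraphrases score badly. A hedged sketch of a representation that suits random-projection LSH better: a fixed-vocabulary bag-of-words vector (the corpus, tiny stopword list, and vocabulary here are illustrative):

```python
import math
import re
from collections import Counter

STOP = {"what", "is", "do", "you", "about"}  # tiny illustrative stopword list

corpus = [
    "what is photosynthesis?",
    "what is mathematics?",
    "do you know about photosynthesis?",
]

def tokens(text):
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP]

# Fixed vocabulary built from the corpus; its size is what INPUT_DIMENSION
# should be: the length of every vector handed to lsh.index() / lsh.query().
vocab = sorted({w for q in corpus for w in tokens(q)})
INPUT_DIMENSION = len(vocab)

def vectorize(text):
    counts = Counter(tokens(text))
    return [counts[w] for w in vocab]  # always exactly INPUT_DIMENSION long

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

vecs = [vectorize(q) for q in corpus]
# Questions sharing the content word "photosynthesis" are close:
print(round(cosine(vecs[0], vecs[2]), 3))  # 0.707
# Questions with no content words in common are orthogonal:
print(cosine(vecs[0], vecs[1]))  # 0.0
```

With this encoding the vector positions carry meaning (one dimension per word), so the random hyperplanes that lshash draws actually separate questions by shared vocabulary rather than by coincidental character positions.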
