I don't understand the point of this function returning two variables, which are the same:
def construct_shingles(doc,k,h):
    #print 'antes -> ',doc,len(doc)
    doc = doc.lower()
    doc = ''.join(doc.split(' '))
    #print 'depois -> ',doc,len(doc)
    shingles = {}
    for i in xrange(len(doc)):
        substr = ''.join(doc[i:i+k])
        if len(substr) == k and substr not in shingles:
            shingles[substr] = 1
    if not h:
        return doc,shingles.keys()
    ret = tuple(shingles_hashed(shingles))
    return ret,ret
This seems redundant, but there must be a good reason for it; I just don't see what it is. Perhaps it is because there are two return statements? If 'h' is true, does the function execute both return statements? The calling function and the helper look like:
def construct_set_shingles(docs,k,h=False):
    shingles = []
    for i in xrange(len(docs)):
        doc = docs[i]
        doc,sh = construct_shingles(doc,k,h)
        docs[i] = doc
        shingles.append(sh)
    return docs,shingles
and
def shingles_hashed(shingles):
    global len_buckets
    global hash_table
    shingles_hashed = []
    for substr in shingles:
        key = hash(substr)
        shingles_hashed.append(key)
        hash_table[key].append(substr)
    return shingles_hashed
The data set and function call look like:
k = 3 #number of shingles
d0 = "i know you"
d1 = "i think i met you"
d2 = "i did that"
d3 = "i did it"
d4 = "she says she knows you"
d5 = "know you personally"
d6 = "i think i know you"
d7 = "i know you personally"
docs = [d0,d1,d2,d3,d4,d5,d6,d7]
docsChange,shingles = construct_set_shingles(docs[:],k)
The github location: lsh/LHS
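Your guess is on the right track: there are two return statements, but only one of them executes on any given call. As for why the function ends with return ret,ret: that statement returns a pair of equal values rather than a single value, because the caller unpacks the result into two variables. It is more a matter of coding style than of the algorithm, since the same effect can be achieved with other syntax. It does have an advantage in some cases. For example, if the caller filled two variables a and b by calling func once for each of them, func would be executed twice. If instead a function stores the result of func in ret and returns ret,ret, func is executed only once while still supplying a value to both a and b.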
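The difference between the two variants can be sketched as follows. This is only an illustration: func, pair, and the calls counter are hypothetical names, not taken from the code in the question.

```python
calls = []  # records every time func runs

def func():
    # Hypothetical stand-in for an expensive computation.
    calls.append(1)
    return 42

# Variant 1: the caller invokes func once per variable.
a = func()
b = func()
calls_variant_1 = len(calls)  # func ran twice

del calls[:]  # reset the counter

def pair():
    # Compute once, then hand the same object back in both slots.
    ret = func()
    return ret, ret

# Variant 2: a single call, unpacked into two names.
a, b = pair()
calls_variant_2 = len(calls)  # func ran once
```

In the second variant a and b refer to the same object, which is what return ret,ret produces in construct_shingles as well.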
Also, in your particular case:
If h is false, the function runs until the line return doc,shingles.keys() and exits there, so the variables doc and sh in construct_set_shingles take the values of doc and shingles.keys() respectively.
Otherwise, the first return statement is skipped, the second one is executed, and both doc and sh take the same value, namely the value of tuple(shingles_hashed(shingles)).
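A minimal sketch of the two branches, assuming a simplified stand-in for construct_shingles: the name construct_shingles_sketch is hypothetical, the global hash_table is left out, and range and sorted replace xrange and the dict keys so that it also runs on Python 3.

```python
def construct_shingles_sketch(doc, k, h):
    # Simplified stand-in for construct_shingles (no global hash_table).
    doc = ''.join(doc.lower().split(' '))
    # Collect every distinct substring of length k.
    shingles = sorted({doc[i:i + k] for i in range(len(doc) - k + 1)})
    if not h:
        return doc, shingles      # the function exits here when h is false
    ret = tuple(hash(s) for s in shingles)
    return ret, ret               # reached only when h is true

# h is false: the caller gets the cleaned string and the shingle list.
doc, sh = construct_shingles_sketch("i did it", 3, False)

# h is true: both names are bound to the same tuple of hashes.
hashed_doc, hashed_sh = construct_shingles_sketch("i did it", 3, True)
```

Either way the caller's unpacking doc,sh = ... receives exactly two values, which is why the hashed branch returns the tuple twice instead of once.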