I have a large dataframe (about 30000 obs) named database_finale. The columns relevant for this post are:
- index1: identifies each observation and is used as the document tag in the doc2vec model
- app_date2: the date of each observation, stored as an integer
- clean_desc: the cleaned and tokenized text used for the similarity computation
- snow_pat: a dummy variable flagging whether the observation belongs to the "snow" category.
I want to create a new column containing, for each "non-snow" observation, the maximum textual similarity to all the "snow" observations that exist up to the date of that focal observation. As the focal observation's date increases, the set of "snow" reference observations grows accordingly.
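To make the structure concrete, here is a miniature made-up version of the relevant columns (the values are invented; the real dataframe has about 30000 rows):

import pandas as pd

database_finale = pd.DataFrame({
    "index1":     ["pat_1", "pat_2", "pat_3", "pat_4"],
    "app_date2":  [20100105, 20100105, 20100212, 20100330],
    "clean_desc": [["snow", "removal", "blade"],
                   ["water", "pump", "seal"],
                   ["snow", "plough", "mount"],
                   ["hydraulic", "pump", "valve"]],
    "snow_pat":   [1, 0, 1, 0],
})
#for pat_4 (non-snow, date 20100330) I want its maximum similarity to pat_1 and pat_3,
#since those are the "snow" observations with app_date2 <= 20100330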
After identifying the corpus and training the model, I tried this code on a subsample of 200 observations and it seems to do the job:
import pandas as pd

datab=[]
prova=0 #simple progress counter
#daytuple is a tuple of app_date2 values, index_tuple is a tuple of index1 values
for i, j in [(i,j) for i in daytuple for j in index_tuple]:
    #here I extract the row of the focal non-snow observation for which I want to find the max similarity to snow
    m=database_finale.loc[(database_finale.index1 == j) & (database_finale.app_date2 == i) & (database_finale["snow_pat"] != 1)]
    result=m.empty
    #if the extracted row is non-empty
    if result == False:
        #here I extract all the "snow" observations up to day i
        l=database_finale.loc[(database_finale.app_date2 <= i) & (database_finale["snow_pat"] == 1)]
        result2=l.empty
        #if this one is also non-empty
        if result2 == False:
            #I create a list of the "snow" references
            reference_list1=l["index1"].tolist()
            #dictionaries in which to store the most similar tag and its similarity score
            most_similars_by_key = {}
            most_similars_by_key_2 = {}
            #the model was trained before and corpus_for_doc2vec was already defined
            for doc in corpus_for_doc2vec:
                #select the tag of the focal patent
                if doc.tags[0] == j:
                    #among the "snow" references up to day i, find the one most similar to the focal observation
                    most_similars_by_key[doc.tags[0]] = model.docvecs.most_similar_to_given(j, reference_list1)
                    #this gives the tag of the most similar "snow" observation but not the similarity score, so I extract the score
                    for key in most_similars_by_key:
                        maxim = most_similars_by_key[key]
                        sim_score = model.docvecs.similarity(key, maxim)
                        most_similars_by_key_2[key] = sim_score
                    print("prova"+str(prova))
                    prova=prova+1
                    #I merge the most similar observation and the similarity score back onto the original dataframe and append the result to a list
                    db1=pd.DataFrame.from_dict(most_similars_by_key, orient='index')
                    db1.reset_index(inplace=True)
                    db1=db1.rename(columns={"index": "index1", 0: "most_similar_of_snow"})
                    db2=pd.DataFrame.from_dict(most_similars_by_key_2, orient='index')
                    db2.reset_index(inplace=True)
                    db2=db2.rename(columns={"index": "index1", 0: "Similar_doc2vec_desc"})
                    db3=pd.merge(left=database_finale, right=db1, how="left", left_on=["index1"], right_on=["index1"])
                    db4=pd.merge(left=db3, right=db2, how="left", left_on=["index1"], right_on=["index1"])
                    #keep only the rows that actually received a similarity score
                    db4=db4[db4['Similar_doc2vec_desc'].notna()]
                    datab.append(db4)
                else:
                    continue
        else:
            continue
    else:
        continue
#here I create the final DB
datab = pd.concat(datab)
As mentioned, this code seems to work, but when I apply it to the full 30000 observations it is extremely slow. Can anyone help me optimize the code to speed up the computation?
I have tried to look into parallelizing the process, but I am not really familiar with this practice, and it seems it would require rewriting this for loop as a function, which I am not sure I have the skills to do.
Your code is rather hard to follow, so these are hunches rather than sure things:
- the potential double-loop implied in [(i,j) for i in daytuple for j in index_tuple] may be generating more combinations than strictly necessary; have you reviewed its output to check it is minimal and sensible?
- keeping things "in" Pandas structures may be adding extra indirection and complexity; with only 30000 items, you may just want to make them all plain Python dicts, held in a list in ascending date order.
- the else: continue formulations seem superfluous, because they all appear in places where the continue would have happened automatically anyway.
- in general, both .most_similar_to_given() and even .similarity() should be avoided if you need to do bulk similarity calculations, because they do things one at a time, inside Python code loops. Instead, try to use most_similar() to get back a large batch of results all at once - it will use optimized bulk calculations.

In very high-level pseudocode, working only on Python dicts, a more-focused approach might be very roughly the following.
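This is only a rough sketch, not tested code: it reuses the names from the question (database_finale, model, index1, app_date2, snow_pat), assumes the Doc2Vec tags are the index1 values, builds the list of plain dicts (observations) from the dataframe in its first line, and does the bulk step with plain NumPy cosine similarities rather than per-pair gensim calls.

import numpy as np
from collections import defaultdict

observations = database_finale.sort_values("app_date2").to_dict("records")

by_day = defaultdict(list)
for obs in observations:
    by_day[obs["app_date2"]].append(obs)

snow_tags, snow_vecs = [], []   #"snow" items seen so far: tags and doc-vectors

for day in sorted(by_day):
    #"snow" observations of the focal day also count as references (app_date2 <= focal date)
    for obs in by_day[day]:
        if obs["snow_pat"] == 1:
            snow_tags.append(obs["index1"])
            snow_vecs.append(model.docvecs[obs["index1"]])
    if not snow_tags:
        continue   #no "snow" references exist yet
    ref = np.vstack(snow_vecs)
    ref_norms = np.linalg.norm(ref, axis=1)
    for obs in by_day[day]:
        if obs["snow_pat"] == 1:
            continue
        vec = model.docvecs[obs["index1"]]
        #one vectorised cosine-similarity calculation against all earlier "snow" items
        sims = ref @ vec / (ref_norms * np.linalg.norm(vec))
        best = int(np.argmax(sims))
        obs["earlier_snow_closest"] = (snow_tags[best], float(sims[best]))

(The essential point is that each focal item is compared against all earlier "snow" vectors in one vectorised operation, rather than via one most_similar_to_given() plus one similarity() call per focal item; you could equally batch through most_similar() with a large topn and keep only the "snow" tags.)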
At the end, each observation dict will have an (id, similarity) tuple in its earlier_snow_closest value, identifying the most-doc-vector-similar earlier "snow" item.