Maximum Doc2vec similarity between observation and subset at given point in time

I have a large dataframe (about 30,000 observations) named database_finale. The columns relevant for this post are:

  • index1: identifies each observation and is the tag in the doc2vec
  • app_date2: is the date of each observation written as an integer
  • clean_desc: is the cleaned and tokenized text for the similarity
  • snow_pat: is a dummy variable that identifies if the observation belongs to category "snow".

I want to create a new column containing, for each "non-snow" observation, the maximum textual similarity between that observation and all the "snow" observations existing up to its date. As the focal observation's date advances, the set of "snow" reference observations grows.
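
To make this concrete, here is a toy illustration with invented values (not my real data):

import pandas as pd

database_finale = pd.DataFrame({
    "index1":    ["a", "b", "c", "d"],
    "app_date2": [20100101, 20100115, 20100201, 20100301],
    "clean_desc": [["snow", "plow"], ["ice", "road"],
                   ["snow", "shovel"], ["salt", "truck"]],
    "snow_pat":  [1, 0, 1, 0],
})
# For non-snow observation "b" (date 20100115), only snow observation "a"
# exists on or before that date, so its new column value is sim(b, a).
# For non-snow observation "d" (date 20100301), both "a" and "c" qualify,
# so its value is max(sim(d, a), sim(d, c)).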

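For reference, the corpus and model were built roughly along these lines (a minimal sketch, assuming index1 values are used as the document tags; my actual hyperparameters differ):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# one TaggedDocument per row, tagged with index1 so each document
# can later be looked up in the model by that tag
corpus_for_doc2vec = [
    TaggedDocument(words=row.clean_desc, tags=[row.index1])
    for row in database_finale.itertuples()
]
model = Doc2Vec(corpus_for_doc2vec, vector_size=100, epochs=20)
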
After identifying the corpus and training the model, I have tried this code on a subsample of 200 observations, and it seems to do the job:

import pandas as pd

datab = []
prova = 0  # progress counter for the print below
#daytuple is a tuple of app_date2 values, index_tuple is a tuple of index1 values
for i, j in [(i, j) for i in daytuple for j in index_tuple]:
    #extract the row of the focal non-snow observation for which I want the max similarity to snow
    m = database_finale.loc[(database_finale.index1 == j) & (database_finale.app_date2 == i) & (database_finale["snow_pat"] != 1)]
    #if the extracted row is non-empty
    if not m.empty:
        #extract all the "snow" observations up to day i
        l = database_finale.loc[(database_finale.app_date2 <= i) & (database_finale["snow_pat"] == 1)]
        #if this one is also non-empty
        if not l.empty:
            #list of the "snow" references
            reference_list1 = l["index1"].tolist()
            #dictionaries in which to store the most similar tag and its similarity score
            most_similars_by_key = {}
            most_similars_by_key_2 = {}
            #model was trained before and corpus was already defined
            for doc in corpus_for_doc2vec:
                #select the tag of the focal observation
                if doc.tags[0] == j:
                    #among the snow observations up to day i, find the tag most similar to the focal observation
                    most_similars_by_key[doc.tags[0]] = model.docvecs.most_similar_to_given(j, reference_list1)
                    #most_similar_to_given returns only the tag, not the score, so look up the similarity separately
                    for key in most_similars_by_key:
                        maxim = most_similars_by_key[key]
                        sim_score = model.docvecs.similarity(key, maxim)
                        most_similars_by_key_2[key] = sim_score
                        print("prova" + str(prova))
                        prova = prova + 1
                        #merge the most similar observation and its score onto the original frame
                        db1 = pd.DataFrame.from_dict(most_similars_by_key, orient='index')
                        db1.reset_index(inplace=True)
                        db1 = db1.rename(columns={"index": "index1", 0: "most_similar_of_snow"})
                        db2 = pd.DataFrame.from_dict(most_similars_by_key_2, orient='index')
                        db2.reset_index(inplace=True)
                        db2 = db2.rename(columns={"index": "index1", 0: "Similar_doc2vec_desc"})
                        db3 = pd.merge(left=database_finale, right=db1, how="left", on=["index1"])
                        db4 = pd.merge(left=db3, right=db2, how="left", on=["index1"])
                        #keep only the rows where a similarity score was filled in, then collect
                        db4 = db4[db4["Similar_doc2vec_desc"].notna()]
                        datab.append(db4)
                else:
                    continue
        else:
            continue
    else:
        continue
#create the final DB
datab = pd.concat(datab)

As mentioned, this code seems to work, but applied to the full 30,000 observations it is extremely slow. Can anyone help me optimize the code to speed up the computation?

I have looked into parallelizing the process, but I am not really familiar with that practice, and it seems it would require rewriting this for loop as a function, which I am not sure I have the skills to do.

1 Answer

Your code is rather hard to follow, so these are hunches rather than sure things:

  • the potential double-loop implied in [(i,j) for i in daytuple for j in index_tuple] may be generating more combinations than strictly necessary; have you reviewed its output to ensure it's minimal/sensible?

  • keeping things "in" Pandas structures may be adding extra indirection/complexity; with only 30000 items, you may just want to make them all plain Python dicts, in a list, in ascending date order.

  • the else: continue formulations seem superfluous, because they all appear in places where execution would have continued to the next iteration anyway.

  • generally, both .most_similar_to_given() and even .similarity() should be avoided if you need to do bulk similarity calculations, because they do things one at a time, inside Python loops. Instead, try to use most_similar() to get back a large batch of results all at once - it will use optimized bulk calculations.

In very high-level pseudocode, working only on Python-dicts, a more-focused approach might be very roughly:

earlier_snow_observations = set()
for observation in all_observations_earliest_to_latest:
    if observation['snow_pat']:
        earlier_snow_observations.add(observation['index1'])
        continue  # no need to find nearest-preceding
    all_ranked_similar = d2v_model.dv.most_similar(observation['index1'], topn=len(all_observations))
    for other_id, sim in all_ranked_similar:
        if other_id in earlier_snow_observations:
            observation['earlier_snow_closest'] = (other_id, sim)
            break  # exit early
    else:
        # no earlier snow observations
        observation['earlier_snow_closest'] = None  # or maybe just pass?

At the end, each non-snow observation dict will have, under its earlier_snow_closest key, an (id, similarity) tuple for the most-doc-vector-similar earlier snow item.
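
For completeness, here is a rough, untested translation of that pseudocode back onto your DataFrame (assuming gensim 4.x, where model.dv replaces the older model.docvecs, that model is your trained Doc2Vec object, and that your index1 values were the training tags):

earlier_snow = set()
closest = {}  # index1 -> (snow index1, similarity), or None if no earlier snow exists
# walk the rows in ascending date order, so "earlier" snow rows accumulate as we go
for obs in database_finale.sort_values('app_date2').to_dict('records'):
    if obs['snow_pat'] == 1:
        earlier_snow.add(obs['index1'])
        continue
    # one bulk call ranks every other document by similarity, most-similar first
    ranked = model.dv.most_similar(obs['index1'], topn=len(database_finale))
    # the first ranked hit that is an earlier snow observation has the maximum similarity
    closest[obs['index1']] = next(
        ((other, sim) for other, sim in ranked if other in earlier_snow),
        None,
    )
database_finale['Similar_doc2vec_desc'] = database_finale['index1'].map(
    lambda k: closest[k][1] if closest.get(k) else None
)

One caveat: rows sharing the same app_date2 are processed in whatever order the sort leaves them, so a same-date snow row may or may not count as "earlier"; adjust the sort if that matters for your data.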