What is a more efficient way to perform this Pandas computation?


I am trying to build a usable NLP corpus but am bottlenecked by how long the program takes (200 hours). With this much data, even a small optimization will net me huge time savings down the road, so I wanted to post this code and ask for some advice on speeding it up. I added a load parameter so that if I have already built the keyword tables I can reload them instead of regenerating them, which is much faster, and I have only tested on a small portion of the dataset, not the whole thing. When generating the tables (save=True), kwargs expects a size parameter; when loading (load=True), it expects two file path parameters, index_dir and map_dir. The column "full_text" is the raw full text of an academic article, and "full_text_cleaned" is the same text lowercased with punctuation stripped. I can also share how I created this dataset if requested.
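For reference, a load call looks something like this (the paths follow the naming pattern that save mode writes out, so for the full 29000-entry sample the files would be keyword_index_29000.csv and keyword_map_29000.csv):

doc_finder = Doc_Finder(sample_corpus)
doc_finder.make_keywords(load=True,
                         index_dir='./keyword_index_29000.csv',
                         map_dir='./keyword_map_29000.csv')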

Sample Corpus: 29000 entries

paper_id title abstract full_text full_text_cleaned
string string string string string

Relevant code

import pandas as pd
from tqdm import tqdm
from nltk.corpus import stopwords

# isHyperlink is a small helper defined elsewhere in my code that returns True for URL-like tokens

class Doc_Finder():
    def __init__(self, corpus_file):
        # Load the whole corpus CSV into memory and remember where it came from
        self.corpus = pd.read_csv(corpus_file)
        self.corpus_dir = corpus_file

    def make_keywords(self, save=False, load=False, **kwargs):
        if save and load:
            raise Exception("Invalid Parameters")
        if save:
            # Build the (paper_id, keyword) index one row at a time
            self.keyword_index = pd.DataFrame(columns=['paper_id', 'keyword'])
            total = kwargs['size']
            self.keyword_map = []  # running list of every unique keyword seen so far
            for _, row in tqdm(self.corpus.head(total).iterrows(), total=total):
                s = []  # unique keywords for the current paper
                for i in str(row['full_text_cleaned']).split(' '):
                    if i not in s and not i.isnumeric() and i not in stopwords.words('english') and i.isalnum() and not isHyperlink(i):
                        s.append(i)
                        if i not in self.keyword_map:
                            self.keyword_map.append(i)
                # One row per (paper, keyword) pair
                for i in s:
                    self.keyword_index.loc[len(self.keyword_index.index)] = [row['paper_id'], i]

            if save:
                idx_dir = './keyword_index_' + str(total) + '.csv'
                self.keyword_index.to_csv(idx_dir, index=False)
                tempdf = pd.DataFrame(data=self.keyword_map, columns=['keyword'], dtype=str)
                map_dir = './keyword_map_' + str(total) + '.csv'
                tempdf.to_csv(map_dir, index=False)
        elif load:
            # Reload a previously generated index and keyword list
            self.keyword_index = pd.read_csv(kwargs['index_dir'])
            self.keyword_map = pd.read_csv(kwargs['map_dir'])['keyword'].tolist()

I run the code using this snippet

doc_finder = Doc_Finder(sample_corpus)
doc_finder.make_keywords(save=True, size=len(doc_finder.corpus['paper_id']))

The result is two CSVs:

The first is the keyword index, which has one paper_id and keyword pair per row, listing each keyword for each paper.

paper_id keyword
00033d5a12240a8684cfe943954132b43434cf48 extraction
00033d5a12240a8684cfe943954132b43434cf48 expected

The second is the set of all keywords from all the documents. This is an attempt to make the document retrieval I will be doing after this faster, as I can match a query against the set of keywords and then pull all the papers that have those keywords from the index table (a rough sketch of what I mean follows the sample below).

Keyword list

keyword
expected
...
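To make that retrieval idea concrete, here is a rough, untested sketch of the lookup I described, assuming the index and keyword list have already been loaded via make_keywords(load=True, ...) and that the query is tokenized the same way as the corpus:

keyword_set = set(doc_finder.keyword_map)  # set membership checks instead of list scans

def find_papers(query):
    # Keep only query terms that actually appear somewhere in the corpus
    terms = [t for t in query.lower().split() if t in keyword_set]
    # Pull every paper_id that is indexed under any of those terms
    matches = doc_finder.keyword_index[doc_finder.keyword_index['keyword'].isin(terms)]
    return matches['paper_id'].unique()

find_papers("information extraction")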

I recognize that an SQL table of some kind would probably be a better solution, but I am limited to my local machine and do not know SQL very well. I also found a library called Dask, but I think I would need to rethink how I perform this creation process, since not all of the techniques would transfer. I am a little out of my comfort zone, but this has been a rewarding experience.
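For concreteness, here is an untested sketch of one possible restructuring of the generation loop: the stopword list is built once as a set, per-paper keywords are tracked in a set, and the index rows are collected in a plain Python list so the DataFrame is constructed only once at the end (isHyperlink is the same helper used above). I do not know how much of the 200 hours this would actually recover, which is part of what I am asking:

import pandas as pd
from tqdm import tqdm
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words('english'))  # built once instead of once per token

def make_keywords_fast(corpus, size):
    rows = []            # (paper_id, keyword) pairs, turned into a DataFrame at the end
    keyword_map = set()  # global set of unique keywords
    for _, row in tqdm(corpus.head(size).iterrows(), total=size):
        seen = set()     # unique keywords for the current paper
        for token in str(row['full_text_cleaned']).split(' '):
            if (token not in seen and not token.isnumeric() and token.isalnum()
                    and token not in STOPWORDS and not isHyperlink(token)):
                seen.add(token)
                keyword_map.add(token)
                rows.append((row['paper_id'], token))
    keyword_index = pd.DataFrame(rows, columns=['paper_id', 'keyword'])
    return keyword_index, sorted(keyword_map)

The main differences from my original version are that stopwords.words('english') is no longer called for every token and the DataFrame is no longer grown one row at a time with .loc. Is this the right kind of change, or is there a better approach entirely?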
