I wrote some code based on the TF-IDF algorithm to extract keywords from a very large text. The problem is that I keep getting a division-by-zero error. When I debug the code, everything works perfectly. As soon as I shorten the text (while still keeping the word that causes the problem), it works, so I assume it's a memory problem.
I thought maybe I could read the big text file in chunks (1 KB) instead of reading the whole document at once. Unfortunately, that does not work either. What should I do? (I am using PyCharm on Windows.)
I am a beginner in programming, Python, and NLP, so I would really appreciate any help here.
if __name__ == "__main__":
    final_tf_idf = {}
    with open('spli.txt') as f:
        for piece in read_in_chunks(f):
            #print(piece)
            piece = piece.lower()
            no_punc_words, all_words = text_split(piece)
            no_punc_words, all_words = rm_stop_word(no_punc_words, all_words)
            no_punc_words_freq, all_words_freq = calc_freq(no_punc_words, all_words)
            tf_score = calc_tf_score(no_punc_words_freq)
            idf_score = calc_idf_score(no_punc_words_freq, all_words_freq, piece)
            tf_idf_score = {}
            for k in tf_score:
                tf_idf_score[k] = tf_score[k] * idf_score[k]
            #print(tf_idf_score)
            # accumulate the scores for this chunk; dicts do not support +=,
            # so add per key instead
            for k, v in tf_idf_score.items():
                final_tf_idf[k] = final_tf_idf.get(k, 0) + v
    print(final_tf_idf)
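A division-by-zero in TF-IDF code typically happens in the IDF step, when a word's document frequency in the denominator is 0. A common guard is the "smoothed" IDF formula. This is not the question's `calc_idf_score` (whose body is not shown), just a sketch of the standard guard as an assumption about where the error originates:

```python
import math

def smoothed_idf(total_docs, doc_freq):
    # The +1 terms keep the denominator nonzero even when a word
    # was never seen (doc_freq == 0), avoiding ZeroDivisionError.
    return math.log((1 + total_docs) / (1 + doc_freq)) + 1
```

With plain `math.log(total_docs / doc_freq)`, any word with `doc_freq == 0` (e.g. one that fell on a chunk boundary) raises the error the question describes.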