I compute TF-IDF every day in my PySpark pipeline to evaluate how significant a keyword is within a specific document, and I use the result to generate a summary that feeds my machine learning model. Although the documents in the pipeline change daily, many keywords persist across days. Storing the full history of document frequencies for every keyword is impractical.
How can I approximate or incrementally calculate the IDF score for a given keyword in this scenario?
IDF calculation:
idf(t) = log(N / |{d in D : t in d}|)

where N = |D| is the total number of documents and the denominator is the number of documents that contain the term t (its document frequency).
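For concreteness, here is a minimal pure-Python sketch of the formula above (the toy corpus and the `idf` helper are illustrative, not part of my actual PySpark pipeline):

```python
import math

def idf(term, documents):
    """idf(t) = log(N / df(t)): N is the number of documents,
    df(t) the number of documents containing the term."""
    n = len(documents)
    df = sum(1 for doc in documents if term in doc)
    return math.log(n / df)  # assumes df > 0, i.e. the term occurs somewhere

# Toy corpus: each document is a set of keywords.
docs = [
    {"spark", "pipeline", "keyword"},
    {"keyword", "model"},
    {"summary", "model"},
]

print(idf("keyword", docs))  # log(3/2), since 2 of 3 documents contain it
```

The problem is that computing `df(t)` exactly requires the full document set (or a stored per-keyword count), which is exactly what I cannot keep across days.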