I have this piece of code:
import fitz  # PyMuPDF
from tkinter import filedialog

kwfile = fitz.open(filedialog.askopenfilename())  # the keywords PDF
# the following extracts kwfile content as plain text across all pages:
text = " ".join([page.get_text() for page in kwfile])
keywords = text.replace("\n", " ").split()  # make keywords list
keywords = list(set(keywords))  # drop duplicate keywords
doc = fitz.open(filedialog.askopenfilename())  # open PDF with pymupdf
for page in doc:  # loop through the pages of the PDF
    words = page.get_text("words")  # extract page text by single words
    for word in words:
        if word[4] in keywords:  # item 4 contains actual word text string
            page.add_highlight_annot(word[:4])  # highlight the word
doc.save("markedwords.pdf")
This code needs two PDF files: a keyword PDF and the original PDF. When you run it, it searches the original PDF for every word that appears in the keyword PDF. At the end it saves a copy of the original PDF with all the words it found highlighted in yellow.
Now I need help with something: is it possible to exclude certain words, so that they are never marked? Sometimes words like "the", "for", "and" and "but" get marked, and I do not want that.
Disclaimer: I am the author of borb, the library used in this answer.
I would split the problem into 3 parts:
Step 1: Get the text from a PDF
Step 2: Decide which words you would like to mark
This is up to you. There are various algorithms to decide which words are keywords in a document. Some of them are even implemented in borb already. You can find those here.
There are also GitHub repositories with plaintext files containing taboo/stopwords. You can include such a list in your code to avoid marking words like "for" and "the".
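As a minimal sketch of the stopword idea (not tied to borb or PyMuPDF), the keyword list could be filtered before any marking happens. The names `STOPWORDS`, `normalize` and `filter_keywords` are hypothetical, and the tiny hard-coded set only stands in for a real stopword file:

```python
import string

# Hypothetical, deliberately tiny stopword list. A real one would be loaded
# from a plaintext file with one word per line.
STOPWORDS = {"the", "for", "and", "but"}


def normalize(word):
    """Lowercase a word and strip surrounding punctuation for comparison."""
    return word.strip(string.punctuation).lower()


def filter_keywords(keywords, stopwords=STOPWORDS):
    """Return the set of normalized keywords that are not stopwords."""
    normalized = {normalize(w) for w in keywords}
    return {w for w in normalized if w and w not in stopwords}


keywords = filter_keywords(["The", "invoice", "for", "Payment,", "and", "but"])
print(sorted(keywords))  # ['invoice', 'payment']
```

With the code from the question, the membership test would then compare `normalize(word[4])` against this filtered set, so that "The" and "the" are treated alike.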
Step 3: Mark those words in the PDF
Marking words (or any content, really) in a PDF can be done using so-called "annotations". You can think of an annotation as "anything you would add after creation to an existing document".
Annotations can be:
You could also (but this is significantly harder) modify the page itself, such that rather than drawing text at a given location, you add instructions to first draw a highlighter-colored box underneath the text.
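Whichever route you take, the first thing to work out is where each word to be marked sits on the page. A rough sketch of that selection step, assuming the word tuples have the layout PyMuPDF's `page.get_text("words")` returns in the question (`x0, y0, x1, y1, text, ...`); the function name `rects_to_highlight` and the sample data are made up for illustration:

```python
def rects_to_highlight(words, keywords):
    """Return the bounding boxes (x0, y0, x1, y1) of words found in keywords."""
    wanted = {k.lower() for k in keywords}
    return [tuple(w[:4]) for w in words if w[4].lower() in wanted]


# Made-up sample data in the layout (x0, y0, x1, y1, text, block, line, word_no)
sample_words = [
    (72.0, 72.0, 90.0, 84.0, "The", 0, 0, 0),
    (94.0, 72.0, 130.0, 84.0, "invoice", 0, 0, 1),
]

print(rects_to_highlight(sample_words, {"invoice"}))
# [(94.0, 72.0, 130.0, 84.0)]
```

Each returned rectangle can then be handed to whatever creates the highlight, for example `page.add_highlight_annot(...)` in the question's code.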
If you want more information about adding annotations to a PDF, you can find it here.