Searching for matching words in a PDF using page.search_for

I have a list of words that I am searching for in a PDF document using fitz (PyMuPDF) in Python. The code works for most of the words, except for a few such as "efficiency".

My code is given below:

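        # Whole-word, case-insensitive check on the page text (optional plural "s");
        # re.escape keeps regex metacharacters in the phrase from being interpreted
        if re.search(rf'\b{re.escape(phrase.casefold())}s?\b', mpage.casefold()):
            # Locate the phrase on the page and return the hits as quads
            text_instances = page.search_for(phrase, quads=True)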

This code works for almost all words, but not for some, e.g. "efficiency". For that word, the if condition matches, but page.search_for returns no matches. In the image below, the first and second "f" of "efficiency" appear to be in different fonts. Is that why the word is not matched?

[Image: the word "efficiency" as it appears in the PDF]

There is 1 best solution below

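I found the solution. As far as I understand it, search_for keeps ligatures by default, so a ligature glyph such as "ffi" is treated as a single character and the plain letters of the phrase do not match it. To disregard ligatures, set flags=0: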

        text_instances = page.search_for(phrase, flags=0, quads=True)
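To show why a ligature can defeat a plain text match, here is a small sketch that uses only the standard library, not PyMuPDF. It assumes the PDF stores "efficiency" with the single character U+FB03 (LATIN SMALL LIGATURE FFI); the `extracted` string and the `expand_ligatures` helper are made up for illustration and are not part of fitz.

```python
import unicodedata

# Hypothetical extracted text: "e" + U+FB03 (ligature "ffi") + "ciency"
extracted = "e\ufb03ciency"
phrase = "efficiency"


def expand_ligatures(text: str) -> str:
    """Expand compatibility characters such as ligatures via NFKC normalization."""
    return unicodedata.normalize("NFKC", text)


# The ligature is one character, so the plain-letter phrase is not a substring
found_raw = phrase in extracted

# After expansion the ligature becomes the three letters "f", "f", "i"
found_expanded = phrase in expand_ligatures(extracted)

print(found_raw, found_expanded)
```

Passing flags=0 to search_for has a comparable effect inside PyMuPDF, since it switches off ligature preservation. It also switches off the other default flags, which may matter if you rely on them.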

This link helped me find the solution: https://github.com/pymupdf/PyMuPDF/issues/1503

Thanks to @jorj-mckie: https://stackoverflow.com/users/4474869/jorj-mckie