PyMuPDF: highlight differences between 2 PDF pages


I am trying to compare two PDF pages, p1 and p2, and highlight the differences in p1.

Algorithm:

1. Get text_blocks with bounding_box from each_page
2. Compare text_blocks of p1 with p2
3. For every text_block that is different, use its bounding_box to highlight the difference

Code:

def get_text_blocks(page):
    """Return the text and the bounding box of every block on the page."""
    blocks = []
    blocks_bbox = []
    # each block is a tuple: (x0, y0, x1, y1, text, block_no, block_type)
    for block in page.get_text_blocks():
        # appending the bounding box of the block
        blocks_bbox.append(block[0:4])
        # appending the text from the block
        blocks.append(block[4])
    return blocks, blocks_bbox

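Difference code:

import fitz

blocks_1, bboxes_1 = get_text_blocks(p1)
blocks_2, _ = get_text_blocks(p2)
for text, bounding_box in zip(blocks_1, bboxes_1):
    # a block is a difference if it is IN p1 and NOT IN p2
    if text not in blocks_2:
        # use the bounding_box of the difference block to highlight it
        rect = fitz.Rect(bounding_box)
        annot = p1.add_highlight_annot(rect)
        annot.update()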

This works. However, in some cases identical content is grouped into different text blocks on the two pages, so the comparison highlights text that has not actually changed.

Example:

p1:

block_1: line1, line2
block_2: line3

p2:

block_1: line1, line2, line3

Although the same three consecutive lines (line1, line2, line3) are present on both p1 and p2, they are flagged as different because the blocks are grouped differently.
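The example above can be reproduced without any PDF. This is a minimal sketch in plain Python; `p1_blocks`, `p2_blocks` and both helper functions are hypothetical names used only for illustration, and each block is modelled as a list of lines. It shows that comparing whole blocks flags both p1 blocks, while comparing the flattened lines flags nothing.

```python
# Hypothetical data mirroring the example: each block is a list of lines.
p1_blocks = [["line1", "line2"], ["line3"]]
p2_blocks = [["line1", "line2", "line3"]]


def block_level_diff(blocks_a, blocks_b):
    """Blocks of a whose joined text does not appear as a block in b."""
    texts_b = {"\n".join(block) for block in blocks_b}
    return [block for block in blocks_a if "\n".join(block) not in texts_b]


def line_level_diff(blocks_a, blocks_b):
    """Lines of a that appear in no block of b, ignoring block grouping."""
    lines_b = {line for block in blocks_b for line in block}
    return [line for block in blocks_a for line in block if line not in lines_b]


print(block_level_diff(p1_blocks, p2_blocks))  # [['line1', 'line2'], ['line3']]
print(line_level_diff(p1_blocks, p2_blocks))   # []
```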

I also tried extracting the text with get_text and comparing it line by line, but that did not work either.

Any suggestions on how to fix this would be helpful.
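One possible direction, sketched under assumptions rather than tested against real PDFs: compare at line granularity while keeping each line's bounding box, so the block grouping no longer matters. The sketch assumes the nested structure returned by `page.get_text("dict")` (blocks containing lines, lines containing spans, each line with a `bbox`). `extract_lines` and `changed_line_bboxes` are hypothetical helper names, and `difflib.SequenceMatcher` is a swapped-in technique for aligning the two line sequences, not something from the code above.

```python
from difflib import SequenceMatcher


def extract_lines(page_dict):
    """Flatten a page.get_text("dict") result into (text, bbox) pairs, one per line."""
    lines = []
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # skip image blocks, keep text blocks
            continue
        for line in block.get("lines", []):
            text = "".join(span["text"] for span in line["spans"]).strip()
            if text:
                lines.append((text, tuple(line["bbox"])))
    return lines


def changed_line_bboxes(lines_1, lines_2):
    """Bounding boxes of lines in lines_1 with no in-order match in lines_2."""
    texts_1 = [text for text, _ in lines_1]
    texts_2 = [text for text, _ in lines_2]
    matcher = SequenceMatcher(None, texts_1, texts_2, autojunk=False)
    bboxes = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "replace" and "delete" cover lines present in p1 but not matched in p2
        if tag in ("replace", "delete"):
            bboxes.extend(bbox for _, bbox in lines_1[i1:i2])
    return bboxes


# Assumed usage with PyMuPDF (not executed here):
#   lines_1 = extract_lines(p1.get_text("dict"))
#   lines_2 = extract_lines(p2.get_text("dict"))
#   for bbox in changed_line_bboxes(lines_1, lines_2):
#       p1.add_highlight_annot(fitz.Rect(bbox)).update()
```

Because the lines are matched in sequence, a line that was merely moved to a different position may still be reported as changed. Exact string matching is also sensitive to whitespace and to line wrapping that differs between the two pages.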
