I am trying to compare two PDF pages, p1 and p2,
and highlight the differences in p1.
Algorithm:
1. Get text_blocks with bounding_box from each_page
2. Compare text_blocks of p1 with p2
3. For every text_block that is different, use its bounding_box to highlight the difference
Code:
def get_text_blocks(page):
    """Return the text and the bounding box of every block on the page."""
    texts = []
    bboxes = []
    # each block is a tuple: (x0, y0, x1, y1, text, block_no, block_type)
    for block in page.get_text("blocks"):
        # appending the bounding box of the block
        bboxes.append(block[0:4])
        # appending the text from the block
        texts.append(block[4])
    return texts, bboxes
Difference pseudo_code:
diff = [list of text_blocks IN p1 and NOT IN p2]
for each_diff in diff:
    # get the bounding_box of the difference block
    rect = fitz.Rect(bounding_box)
    annot = p1.add_highlight_annot(rect)
    annot.update()
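The difference step in the pseudo_code above could be sketched in plain Python as follows. This is a minimal sketch under assumptions: `blocks_not_in_other` is a hypothetical helper name, the inputs are parallel lists of block texts and bounding boxes like the ones `get_text_blocks` returns, and the sample data is made up.

```python
def blocks_not_in_other(texts_1, bboxes_1, texts_2):
    """Return bounding boxes of p1 blocks whose text does not occur on p2.

    Hypothetical helper: texts_1 and bboxes_1 are parallel lists for p1,
    texts_2 is the list of block texts for p2.
    """
    # Compare whitespace-stripped text so trailing newlines do not matter.
    seen_on_p2 = {text.strip() for text in texts_2}
    return [
        bbox
        for text, bbox in zip(texts_1, bboxes_1)
        if text.strip() not in seen_on_p2
    ]


# Made-up sample data standing in for two extracted pages.
p1_texts = ["Title\n", "Body A\n"]
p1_bboxes = [(10, 10, 200, 30), (10, 40, 200, 60)]
p2_texts = ["Title\n", "Body B\n"]

changed = blocks_not_in_other(p1_texts, p1_bboxes, p2_texts)
print(changed)  # [(10, 40, 200, 60)]
```

Each returned tuple can then be turned into a `fitz.Rect` and passed to `p1.add_highlight_annot`, as in the pseudo_code.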
This works. But in certain cases, even though the contents
are identical,
they get grouped into different text blocks,
so the comparison highlights text that has not actually changed.
Example:
p1:
    block_1: line1, line2
    block_2: line3
p2:
    block_1: line1, line2, line3
Though the identical 3 lines (back-to-back), line1, line2 and line3,
are present in both pages p1 and p2,
they get flagged as different
because the blocks are grouped differently.
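The mismatch in this example can be reproduced with plain strings, without any PDF. The sketch below is only an illustration: the block texts are made up to mirror the example, and `split_lines` is a hypothetical helper. It shows that the block-level comparison flags both p1 blocks, while the same content compared as flat lines shows no difference.

```python
def split_lines(block_texts):
    """Flatten block texts into a list of non-empty, stripped lines."""
    return [
        line.strip()
        for text in block_texts
        for line in text.splitlines()
        if line.strip()
    ]


# Made-up block texts mirroring the example above.
p1_blocks = ["line1\nline2\n", "line3\n"]
p2_blocks = ["line1\nline2\nline3\n"]

# Block-level comparison: neither p1 block equals the single p2 block.
flagged_blocks = [b for b in p1_blocks if b not in p2_blocks]

# Line-level comparison: the grouping into blocks no longer matters.
p2_lines = set(split_lines(p2_blocks))
flagged_lines = [ln for ln in split_lines(p1_blocks) if ln not in p2_lines]

print(len(flagged_blocks))  # 2
print(flagged_lines)        # []
```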
I also tried extracting the text with get_text
and comparing it line by line,
but that approach is not working either.
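For reference, a line-by-line comparison that keeps the bounding boxes might look like the sketch below. It is an assumption about how such a comparison could be done, not the code that was tried. It relies on the nested blocks, lines, spans layout that `page.get_text("dict")` returns in PyMuPDF, but runs here on hand-built dictionaries so no PDF is needed. `extract_lines`, `changed_line_bboxes` and `make_line` are hypothetical names, and `difflib.SequenceMatcher` from the standard library is used so that the order of the lines is taken into account.

```python
from difflib import SequenceMatcher


def extract_lines(page_dict):
    """Return (text, bbox) for every text line, ignoring block grouping."""
    lines = []
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so they are skipped here.
        for line in block.get("lines", []):
            text = "".join(span["text"] for span in line["spans"]).strip()
            if text:
                lines.append((text, tuple(line["bbox"])))
    return lines


def changed_line_bboxes(dict_1, dict_2):
    """Bounding boxes of p1 lines that are replaced or missing in p2."""
    lines_1 = extract_lines(dict_1)
    texts_1 = [text for text, _ in lines_1]
    texts_2 = [text for text, _ in extract_lines(dict_2)]
    matcher = SequenceMatcher(None, texts_1, texts_2, autojunk=False)
    boxes = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            boxes.extend(bbox for _, bbox in lines_1[i1:i2])
    return boxes


def make_line(text, y):
    """Build one hand-made line entry (sample data only)."""
    bbox = (10, y, 200, y + 10)
    return {"bbox": bbox, "spans": [{"text": text, "bbox": bbox}]}


# p1 has two blocks, p2 has one; only p2_edit changes the third line.
p1 = {"blocks": [{"lines": [make_line("line1", 0), make_line("line2", 12)]},
                 {"lines": [make_line("line3", 24)]}]}
p2_same = {"blocks": [{"lines": [make_line("line1", 0), make_line("line2", 12),
                                 make_line("line3", 24)]}]}
p2_edit = {"blocks": [{"lines": [make_line("line1", 0), make_line("line2", 12),
                                 make_line("lineX", 24)]}]}

print(changed_line_bboxes(p1, p2_same))  # []
print(changed_line_bboxes(p1, p2_edit))  # [(10, 24, 200, 34)]
```

With real pages, the two dictionaries would come from `p1.get_text("dict")` and `p2.get_text("dict")`, and each returned box could be highlighted the same way as in the pseudo_code.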
Any suggestions on how to fix this would be helpful.