I have a PDF document with a few hyperlinks in it, and I need to extract the text that each URL is attached to. I have tried PyPDF2 and PyPDF4.
I am able to extract the URLs, but not the string that carries the link.
For example, the PDF has text that says "Check this link out", with a link attached to it. I am able to extract the link https://stackoverflow.com, but I also need "Check this link out".
import PyPDF4
import requests

# Open the PDF file
pdf_file = open('abc.pdf', 'rb')
# Create a PDF reader object
pdf_reader = PyPDF4.PdfFileReader(pdf_file)

# Loop through each page of the PDF
for page_num in range(len(pdf_reader.pages)):
    # Get the page object
    page = pdf_reader.pages[page_num]
    # Extract the annotations from the page
    annotations = page.get('/Annots')
    # If there are no annotations, skip to the next page
    if not annotations:
        continue
    # Loop through each annotation
    for annotation in annotations:
        # Get the annotation dictionary
        annotation_dict = annotation.getObject()
        # If the annotation is a link, extract the URL and its associated string
        if annotation_dict.get('/Subtype') == '/Link':
            url_dict = annotation_dict.get('/A')
            if url_dict is not None:
                url = url_dict.get('/URI')
                url_string = annotation_dict.get('/Contents')
                if url is not None:
                    # Check if the URL is working or broken
                    try:
                        response = requests.get(url)
                        if response.status_code == 200:
                            print(f"Page {page_num + 1}: URL - {url}\nString - {url_string}\nWorking fine!")
                        else:
                            print(f"Page {page_num + 1}: URL - {url}\nString - {url_string}\nBroken!")
                    except requests.exceptions.RequestException as e:
                        print(f"Page {page_num + 1}: URL - {url}\nString - {url_string}\nBroken! Error: {e}")

# Close the PDF file
pdf_file.close()
With the script above, I am currently getting the following result for the string:
String - None
I also tried all the code available here:
The process is theoretically simple whichever application you use; the problem is finding the interrelationships.
First the contents need decoding so that they can be searched for the URI data. Then that entry needs to be linked back to its location on the page surface (in this case two words, then one). But how do we know it is that location? The URI entry does not name a page, as it is page-less. So we backtrack from where the first URI is referenced:
/Annots[46 0 R is included in this page, 42 0 obj. Likewise, that page is listed as the second entry of
/Kids[3 0 R 42 0 R among the pages. So now we know we are looking for those words on page 2 at that location. And to avoid doing the same all over again (as human intuition would), if the next URI is slightly lower, it must be the lower word location.
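The backtracking just described can be sketched in plain Python. The object table here is a hypothetical, hand-written stand-in for a decoded PDF: only the object numbers 3, 42 and 46 and the /Kids order come from the example, while the /Pages object number, the URI and the flat (non-nested) page tree are assumptions. A real library resolves these references for you.

```python
# Hypothetical stand-in for a decoded PDF object table (not a real parser).
# Keys are object numbers; object 2 and the URI are invented for illustration.
objects = {
    2: {"Type": "/Pages", "Kids": [3, 42]},
    3: {"Type": "/Page", "Annots": []},
    42: {"Type": "/Page", "Annots": [46]},
    46: {"Subtype": "/Link", "A": {"URI": "https://example.com"}},
}


def page_number_of_annotation(objects, pages_obj, annot_obj):
    """Return the 1-based page number whose /Annots lists annot_obj, else None.

    Assumes a flat page tree, i.e. every entry in /Kids is a page.
    """
    for index, page_obj in enumerate(objects[pages_obj]["Kids"], start=1):
        if annot_obj in objects[page_obj].get("Annots", []):
            return index
    return None


page_no = page_number_of_annotation(objects, pages_obj=2, annot_obj=46)
print(page_no)
```

Real page trees can nest /Pages nodes inside /Kids, so a production version would need to recurse.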
Thus the text at that lower location should be Enhancements.
So whether it is C#, CMD, JS, VBA, Python or PyPDF, the loops are just the same as the ones a human would follow.
The greater remaining challenge is defining a copy-and-paste function at that location on page 2, and that is where PyMuPDF may have better rect handling. Beware, however, that the Y units there may need reversing, which raises yet another challenge.
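The Y reversal comes from the PDF file format measuring Y upward from the bottom-left corner, while MuPDF-based tools measure Y downward from the top-left. A minimal sketch of the conversion, assuming an unrotated page with no crop-box offset (the helper name and the 792 pt US Letter height are illustrative assumptions):

```python
def flip_rect(pdf_rect, page_height):
    """Convert (x0, y0, x1, y1) from a bottom-left origin to a top-left origin.

    Assumes the page is not rotated and its box starts at (0, 0).
    """
    x0, y0, x1, y1 = pdf_rect
    # The old top edge (y1) becomes the new, smaller Y; the old bottom (y0) the larger.
    return (x0, page_height - y1, x1, page_height - y0)


# Hypothetical link rectangle on a 792 pt tall page.
flipped = flip_rect((100, 700, 200, 720), 792)
print(flipped)
```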
As an example, here is that area as reported by a MuTool "trace" of page 2. We see X=132.6; however, if we searched for exactly Y=305.141 we would miss Y=307.299. Aligning Y is therefore a question of setting a known tolerance or range, for example 290 to 310.
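A range test of that kind might look like the sketch below. The word list is made-up sample data (only the X and Y figures come from the trace values quoted above), and the helper name and the 1.0 unit X tolerance are assumptions.

```python
def words_near(words, x, y_low, y_high, x_tol=1.0):
    """Return the text of words whose X is within x_tol of x and whose Y lies in [y_low, y_high].

    words is an iterable of (x, y, text) tuples.
    """
    return [
        text
        for (wx, wy, text) in words
        if abs(wx - x) <= x_tol and y_low <= wy <= y_high
    ]


# Hypothetical word positions; "target" sits at the Y reported for the text.
words = [
    (132.6, 307.299, "target"),
    (132.6, 420.0, "elsewhere"),
    (60.0, 305.0, "margin"),
]

exact = [text for (wx, wy, text) in words if wy == 305.141]  # exact match finds nothing
banded = words_near(words, x=132.6, y_low=290, y_high=310)
print(exact, banded)
```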
So we could use a command line to read and display those values. This is not the easiest way; it just shows that it is possible. There are simpler ways to program this using Python libraries directly.
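As one possible library route, here is a sketch built around PyMuPDF's `page.get_links()` and `page.get_textbox(rect)`. It is not tested against the asker's file; the function names are illustrative, and the core loop only assumes page objects that offer those two methods, so it can be tried without PyMuPDF installed.

```python
def link_texts(doc):
    """Yield (page_number, uri, text) for every URI link in doc.

    doc is expected to behave like a PyMuPDF Document: iterating it yields
    pages that provide get_links() and get_textbox(rect).
    """
    for page_number, page in enumerate(doc, start=1):
        for link in page.get_links():
            uri = link.get("uri")
            if not uri:
                continue  # e.g. internal jumps, which carry no URI
            # "from" is the link's hot area on the page; read the text under it.
            text = page.get_textbox(link["from"]).strip()
            yield page_number, uri, text


def link_texts_from_file(path):
    """Open path with PyMuPDF (imported lazily) and collect all results."""
    import fitz  # PyMuPDF

    with fitz.open(path) as doc:
        return list(link_texts(doc))
```

Calling `link_texts_from_file("abc.pdf")` would then return tuples such as page number, URL and the text under the link. The link rectangle can be slightly tighter or looser than the glyphs, so the returned text may need the rectangle padded or trimmed.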
And thus we come to the simplest answer of all, which is to search an HTML reproduction of the page(s) and tidy up that result by splitting off the unwanted head and tail, then replace the
 with space characters.
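A sketch of that HTML route using only the standard library follows. It uses a parser rather than the string splitting described above, and it assumes the HTML reproduction keeps each link as an `<a href="...">` element, which depends on the converter used. Treating `<br>` and non-breaking spaces as plain spaces is also an assumption about what the converter emits.

```python
from html.parser import HTMLParser


class AnchorText(HTMLParser):
    """Collect (href, visible text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.pairs = []
        self._href = None
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._chunks = []
        elif tag == "br" and self._href is not None:
            self._chunks.append(" ")  # a line break inside a link becomes a space

    def handle_data(self, data):
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            raw = "".join(self._chunks).replace("\xa0", " ")
            self.pairs.append((self._href, " ".join(raw.split())))
            self._href = None


def anchor_texts(html):
    """Return a list of (href, text) pairs found in html."""
    parser = AnchorText()
    parser.feed(html)
    parser.close()
    return parser.pairs


# Hypothetical HTML as a PDF-to-HTML converter might produce it.
sample = (
    "<html><head><title>x</title></head><body><p>Intro "
    '<a href="https://stackoverflow.com">Check&nbsp;this<br/>link out</a>'
    " tail</p></body></html>"
)
print(anchor_texts(sample))
```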