Searching inside the metadata of the PDF documents


I have been using Google custom search API for the following task:

  • Search for certain keywords with "filetype:pdf"

This works as expected, but it only searches within the text content of the PDF documents. I am trying to search within the metadata of the PDF documents, or within their content streams. I have searched a lot and I think there is no way to do this with Google. Is there any other search engine that would let me achieve this?

Thank you


There is 1 best solution below


I found this on GitHub, but the repo has been archived. It uses different combinations and approaches. The script is no longer updated, but I think that if you modify it to use:

selenium PyPDF2 PyMuPDF json

and some regex techniques,

you can get there.

https://github.com/TebbaaX/Katana

The PyMuPDF, PyPDF2 and bs4 (BeautifulSoup) packages are here:

https://pypi.org/project/PyMuPDF/

https://pypi.org/project/PyPDF2/

https://pypi.org/project/BeautifulSoup/

I don't know if this helps, but logically you have to scrape the files yourself and then run an analysis on them to extract the metadata.
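
Here is a minimal sketch of that idea. It assumes you already have a list of PDF URLs (for example from the Custom Search results) and that PyMuPDF is installed. The function names are hypothetical and are not taken from the Katana script. The only PyMuPDF calls used are `fitz.open` and the `metadata` property.

```python
import re
from urllib.request import urlopen


def metadata_matches(metadata, pattern):
    """Return the metadata fields whose string value matches the regex."""
    regex = re.compile(pattern, re.IGNORECASE)
    return {
        key: value
        for key, value in (metadata or {}).items()
        if isinstance(value, str) and regex.search(value)
    }


def read_pdf_metadata(url):
    """Download one PDF and return its document-information dictionary."""
    import fitz  # PyMuPDF, imported lazily so the matcher works without it

    with urlopen(url, timeout=30) as response:
        data = response.read()
    doc = fitz.open(stream=data, filetype="pdf")
    try:
        # Keys include 'title', 'author', 'subject', 'keywords', 'producer'.
        return dict(doc.metadata or {})
    finally:
        doc.close()


def search_pdf_metadata(urls, pattern):
    """Map each URL to its matching metadata fields (hypothetical helper)."""
    hits = {}
    for url in urls:
        try:
            matched = metadata_matches(read_pdf_metadata(url), pattern)
        except Exception as exc:  # network errors, broken PDFs, etc.
            print(f"skipped {url}: {exc}")
            continue
        if matched:
            hits[url] = matched
    return hits
```

You would call it as `search_pdf_metadata(list_of_urls, r"some keyword")`. Searching the content streams instead would need a different step, such as extracting the page text or raw streams with PyMuPDF and running the same regex over that.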