I have been using the Google Custom Search API for the following task:
- Search for certain keywords with "filetype:pdf"
This works as expected, but it only searches within the text content of the PDF documents. I want to search within the metadata of the PDF documents, or within their content streams. I have searched a lot, and I think there is no way to do this with Google. Is there any other search engine that would let me do this?
Thank you
I found this script on GitHub (linked below), but the repo has been archived. It uses several different combinations and approaches. The script is no longer updated, but I think you can get there by modifying it and using:
- selenium
- PyPDF2
- PyMuPDF
- json
- regex techniques
- and others
https://github.com/TebbaaX/Katana
Along with selenium, these packages are on PyPI:
- PyMuPDF: https://pypi.org/project/PyMuPDF/
- PyPDF2: https://pypi.org/project/PyPDF2/
- bs4 (BeautifulSoup): https://pypi.org/project/BeautifulSoup/
I don't know if this will help you, but logically you have to download the files yourself and then analyze them to extract the metadata.
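To make the "analyze them" step concrete, here is a minimal sketch. It assumes the PDFs have already been downloaded to disk and that PyMuPDF is installed. The function names `read_pdf_metadata` and `metadata_matches` are made up for illustration and are not taken from the Katana script.

```python
import re


def read_pdf_metadata(path):
    """Return the document-info metadata of a local PDF as a dict."""
    # Imported here so the matching helper below also works
    # on machines where PyMuPDF is not installed.
    import fitz  # PyMuPDF

    with fitz.open(path) as doc:
        return dict(doc.metadata or {})


def metadata_matches(metadata, pattern):
    """Return only the metadata fields whose text matches the regex."""
    regex = re.compile(pattern, re.IGNORECASE)
    return {
        key: value
        for key, value in metadata.items()
        # Some fields can be None, so only string values are searched.
        if isinstance(value, str) and regex.search(value)
    }
```

For a downloaded file you would call something like `metadata_matches(read_pdf_metadata("report.pdf"), "keyword")`, where `report.pdf` is a placeholder path. An empty dict means the keyword does not appear in any metadata field.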