Using Python to search for hidden data


I have a possibly weird question. I have searched a million ways for a solution without success, and I'm hoping that's simply because I don't know how to phrase it for Google.

We've recently discovered a problem with how PDF documents were redacted. The redaction was done in a way meant to keep them readable and thus searchable by AI.

However, this leaves sensitive data that is hidden from us but could still be extracted by the wrong people. Those of you working in cybersecurity know exactly how, I know :)

We can see that several types of data come up in a partial sanitization process, but we cannot view the data itself within Adobe.

My boss wants me to find out what's in these different subtypes of what Adobe views as "sensitive" data, to see if we need to run this on thousands upon thousands of previously processed PDF documents (personally, I'm in the better-safe-than-sorry camp, but...)

I'm currently using Python and PyPDF2 to practice on a report I created on my own computer. The problem I'm running into is that I'm not searching for specific data. It's more like I'm searching for all the data under a tag, without knowing which tag Adobe uses (if that makes any sense). For instance, Adobe's categories are: "Metadata", "Bookmarks", "Comments and Markup", "Hidden Text", "Links, actions and javascripts", "Overlapping objects".

How can I use Python to search for the data in these tags? They don't readily display.
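To make the question more concrete, here is a rough sketch of the kind of first-pass check I have in mind. It uses only the standard library and counts PDF name tokens in the raw bytes of a file. The mapping from Adobe's category labels to tokens (CATEGORY_TOKENS) is my own assumption and is incomplete: "Hidden Text" and "Overlapping objects" have no single key, so they are left out. Tokens stored inside compressed object streams, or written with #xx escapes, would not be found this way. The function names are made up for illustration.

```python
import re

# Assumed (not official) mapping from Adobe's category labels to PDF
# name tokens that commonly hold that kind of data.
CATEGORY_TOKENS = {
    "Metadata": [b"/Metadata", b"/Info"],
    "Bookmarks": [b"/Outlines"],
    "Comments and Markup": [b"/Annots"],
    "Links, actions and javascripts": [
        b"/URI", b"/OpenAction", b"/AA", b"/JavaScript", b"/JS",
    ],
}


def count_tokens(pdf_bytes):
    """Count how often each token appears in the raw bytes of a PDF."""
    counts = {}
    for category, tokens in CATEGORY_TOKENS.items():
        per_token = {}
        for token in tokens:
            # The lookahead keeps /JS from matching longer names like /JSON.
            pattern = re.escape(token) + rb"(?![A-Za-z0-9])"
            per_token[token.decode()] = len(re.findall(pattern, pdf_bytes))
        counts[category] = per_token
    return counts


def scan_file(path):
    """Read a PDF from disk and return its token counts."""
    with open(path, "rb") as f:
        return count_tokens(f.read())
```

Calling scan_file("MRA.pdf") would return a nested dict of counts per category. A non-zero count only says the structure is present, not what it contains.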

Thanks in advance!

This is the bare-bones script to read the PDF that everything is based on:

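import PyPDF2

# Legacy PyPDF2 (< 3.0) names; 3.x renames these to PdfReader,
# len(reader.pages), reader.pages[0] and extract_text().
pdfFileObj = open("MRA.pdf", "rb")
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

# Number of pages in the document
print(pdfReader.numPages)

# getPage (not getPages) returns a single page by zero-based index
pageObj = pdfReader.getPage(0)
print(pageObj.extractText())

pdfFileObj.close()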

We've tried doing keyword searches in individual documents where we know sensitive data exists because it came up in an initial partial sanitization review in Adobe DC. That code is long and not available as it's on my work computer and involves keywords I can't share.
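Since keyword searches only work when the keywords are already known, a complementary check for the "Hidden Text" category might look like the sketch below. It assumes hidden text is drawn with text rendering mode 3 (the "3 Tr" operator, which paints nothing), and that content streams are either uncompressed or Flate-compressed without extra filters. Both are assumptions on my part. Mode 3 is also used legitimately for OCR text layers behind scanned images, so a hit is a lead, not proof. The function name is hypothetical.

```python
import re
import zlib

# Match each "stream ... endstream" body; the lookbehind skips the
# word "stream" inside "endstream".
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)endstream", re.DOTALL)

# "3 Tr" selects text rendering mode 3 (invisible text). The lookbehind
# avoids matching operands such as "13 Tr" or "0.3 Tr".
INVISIBLE_TEXT_RE = re.compile(rb"(?<![\d.])3\s+Tr\b")


def streams_with_invisible_text(pdf_bytes):
    """Return the indexes of streams that set text rendering mode 3."""
    hits = []
    for index, match in enumerate(STREAM_RE.finditer(pdf_bytes)):
        raw = match.group(1)
        try:
            # decompressobj tolerates the trailing newline before endstream.
            data = zlib.decompressobj().decompress(raw)
        except zlib.error:
            # Not Flate-compressed (or damaged): scan the bytes as they are.
            data = raw
        if INVISIBLE_TEXT_RE.search(data):
            hits.append(index)
    return hits
```

Reading a file with open(path, "rb").read() and passing the bytes in gives a list of stream positions worth inspecting by hand. Encrypted files and streams using other filters would need a real PDF library.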
