I have two pdf reports with the same format from the same source, the only difference being the report date -- one is for 2016, the other for 2015. Here's how to get the pdf's:
- Go to https://www.ffiec.gov/nicpubweb/nicweb/InstitutionProfile.aspx?parID_Rssd=1039502&parDT_END=99991231
- Select 2016-06-30 and click Create Report next to the fourth report from the top (i.e., Banking Organization Systemic Risk Report (FR Y-15))
- Click Your request for a financial report is ready and download the pdf that opens up
- Repeat steps 1-3 but choose 2015-12-31 instead in step 2
The two pdf's are regulatory filings for JP Morgan. The information I want is the numbers in blue, which can be uniquely identified by the keys to their left. E.g., the first line item on page 2 -- a. Current exposure of derivative contracts -- can be uniquely identified by M337.
Here's what I've tried to get the numbers:
- I opened the two pdf's in Notepad++ and Ctrl-F for "M337". For the 2016 pdf, the string was there and the corresponding number was not far behind. For the 2015 pdf, however, neither the string nor the number could be found
- I opened the pdf's in Python as binary files
with open('2016.pdf', 'rb') as handle: pdf_str = handle.read()
and searched for M337 in pdf_str. The string could be found in 2016.pdf but not in 2015.pdf
- I tried using Adobe Acrobat's Save As Other functionality to save the pdf's as txt's and got the same results -- the string was in 2016.txt but not in 2015.txt
Does anybody know what's going on?
I was able to find the key string and its associated value by running pdftotext on the downloaded PDF file; see my process below.
Keep in mind that PDF is a binary file format, so it cannot reliably be searched for strings without a tool or library made for parsing PDFs. In fact, handle.read() returns a bytes object, not a string, when the file is opened in binary mode. I'm surprised you were able to find M337 in the 2016 file by searching the raw bytes.
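As a rough illustration of the pdftotext approach (a minimal sketch, not necessarily the exact process used for this answer), the text can be extracted and searched from Python. It assumes the pdftotext command-line tool (shipped with Poppler or Xpdf) is installed and on the PATH; the helper names pdf_to_text and lines_with_key are made up for this example, and the file names are the ones from the question.

```python
import subprocess


def pdf_to_text(pdf_path):
    """Return the text of a PDF as a str, using the pdftotext CLI.

    Assumption: pdftotext is installed and available on PATH.
    """
    # "-layout" tries to keep each table row on a single line;
    # "-" as the output file sends the text to stdout.
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    # stdout is bytes, so decode it before doing string searches.
    return result.stdout.decode("utf-8", errors="replace")


def lines_with_key(text, key):
    """Return the stripped lines of the extracted text that contain key."""
    return [line.strip() for line in text.splitlines() if key in line]


# Example call, using the file names from the question:
# for name in ("2016.pdf", "2015.pdf"):
#     print(name, lines_with_key(pdf_to_text(name), "M337"))
```

Searching the decoded text avoids depending on how each PDF happens to store its content internally, which may differ between the two files. How the key and its value are laid out on the matching line depends on the report, so any further parsing of that line would need to be adapted by hand.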