Python failed to extract PDF text

I have two PDF reports with the same format from the same source, the only difference being the report date -- one is for 2016, the other for 2015. Here's how to get the PDFs:

  1. Go to https://www.ffiec.gov/nicpubweb/nicweb/InstitutionProfile.aspx?parID_Rssd=1039502&parDT_END=99991231
  2. Select 2016-06-30 and click "Create Report" next to the fourth report from the top (i.e., Banking Organization Systemic Risk Report (FR Y-15))
  3. Click "Your request for a financial report is ready" and download the PDF that opens
  4. Repeat steps 1-3 but choose 2015-12-31 instead in step 2

The two PDFs are regulatory filings for JP Morgan. The information I want is the numbers in blue, each of which can be uniquely identified by the key to its left. E.g., the first line item on page 2 -- a. Current exposure of derivative contracts -- can be uniquely identified by M337.

Here's what I've tried to get the numbers:

  1. I opened the two PDFs in Notepad++ and searched (Ctrl-F) for "M337". For the 2016 PDF, the string was there and the corresponding number followed shortly after it. For the 2015 PDF, however, neither the string nor the number could be found
  2. I opened the PDFs in Python as binary files

    with open('2016.pdf', 'rb') as handle:
        pdf_str = handle.read()

    and searched for M337 in pdf_str. The string could be found in 2016.pdf but not in 2015.pdf

  3. I tried using Adobe Acrobat's Save As Other functionality to save the PDFs as text files and got the same results -- the string was in 2016.txt but not in 2015.txt

Does anybody know what's going on?

There are 1 best solutions below

I was able to find the key string and its associated value by running pdftotext on the downloaded PDF file. My process is shown below:

$ pdftotext FRY15_1039502_20151231.PDF
$ grep -C 10 'M337' FRY15_1039502_20151231.txt 
b. Regulatory adjustments........................................................................................
4. Other off-balance sheet exposures:
a. Gross notional amount of items subject to a 0% credit conversion factor (CCF) ...............
b. Gross notional amount of items subject to a 20% CCF................................................
c. Gross notional amount of items subject to a 50% CCF................................................
d. Gross notional amount of items subject to a 100% CCF ..............................................
e. Credit exposure equivalent of other off-balance sheet items (sum of 0.1 times item 4.a,
0.2 times item 4.b, 0.5 times item 4.c, and item 4.d) ...................................................
5. Total exposures prior to regulatory deductions (sum of items 1.h, 2.e, 3.a, and 4.e) .............

M337
M339
Y822
M340
Y823
Y824
Y825

71624000
387577000
3535000
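
The same two steps can be driven from Python. What follows is only a minimal sketch: it assumes the Poppler pdftotext binary is installed and on PATH, and the helper names pdf_to_text and lines_with_key are made up for illustration. It passes -layout, which asks pdftotext to preserve the page layout and so tends to keep a key and its value on the same line (not verified against these particular reports), and "-" so the text goes to stdout instead of a .txt file.

```python
import re
import subprocess


def pdf_to_text(pdf_path, layout=True):
    """Return the text that pdftotext extracts from pdf_path.

    Assumption: the Poppler `pdftotext` binary is available on PATH.
    """
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")  # try to preserve the physical page layout
    cmd += [pdf_path, "-"]     # "-" writes the text to stdout
    result = subprocess.run(cmd, check=True, capture_output=True)
    return result.stdout.decode("utf-8", errors="replace")


def lines_with_key(text, key):
    """Return every line of text that contains key as a whole word."""
    pattern = re.compile(r"\b" + re.escape(key) + r"\b")
    return [line for line in text.splitlines() if pattern.search(line)]


if __name__ == "__main__":
    text = pdf_to_text("FRY15_1039502_20151231.PDF")
    for line in lines_with_key(text, "M337"):
        print(line)
```

If the key and the number still end up on different lines, as in the grep output above, the pairing has to be rebuilt from the order of the two columns, which this sketch does not attempt.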

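Keep in mind that PDF is a binary format: page content is usually stored in compressed streams, so the text you see on the page often does not appear literally in the raw bytes, and searching it reliably takes a tool or library that actually parses PDFs. Also, in Python 3, handle.read() on a file opened in binary mode returns a bytes object, not a str, so the search term has to be bytes as well (b'M337'). I'm surprised you were able to find M337 in the 2016 file by searching the raw bytes; my guess is that this file stores that text uncompressed.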
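
One rough way to check whether compression explains the difference between the two files is to inflate each stream and search the result. This is a sketch under assumptions, not a PDF parser: it assumes the streams are plain Flate (zlib) data, that the file is not encrypted, and that the key is written as one contiguous literal string. The function name find_key_in_pdf is hypothetical.

```python
import re
import zlib

# Raw bytes between a "stream" keyword and the next "endstream".
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def find_key_in_pdf(pdf_bytes, key):
    """Return (found_raw, found_in_streams) for the bytes value key.

    found_raw: key occurs literally in the file's bytes.
    found_in_streams: key occurs in at least one stream after inflating it.
    """
    found_raw = key in pdf_bytes
    found_in_streams = False
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the end-of-line bytes that sit
            # between the compressed data and "endstream".
            inflated = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not a Flate stream (or not compressed at all)
        if key in inflated:
            found_in_streams = True
            break
    return found_raw, found_in_streams


if __name__ == "__main__":
    for name in ("2016.pdf", "2015.pdf"):
        with open(name, "rb") as handle:
            print(name, find_key_in_pdf(handle.read(), b"M337"))
```

A result of (False, True) for the 2015 file would be consistent with the text being hidden by compression. A result of (False, False) would not prove the key is absent, since text can also be split into fragments or encoded through a font-specific mapping, which is why a dedicated extractor such as pdftotext is the more dependable route.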