Replace (cid:<number>) with characters using Python when extracting text from PDF files


Here is the link to the PDF. I'm trying to extract the information from it, but the text comes back looking encrypted. Any tips on how to fix this, and can it be fixed at all? :(

import io

import pdfplumber

def engine_stopped(self):
    print(self.snapshot_url)

def parse(self, response):
    # Open the downloaded PDF straight from the response bytes
    pdf = pdfplumber.open(io.BytesIO(response.body))
    extract_text = ""

There are 2 best solutions below

2
Cam On

You can do this very quickly with PyMuPDF.

import fitz  # PyMuPDF

path = r'file_path'
text = ''
doc = fitz.open(path)

# Concatenate the text of the first 74 pages
for page in doc.pages(0, 74):
    text += page.get_text()

print(text)

I tested it with World-Bank-Notes-on-Debarred-Firms-and-Individuals.pdf and it worked fine.

This solution collects the whole document into one large text string.
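As a quick sanity check on whatever extractor you use, you can scan the result for leftover markers. This is only a sketch: it assumes unmapped glyphs show up as the literal text (cid:N), the form quoted in the question title, and the sample string is made up for illustration. The function name unmapped_cids is hypothetical.

```python
import re

# Matches the literal "(cid:N)" placeholder and captures the number N
CID_MARKER = re.compile(r"\(cid:(\d+)\)")

def unmapped_cids(text):
    """Return the CID numbers that still appear as literal (cid:N) markers."""
    return [int(n) for n in CID_MARKER.findall(text)]

# Made-up sample line, not taken from the PDF
sample = "Tengwangge (cid:8296)(cid:8932) Co., Ltd"
print(unmapped_cids(sample))
```

An empty list means no placeholders were left in the extracted text.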

2
K J On

There should be no need to work with the CIDs, as the file is conventional. It is NOT encrypted; it simply mixes many binary encodings, some of which are compressed to keep the file down to a workable size. After inflation (decompression) we see no problems.

For English, the space character's CID has the same value as its Unicode code point, hex 20:

1 beginbfchar
<20> <0020>
endbfchar

For other scripts the ToUnicode values are there too, for example the equivalences for an MS-Mincho font, so there should be no need for the CIDs; just export the Unicode equivalents.

17 beginbfchar
<1613> <5EFA> = 建 
<3748> <9650> = 限
... 
<2068> <6ED5> = 滕
...
<22E4> <738B> = 王
...
<0BF7> <4EFD>
<13F9> <5B8F>
endbfchar
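To show how such a table is applied, here is a minimal Python sketch. It assumes every code in the font is 2 bytes wide and that only bfchar entries are present (no bfrange). The function names are hypothetical, and the sample holds just the six pairs listed above, not the full 17.

```python
import re

# Pairs copied from the bfchar listing above (count omitted, sample only)
SAMPLE_CMAP = """
beginbfchar
<1613> <5EFA>
<3748> <9650>
<2068> <6ED5>
<22E4> <738B>
<0BF7> <4EFD>
<13F9> <5B8F>
endbfchar
"""

def parse_bfchar(cmap_text):
    """Map each source code (as int) to its Unicode string."""
    mapping = {}
    pairs = re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", cmap_text)
    for src, dst in pairs:
        # ToUnicode destinations are UTF-16BE code units written as hex
        mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping

def decode_hex_string(hex_string, mapping, code_bytes=2):
    """Decode a content-stream hex string, assuming fixed-width codes."""
    step = code_bytes * 2
    codes = [int(hex_string[i:i + step], 16)
             for i in range(0, len(hex_string), step)]
    # Codes missing from the table become U+FFFD instead of raising
    return "".join(mapping.get(code, "\ufffd") for code in codes)

mapping = parse_bfchar(SAMPLE_CMAP)
print(decode_hex_string("206822E4", mapping))
```

A real extractor does this per font, since each font resource carries its own ToUnicode table.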

NOTE: there is often no human-rational ordering, such as numeric order; we are dealing with pure binary shown as text. So in the now-decoded body text we find:

/C2_0 1 Tf
0.293 0 Td
<206822E4>Tj                                 = 滕王
/C2_1 1 Tf
<65E8>Tj                                     = 阁
/C2_0 1 Tf
-0.011 Tc 0.011 Tw [<16131564>-11 <379F>]TJ  = 建工   集
/C2_1 1 Tf

These characters can be seen at the end of one line, before the next one starts, and can be extracted as the text string 滕王阁建工集团股份有限公司, which represents Tengwangge Construction Engineering Group Co., Ltd (https://www.qcc.com/firm/42e32d7afd541a697c8ffbe6b3ee23b8.html)

This can easily be seen with pdftotext (a single command line does the export), as here:
pdftotext -layout -enc UTF-8 -fixed 4 "%~1" "%~dpn1-all.txt"


In the comments the word "Console" was mentioned, and there a different mix of problems appears when multiple languages are used: European UTF-8 requirements and CJK requirements overlap, so only one is likely to work at a time. Here the console is set to Chinese, so the European entry looks wrong, and vice versa (the European curly "quote marks" dirty the Chinese). This is down to OS console limitations, which are not present inside the PDF or its text output. (Bottom of page 34)

Basically, poor authoring control for international users.
Use pdftotext -layout -enc UTF-8 -fixed 4 "c:/path/inputfile.pdf" (with other options) and open the text file, not the console.

*258 The period of ineligibility of Mr. Luis Sánchez Santur ("Mr. Sánchez”) ex  
...
*260 The period of ineligibility of Mr. Angel Zambrano Navarro ("Mr. Zambrano”) extends to any legal  
...
*261 The period of ineligibility of Shandong Hualong Landscaping Engineering Co., Ltd. (山东华龙园林  
工程有限公司) (“Shandong Hualong”) extends to any legal entity that it directly or indirectly controls.
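
The same advice applies if you extract the text in Python: write it to a UTF-8 file and open that, rather than printing to the console. A minimal sketch, where the helper name and the output file name are hypothetical and the sample string is taken from entry *261 above:

```python
import tempfile
from pathlib import Path

def save_text_utf8(text, path):
    """Write extracted text to a file with an explicit UTF-8 encoding."""
    path = Path(path)
    path.write_text(text, encoding="utf-8")
    return path

# Mixed Latin and CJK sample from the output above
sample = "Shandong Hualong Landscaping Engineering Co., Ltd. (山东华龙园林工程有限公司)"

# Hypothetical output location in the system temp directory
out = save_text_utf8(sample, Path(tempfile.gettempdir()) / "extracted-sample.txt")
print(out)
```

Passing the encoding explicitly avoids depending on the console or locale default.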