How to solve (cid:x) pdfplumber python text extraction



I've been using the pdfplumber library to extract text from PDF documents, and it has worked fine. However, with the documents I'm working on now, I only get spaces and lots of (cid:x) tokens instead of text. Is there a solution? Thanks.

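import pdfplumber

# fatura: path to the PDF file (defined elsewhere)
with pdfplumber.open(fatura) as pdf:
    lista_paginas = pdf.pages

    fatura_individual = ''
    for pagina in lista_paginas:
        # extract_text() may return None in some versions for a page with no text
        fatura_individual += pagina.extract_text() or ''
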
(cid:12)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0),(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0),(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0),(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0),(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0),(cid:0)(cid:0)(cid:0)(cid:0)(cid:0),(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0),(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0),(cid:0)(cid:0)(cid:0)(cid:0)(cid:0),(cid:0)(cid:0)(cid:0)(cid:0)(cid:0),(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0),(cid:0)(cid:0)(cid:0)(cid:0)(cid:0),(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:16)

I just want to extract the full text.


There are 2 best solutions below


Try PyPDF2: https://pypdf2.readthedocs.io/en/latest/user/extract-text.html

from PyPDF2 import PdfReader

reader = PdfReader("example.pdf")
for page in reader.pages:
    print(page.extract_text())

We switched from PyPDF2 to pdfplumber because PyPDF2 was having problems with some documents. We started from a snippet in this GitHub issue: https://github.com/jsvine/pdfplumber/issues/29

text_str = '(cid:66)'
if 'cid' in text_str.lower():
    text_str = text_str.strip('(')
    text_str = text_str.strip(')')
    ascii_num = text_str.split(':')[-1]
    ascii_num = int(ascii_num)
    text_val = chr(ascii_num)  # 66 = 'B' in ascii

From that we came up with this function. You may need to adjust it for your needs, because we add page-break markers, but the key part is the nested prune_text() helper:

import re
import pdfplumber

def process_pdf_file_without_images(file_path):

    def prune_text(text):

        def replace_cid(match):
            # Treat the CID number as a character code (e.g. 66 -> 'B')
            ascii_num = int(match.group(1))
            try:
                return chr(ascii_num)
            except (ValueError, OverflowError):
                return ''  # Number is not a valid code point: drop the token

        # Regular expression to find all (cid:x) patterns
        cid_pattern = re.compile(r'\(cid:(\d+)\)')
        pruned_text = cid_pattern.sub(replace_cid, text)
        return pruned_text

    with pdfplumber.open(file_path) as pdf:
        content = ""
        for page_number, page in enumerate(pdf.pages, start=1):
            content += f"Page {page_number} Start:\n"
            page_text = page.extract_text(x_tolerance=3, y_tolerance=3)
            if page_text:
                pruned_text = prune_text(page_text)
            else:
                pruned_text = ""
            content += pruned_text
            content += f"\nPage {page_number} End:\n"
        return content
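
To try the replacement step without a PDF, the same idea can be run on a plain string. The sketch below is a standalone variant, not the function above: the name decode_cids is hypothetical, and it additionally drops results that are not printable, on the assumption that control characters such as chr(0) are not wanted in the output. Treating the CID number as a character code is a heuristic; it only helps when the numbers in the document happen to line up with character codes.

```python
import re

# Matches placeholders such as (cid:66) and captures the number.
CID_PATTERN = re.compile(r"\(cid:(\d+)\)")


def decode_cids(text):
    """Replace each (cid:N) with chr(N) when that gives a usable character.

    Returns a tuple (decoded_text, dropped), where dropped counts the
    placeholders that were removed instead of decoded.
    """
    dropped = 0

    def replace(match):
        nonlocal dropped
        try:
            char = chr(int(match.group(1)))
        except (ValueError, OverflowError):
            # The number is outside the valid Unicode range.
            dropped += 1
            return ""
        # Assumption: keep printable characters plus tab and newline only.
        if char.isprintable() or char in "\t\n":
            return char
        dropped += 1
        return ""

    return CID_PATTERN.sub(replace, text), dropped


if __name__ == "__main__":
    decoded, dropped = decode_cids("(cid:72)(cid:105)(cid:0)!")
    print(repr(decoded), dropped)
```

In the output shown in the question almost every token is (cid:0), so nearly all of them would be counted as dropped. That suggests this kind of substitution will not recover the text for that particular document.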