Reading PDF file with Azure Synapse Notebooks

This is my first post asking for help. Until now I have relied on existing examples from Stack Overflow, but this time I can't find an answer. I apologize if the formatting of my post is poor; I will try to improve it in the future.

I am struggling to read PDF files from Azure Data Lake Storage Gen2 in Azure Synapse notebooks.

Update: I am using Azure Synapse inside a virtual network, with private endpoint access to the storage account.

Reading a CSV file is not a problem. I can access it with this command:

%%pyspark
df = spark.read.load('abfss://**accountname**.dfs.core.windows.net/**file.csv**'
## If header exists uncomment line below
##, header=True
)
display(df.limit(10))

But reading a PDF always fails. I tried libraries such as PyPDF2 and Camelot.

import PyPDF2

pdf_file = "abfss://**accountname**.dfs.core.windows.net/**file.pdf**"
# Open the PDF using PyPDF2
pdf_reader = PyPDF2.PdfReader(pdf_file)

I receive an error:

FileNotFoundError: [Errno 2] No such file or directory

I tried mounting the storage location, as described in the post "How can I read pdf or pptx or docx files in python from ADLS gen2 using Synapse?"

I can still read the CSV file from the mounted storage, but not the PDF.

mssparkutils.fs.mount( 
    "abfss://[email protected]/", 
    "/TR", 
    {"LinkedService":"linkedservice"} 
)

# can get a path, this command is working:
path = mssparkutils.fs.getMountPath("TR")
print(path)

import PyPDF2
with open("/synfs/mount#/TR/file.pdf") as f:
    pdf_reader = PyPDF2.PdfReader(f)

Gives an error:

OSError: [Errno 5] Input/output error:

I also tried reading the file through the path returned by getMountPath, but it still does not work.

file_name = path + "/file.pdf"
print(file_name)
reader = PyPDF2.PdfReader(open(file_name, 'rb'))

This gives an error: OSError: [Errno 5] Input/output error

I also tried passing the path directly to PyPDF2:

pdf_reader = PyPDF2.PdfReader(file_name)

This produces the following message (traceback excerpt):

logger_warning(
    "PdfReader stream/file object is not in binary mode. "
    "It may not be read correctly.",
    __name__,

Please advise if you know how to solve this. I am using Azure Synapse Studio, not the SDK.

There are 2 best solutions below

1
JayashankarGS On BEST ANSWER

This error occurs when storage accounts are accessed via Private Endpoints on a Virtual Network.

  • If the Linked Service to Azure Data Lake Storage Gen2 created above uses a managed private endpoint (with a dfs URI), you need to create a second managed private endpoint using the Azure Blob Storage option (with a blob URI), so that the internal fsspec/adlfs code can connect through the BlobServiceClient interface.

Refer to this documentation for more information.
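
To make the two endpoint types concrete: the dfs and blob hostnames of the same storage account differ only in one label. The helper below is a small sketch, assuming the public Azure cloud naming pattern. The function name blob_host_from_dfs and the account name myaccount are made up for illustration and are not part of any Azure SDK.

```python
def blob_host_from_dfs(dfs_host: str) -> str:
    """Return the Blob endpoint host that matches a Data Lake (dfs) endpoint host.

    Assumption: the public-cloud pattern <account>.dfs.core.windows.net.
    """
    account, sep, rest = dfs_host.partition(".dfs.")
    if not sep:
        raise ValueError(f"not a dfs endpoint host: {dfs_host!r}")
    return f"{account}.blob.{rest}"


# 'myaccount' is a hypothetical storage account name
print(blob_host_from_dfs("myaccount.dfs.core.windows.net"))
# myaccount.blob.core.windows.net
```

The managed private endpoint itself is still created in Synapse Studio; this only shows which hostname the second endpoint would serve.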

So, create a Blob Storage private endpoint and try the code below.

import PyPDF2

jobId = mssparkutils.env.getJobId()

path=f"/synfs/{jobId}/TR/pdf/bob.pdf"

print(path)

pdf_reader = PyPDF2.PdfReader(path)
number_of_pages = len(pdf_reader.pages)
page = pdf_reader.pages[0]
text = page.extract_text()
print(text)
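
Note that the path contains the actual job ID, not the literal text "mount#". As a rough sketch, the path construction and the read can be wrapped in two helpers. Both function names are hypothetical; the example assumes the mount point /TR from the question and a PyPDF2 version that provides PdfReader.

```python
import posixpath


def synfs_local_path(job_id, mount_point, relative_path):
    """Build the local file path of a file under a Synapse mount point."""
    return posixpath.join(
        "/synfs", str(job_id), mount_point.strip("/"), relative_path.lstrip("/")
    )


def extract_all_text(local_path):
    """Read every page of a PDF, opening the file explicitly in binary mode."""
    import PyPDF2  # imported lazily so the path helper works without PyPDF2

    with open(local_path, "rb") as f:
        reader = PyPDF2.PdfReader(f)
        # Pages are read lazily, so extract the text while the file is open
        return "\n".join((page.extract_text() or "") for page in reader.pages)


# 42 stands in for the value returned by mssparkutils.env.getJobId()
print(synfs_local_path(42, "/TR", "pdf/bob.pdf"))
# /synfs/42/TR/pdf/bob.pdf
```

In a notebook, the result of synfs_local_path would be passed to extract_all_text once the Blob Storage private endpoint is in place.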

You can also refer to this answer.

If you need a quick workaround in the meantime, you can copy the file to the local file system and read it from there.

import PyPDF2

jobId = mssparkutils.env.getJobId()

path=f"synfs:/{jobId}/TR/pdf/bob.pdf"

mssparkutils.fs.cp(path,"file:/tmp/t_pdf/bob.pdf")
pdf_reader = PyPDF2.PdfReader("/tmp/t_pdf/bob.pdf")
number_of_pages = len(pdf_reader.pages)
page = pdf_reader.pages[0]
text = page.extract_text()
print(text)

Here, I copy the file to the local tmp folder and read it from there.
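
Before parsing the copy, a cheap sanity check can confirm that a real PDF arrived on local disk. The sketch below relies on the fact that a PDF file normally begins with the bytes %PDF-. The helper name looks_like_pdf is hypothetical, and the demo uses a throwaway temporary file in place of the copied document.

```python
import os
import tempfile


def looks_like_pdf(local_path):
    """Return True if the file starts with the PDF header bytes '%PDF-'."""
    with open(local_path, "rb") as f:
        return f.read(5) == b"%PDF-"


# Self-contained demo: a temporary file stands in for /tmp/t_pdf/bob.pdf
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.pdf")
    with open(demo_path, "wb") as f:
        f.write(b"%PDF-1.7\n")
    print(looks_like_pdf(demo_path))  # True
```

If the check returns False, the copy step is the more likely culprit than the PDF library.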

0
Pashket On

I also found another solution that uses Spark's binary file format and a byte array, without mounting the storage.

import PyPDF2
import io
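# PDF file path; containerName and storageName must be defined beforehand
# (in an abfss URI, the part after '@' is the storage account's dfs host)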
pdf_file_path = f"abfss://{containerName}@{storageName}/Folder/test.pdf"

# Load the PDF file using Spark's binaryFile method
pdf_binary_data = spark.read.format("binaryFile").load(pdf_file_path)

# Extract the binary content of the PDF file
pdf_content = pdf_binary_data.select("content").collect()[0][0]

# Convert the binary content to a Python bytearray
pdf_bytearray = bytearray(pdf_content)

# Use PyPDF2 to read the PDF file from the bytearray and extract the total number of pages
pdf_reader = PyPDF2.PdfReader(io.BytesIO(pdf_bytearray))
num_pages = len(pdf_reader.pages)

# Print the total number of pages
print("Total number of pages:", num_pages)

# Get the first page
first_page = pdf_reader.pages[0]

# Extract text
text = first_page.extract_text()
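
The same idea can be condensed into two helpers, shown here as a sketch. The function names are hypothetical. io.BytesIO accepts any bytes-like object, so the explicit bytearray conversion is optional.

```python
import io


def pdf_stream(content):
    """Wrap the bytes of a binaryFile 'content' value in a seekable stream."""
    return io.BytesIO(content)


def extract_text_from_bytes(content):
    """Extract the text of all pages from in-memory PDF bytes."""
    import PyPDF2  # imported lazily so pdf_stream works without PyPDF2

    reader = PyPDF2.PdfReader(pdf_stream(content))
    return "\n".join((page.extract_text() or "") for page in reader.pages)


# Demo with placeholder bytes instead of a real binaryFile row
stream = pdf_stream(bytearray(b"%PDF-1.7\n"))
print(stream.read(5))  # b'%PDF-'
```

Keep in mind that collect() brings the whole file into driver memory, so this approach suits small to medium PDFs.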