How can I read pdf or pptx or docx files in python from ADLS gen2 using Synapse?


I am looking to read files of different formats with Python in a Synapse notebook. These include .pdf, .pptx, .docx, .msg, and .eml. I would like to read the files in, then parse and manipulate them with Python. I was able to do this in Databricks using different Python libraries.

This is how I accomplished this in Databricks:

from pptx import Presentation
prs = Presentation(file_name)

# for pdf
from pypdf import PdfReader
reader = PdfReader(open(file_name, 'rb'))

# word docs
import docx
doc = docx.Document(file_name)

# .eml files
import email
msg = email.message_from_file(open(file_name))

# .msg files
import extract_msg
msg = extract_msg.Message(file_name)

In Synapse I have been getting an error: FileNotFoundError: [Errno 2] No such file or directory.

These file paths work for reading csv, Excel or txt data using Spark or pandas, so I don't think there is an authorization or connectivity issue. The format is: abfs[s]://file_system_name@account_name.dfs.core.windows.net/file_path

I also tried mounting the storage location (see "Mounting Storage locations in Synapse"). This helped with reading text files but not the other formats.

EB613 On BEST ANSWER

Mounting was the right approach, as this answer explains. I was using Synapse Studio. The key was to use the path format returned by getMountPath for the mounted storage. Beyond that, I could use essentially the same code as in my question. The only change was for PDFs, where I had to switch from the pypdf library to PyPDF2.

The format that worked was:

path = mssparkutils.fs.getMountPath("/mounted_name") 
# this gave me this format '/synfs/{jobId}/mounted_path/{filename}'

What did not work was the format returned by mssparkutils.fs.ls:

mssparkutils.fs.ls("synfs:/{jobId}/mounted_path/")
# this gave a different format, which did not work: 'synfs:/{jobId}/mounted_path/{filename}'
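
Going by the two formats shown above, the only difference is the leading 'synfs:/' versus '/synfs/'. As a minimal sketch, a path from the listing could be rewritten into the local form like this. The helper name synfs_uri_to_local_path is hypothetical (it is not part of mssparkutils), and it assumes the two formats really differ only in that prefix:

```python
def synfs_uri_to_local_path(uri: str) -> str:
    """Rewrite 'synfs:/{jobId}/...' as the local form '/synfs/{jobId}/...'.

    Hypothetical helper: assumes the two formats differ only in the prefix,
    as in the examples in this answer.
    """
    prefix = "synfs:/"
    if not uri.startswith(prefix):
        raise ValueError(f"Not a synfs URI: {uri!r}")
    # Drop the scheme and any extra leading slashes, then add the local root.
    return "/synfs/" + uri[len(prefix):].lstrip("/")
```

The result is a plain local path, which is what libraries built on Python's open() expect.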

Here is the whole process:

First, install the library you need. Then mount the storage (mounting is described here) and get the mounted path. Finally, read the file using the PyPDF2 library.

!pip install PyPDF2

# Then mount the storage location
from notebookutils import mssparkutils
mssparkutils.fs.mount("abfss://mycontainer@<accountname>.dfs.core.windows.net", "/test", {"LinkedService": "mygen2account"})

# get mounted path
path = mssparkutils.fs.getMountPath("/test")
file_name = path + '/filename'

# now read the file
from PyPDF2 import PdfReader

reader = PdfReader(open(file_name, 'rb'))
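
The other formats from the question can be opened the same way once file_name is a local mounted path. The following is only a sketch: load_document is a hypothetical helper name, the third-party imports (PyPDF2, python-pptx, python-docx, extract_msg) are assumed to be installed in the Synapse session, and each import is deferred so that only the library for the requested format is needed.

```python
import email
from email import policy
from pathlib import Path


def load_document(file_name):
    """Open a file at a local (mounted) path with the parser for its extension."""
    suffix = Path(file_name).suffix.lower()
    if suffix == ".pdf":
        from PyPDF2 import PdfReader  # assumed installed via pip
        return PdfReader(file_name)
    if suffix == ".pptx":
        from pptx import Presentation  # python-pptx
        return Presentation(file_name)
    if suffix == ".docx":
        import docx  # python-docx
        return docx.Document(file_name)
    if suffix == ".msg":
        import extract_msg
        return extract_msg.Message(file_name)
    if suffix == ".eml":
        # Standard library only; binary mode avoids guessing the text encoding.
        with open(file_name, "rb") as fh:
            return email.message_from_binary_file(fh, policy=policy.default)
    raise ValueError(f"Unsupported file type: {suffix!r}")
```

For example, load_document(path + '/filename.docx') would return a python-docx Document, assuming that file exists under the mount.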