Microsoft Graph export to pdf using python-o365 gives invalid file


I am trying to download a Word document stored in OneDrive as a PDF using the python-o365 library, but the downloaded file cannot be opened in Adobe Acrobat. I get the error "Adobe Acrobat could not open 'Output.pdf' because it is either not a supported format...etc". Part of my code is shown below:

my_drive = storage.get_default_drive()
attachments_folder = my_drive.get_special_folder('attachments')
items = attachments_folder.get_items()
target_file = "Example.docx"
file = list(filter(lambda x: target_file == x.name, items))[0]
file.download(to_path = r"C:\Users\UserX\OneDrive WordToPdf", name="Output.pdf",convert_to_pdf=True)

The downloaded file only has a .pdf extension; it is actually still a Word file, and it opens in Word.

When I remove the extension from name, changing the call to

file.download(to_path = r"C:\Users\UserX\OneDrive WordToPdf", name="Output",convert_to_pdf=True)

the resulting file has a .docx extension, but it opens in Adobe Acrobat and not in Word.

How can I get this working properly? I am currently working around it by changing the extension after the file is downloaded.

Answer:

I was able to reproduce the issue, so I looked a little deeper into the source code at the link below.

https://github.com/O365/python-o365/blob/master/O365/drive.py

Let's focus on the part of that file that is responsible for converting and downloading the file as a PDF.

As far as I understand, the file is downloaded in PDF format only when both of these hold:

  • The destination file name's suffix is in the list allowed_pdf_extensions (defined at the top of the same file).

  • convert_to_pdf is True.

What is happening?

In our case, when you give a destination file name such as "ABC.pdf", the code takes the extension of the destination file (.pdf). Since .pdf is not in the list allowed_pdf_extensions, the line below is never executed and the file is downloaded as a normal docx:

params['format'] = 'pdf'

That is also why it behaves differently when you don't give an extension: the source file's extension (.docx) is used for the destination file. Since .docx is in allowed_pdf_extensions and convert_to_pdf is True, the file is downloaded in PDF format, but it is named with the .docx extension.
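The behavior described above can be modeled in a few lines. This is only a simplified sketch of my reading of the logic, not the library's actual code: plan_download and CONVERTIBLE_SUFFIXES are hypothetical names, and the real list in drive.py contains more extensions.

```python
from pathlib import Path

# Hypothetical stand-in for the library's list of convertible extensions;
# the real list lives in drive.py and may differ.
CONVERTIBLE_SUFFIXES = {".doc", ".docx", ".pptx", ".xlsx"}


def plan_download(source_name, name=None, convert_to_pdf=False):
    """Model of the behavior described above: return (saved_name, params)."""
    # If the requested name has no suffix, the source file's suffix is appended.
    if name and not Path(name).suffix:
        name = name + Path(source_name).suffix
    name = name or source_name

    params = {}
    # The check looks at the destination name's suffix, not the source file's.
    if convert_to_pdf and Path(name).suffix in CONVERTIBLE_SUFFIXES:
        params["format"] = "pdf"
    return name, params


# Destination ends in .pdf: no conversion is requested, so a docx is saved as Output.pdf.
print(plan_download("Example.docx", "Output.pdf", convert_to_pdf=True))
# ('Output.pdf', {})

# No suffix given: conversion is requested, but the PDF is saved as Output.docx.
print(plan_download("Example.docx", "Output", convert_to_pdf=True))
# ('Output.docx', {'format': 'pdf'})
```

Both results match what the question reports: a Word file named Output.pdf in the first case, and a PDF named Output.docx in the second.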

Possible workarounds:

I was able to temporarily bypass the behavior by adding ".pdf" to the list in the copy of drive.py installed on my machine.

For now, you could write a piece of code that renames the downloaded file so that its extension matches its actual format.
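A minimal sketch of such a rename step, using only the standard library. fix_pdf_extension is a hypothetical helper name, and it assumes the file was downloaded with name="Output" (no extension), so that the converted PDF ends up on disk as Output.docx. It checks for the "%PDF-" signature at the start of the file before renaming, so a real Word file is left alone.

```python
from pathlib import Path


def fix_pdf_extension(path):
    """Rename `path` to end in .pdf if its content starts with the PDF signature."""
    path = Path(path)
    with path.open("rb") as fh:
        is_pdf = fh.read(5) == b"%PDF-"
    # Leave the file alone if it is not a PDF or is already named correctly.
    if not is_pdf or path.suffix.lower() == ".pdf":
        return path
    target = path.with_suffix(".pdf")
    path.replace(target)  # overwrites an existing file with the target name
    return target


# Intended use, after the download call from the question:
#   folder = r"C:\Users\UserX\OneDrive WordToPdf"
#   file.download(to_path=folder, name="Output", convert_to_pdf=True)
#   fix_pdf_extension(Path(folder) / "Output.docx")
```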

Alternatively, you could raise this issue with the library's author.