Fetch an attached (embedded) PDF from Excel with either PHPExcel or PhpSpreadsheet


I have an Excel file that contains an embedded (attached) PDF.

I am trying to use PHPExcel and PhpSpreadsheet to fetch the data. I can fetch the images, but other objects such as PDFs are not accessible.

My first attempt is in PHP, but I am also fine with a Python solution if that is possible.


There is 1 best solution below

K J

XLSX is a Zip container of Excel components so we can open the zip file and manipulate the contents.
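Because the asker is open to Python, here is a minimal sketch of treating the workbook as a zip archive using only the standard `zipfile` module. The function name `list_xlsx_members` is hypothetical and used only for illustration.

```python
import zipfile


def list_xlsx_members(xlsx_path):
    """Return the names of all files stored inside an XLSX (zip) container."""
    with zipfile.ZipFile(xlsx_path) as archive:
        return archive.namelist()
```

Printing the returned list for a workbook shows which parts it contains, including any embedded objects.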


Our objects of interest are in the "embeddings" folder. If there is only one embedding, it is easy to extract as oleObject1.bin, so it takes one line to extract it and one line to start an editor or your customised Python find-and-save script.
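The extraction step could be sketched in Python as follows. This assumes the embedded objects sit under `xl/embeddings/` inside the archive, which is where Excel normally places them; the names `EMBED_PREFIX` and `read_embeddings` are hypothetical.

```python
import zipfile

# Assumption: Excel stores embedded objects under this folder in the archive.
EMBED_PREFIX = "xl/embeddings/"


def read_embeddings(xlsx_path):
    """Return {member name: raw bytes} for every file under xl/embeddings/."""
    found = {}
    with zipfile.ZipFile(xlsx_path) as archive:
        for name in archive.namelist():
            # Skip directory entries, keep only real files in the folder.
            if name.startswith(EMBED_PREFIX) and not name.endswith("/"):
                found[name] = archive.read(name)
    return found
```

With a single embedding, the result would hold one entry such as `xl/embeddings/oleObject1.bin`, whose bytes can then be searched for the PDF.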


In that BIN file we can seek the offset of the PDF header, %PDF-, which here is at 00002240.

We also seek the offset of its end-of-file marker, %%EOF\x0A, here at 00004794.


Now, using any method such as head and tail, splice out that PDF (2554 bytes in this case) and save it as BINary.pdf.
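The seek-and-splice steps above can be sketched in Python. This is an illustration, not a robust parser: it assumes the PDF bytes are stored contiguously inside the BIN file, as in the example shown, and the function name `carve_pdf` is hypothetical. It takes the first `%PDF-` and the last `%%EOF`, since a PDF may contain more than one end-of-file marker.

```python
PDF_HEADER = b"%PDF-"
PDF_EOF = b"%%EOF"


def carve_pdf(blob):
    """Return the bytes from the first %PDF- to the last %%EOF, or None."""
    start = blob.find(PDF_HEADER)
    if start == -1:
        return None
    eof = blob.rfind(PDF_EOF)
    if eof < start:
        return None
    end = eof + len(PDF_EOF)
    # Keep an optional line ending after the marker (CRLF, CR or LF).
    if blob[end:end + 2] == b"\r\n":
        end += 2
    elif blob[end:end + 1] in (b"\r", b"\n"):
        end += 1
    return blob[start:end]
```

To save the result, read the BIN file in binary mode, pass its bytes to `carve_pdf`, and write the returned bytes to a `.pdf` file opened with mode `"wb"`.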



I wrote a script to extract a PDF from an Office bin file on Windows, so after unpacking the archive with TAR, Windows users can run this script. NOTE: it has 2 small .exe dependencies that you need to download, and you must specify their path, so see and edit the start of the file. You should be able to emulate the same steps in PHP or Python, so for starters see https://stackoverflow.com/a/56742848/10802527

@echo off
REM dependencies are 
REM Didier Stevens middle.exe from https://blog.didierstevens.com/programs/binary-tools/
REM Mark Russinovich strings.exe from https://learn.microsoft.com/en-us/sysinternals/downloads/strings

REM both above to be placed on path or folder e.g.
set "utils=C:\Downloads\Apps\utils"

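setlocal enableDelayedExpansion
if not exist "%~dpn1.bin" echo %0 requires a bin file to work on & pause & exit /b

REM list strings with their offsets and keep the first PDF header line
set "HEAD="
"%utils%\strings.exe" -o "%~1"|Findstr "%%PDF-">AcroHEAD.txt
set /p HEAD=<AcroHEAD.txt
if not defined HEAD echo %%PDF- Header not found & del Acro????.txt & pause & exit /b
echo !HEAD! >AcroHEAD.txt
for /f "tokens=1 delims=:" %%f in (AcroHEAD.txt) do set START=%%f

REM the loop overwrites TAIL on every match, so the last EOF marker wins
"%utils%\strings.exe" -o "%~1"|Findstr "%%EOF">AcroTAIL.txt
for /f "tokens=1 delims=:" %%f in (AcroTAIL.txt) do set TAIL=%%f
REM add 6 bytes so the EOF marker and its trailing newline are included
set /a LEN=%TAIL%+6-%START%

del Acro????.txt
REM copy LEN bytes starting at offset START into a pdf next to the bin file
"%utils%\middle.exe" "%~1" %START% %LEN% "%~dpn1.pdf"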