Get all images in a Word document


I am trying to get a collection of the images in a Word document. The documentation at https://dev.office.com/reference/add-ins/word/inlinepicture just copies and pastes the same example throughout, and it only shows how to get the first image, not all of them.

I need the following things per image:

  • The data
    In any format is fine. I see there is a getBase64ImageSrc method - this will do.
  • The filename
    No filename is fine; I can see the API does not expose one. I can build it from the alt text, or just use image_{n} where {n} is the image index. But I cannot see a way to get the extension. Is it in the data, as in data:image/jpeg;blahblah? I don't know, because the docs don't go into this level of detail.

I have the following code so far but am really unsure if it will work:

Word.run(

async (context) =>
{
    // Create a proxy object for the pictures.
    const allPictures = context.document.body.inlinePictures;

    // Queue a command to load the pictures
    context.load(allPictures);

    // Synchronize the document state by executing the queued commands,
    // and return a promise to indicate task completion.
    return context.sync().then(() => allPictures);
})
.then((allPictures) =>
{
    const images: IFileData[] = [];
    let picture: Word.InlinePicture | undefined;
    let imageCount = 0;

    while (undefined !== (picture = allPictures.items.pop()))
    {
        const data = picture.getBase64ImageSrc();
        const extension = ""; // TODO: no idea how to find this
        const filename =
            (
                Strings.isNullOrEmpty(picture.altTextTitle)
                    ? `image_${imageCount++}`
                    : Path.toFriendlyUrl(picture.altTextTitle)
            )

        images.push({
            filename: filename + extension,
            data: data
        });
    }

    resolve(images);
})
.catch((e) => reject(e));

I am using some custom helpers here; they do the following:

  • Strings.isNullOrEmpty
    Returns true if the string is null or empty, otherwise false
  • Path.toFriendlyUrl
    Returns the string with spaces converted to - and some other improvements

Is my current approach correct?

Juan Balmori On BEST ANSWER

Please check out this sample that does what you need. I think you are on the right track.

Here is some sample code:

async function run() {
    await Word.run(async (context) => {

        let myImages = context.document.body.inlinePictures;
        myImages.load("imageFormat");

        await context.sync();
        
        if (myImages.items.length > 0) {
            console.log(myImages.items[0].imageFormat);
        } else {
            console.log("no image found.");
        }


    });
}

Note that there is an imageFormat property; the catch is that it is currently only available on the preview CDN (use https://appsforoffice.microsoft.com/lib/beta/hosted/office.js). There is no image name, but you can use the alt text to store one.
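
To turn imageFormat into a file extension and combine it with the alt text, something like the following might work. This is only a sketch: extensionFor and buildFilename are hypothetical helper names, and the format strings ("Png", "Jpeg", and so on) are assumed to match what the preview imageFormat property returns, so check them against the values you actually see.

```typescript
// Assumed imageFormat values mapped to file extensions (not an exhaustive list).
const EXTENSIONS: { [format: string]: string } = {
    Bmp: ".bmp",
    Jpeg: ".jpg",
    Gif: ".gif",
    Tiff: ".tif",
    Png: ".png",
    Icon: ".ico",
    Wmf: ".wmf",
    Emf: ".emf",
};

// Hypothetical helper: returns "" for unknown formats rather than guessing.
function extensionFor(imageFormat: string): string {
    return Object.prototype.hasOwnProperty.call(EXTENSIONS, imageFormat)
        ? EXTENSIONS[imageFormat]
        : "";
}

// Hypothetical helper: uses the alt text title when present, otherwise image_{n}.
function buildFilename(
    altTextTitle: string | null | undefined,
    index: number,
    imageFormat: string
): string {
    const title = (altTextTitle || "").trim();
    const base = title.length > 0 ? title.replace(/\s+/g, "-") : `image_${index}`;
    return base + extensionFor(imageFormat);
}
```

Inside Word.run you would load "imageFormat" and "altTextTitle" on the collection, sync, and then call buildFilename(picture.altTextTitle, i, picture.imageFormat) for each item.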

Cindy Meister On

"Correct" is what works... I can address the one specific question: getting the image type - what you call "file name". Since this is a bit long, the answer is: you can, but you have to work for it a bit.

Word does not always store a file name, as such, for an image in a document unless that image is linked to an outside source. What it does store, however, is the image itself with the necessary information to manage it within the Word Open XML document. One part of the information stored is the graphic image type as part of an internal relationship between the document and the image's binary code.

The object model (whether JS or COM) does not provide any direct access to this information. It can be read, however, from the document's Word Open XML. This code can obtain the specific Word Open XML string for the InlineShape in the OPC flat file format:

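    // Get the first inline picture and queue a request for its Word Open XML.
    const picture = context.document.body.inlinePictures.getFirst();
    const ooxml = picture.getRange("Whole").getOoxml();

    // getOoxml() returns a ClientResult; its value is only available after sync.
    await context.sync();

    console.log(ooxml.value);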

In the document.xml part of the Open XML an InlineShape is referenced (in part) as follows - see the very last element with the attribute r:embed="rId6".

<w:p><w:r><w:drawing><wp:inline distT="0" distB="0" distL="0"
distR="0"><wp:extent cx="2944608" cy="1753392"/><wp:effectExtent l="0"
t="0" r="8255" b="0"/><wp:docPr id="1" name="Picture 1"/>
<wp:cNvGraphicFramePr><a:graphicFrameLocks noChangeAspect="1" 
xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main"/>
</wp:cNvGraphicFramePr><a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
<a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/picture">
<pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">
<pic:nvPicPr><pic:cNvPr id="0" name="Schweiz.png"/><pic:cNvPicPr/></pic:nvPicPr>
<pic:blipFill><a:blip r:embed="rId6">...

rId6 is the relationship ID - it tells Word where to look up the details about the embedded image. This is found in <pkg:part pkg:name="/word/_rels/document.xml.rels", like this:

<Relationship Id="rId6" 
Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" 
Target="media/image1.png"/>

As you can see, the file type is available here. If you use standard XML tools to parse the XML string, you can get the information like this.
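
As a rough illustration of that lookup, the sketch below pulls the r:embed ID out of the drawing XML, finds the matching Relationship, and takes the extension from its Target. The function names are hypothetical, and it uses regular expressions only to stay short and dependency-free; in an add-in you would more likely use DOMParser or another real XML parser, and this version assumes double-quoted attributes as in the samples above.

```typescript
// Hypothetical helper: returns the first r:embed relationship ID found on an a:blip element.
function findEmbedId(ooxml: string): string | undefined {
    const match = /<a:blip\b[^>]*\br:embed="([^"]*)"/.exec(ooxml);
    return match ? match[1] : undefined;
}

// Hypothetical helper: returns the Target of the Relationship with the given Id.
function findRelationshipTarget(ooxml: string, relId: string): string | undefined {
    const tags = ooxml.match(/<Relationship\b[^>]*>/g) || [];
    for (const tag of tags) {
        const id = /\bId="([^"]*)"/.exec(tag);
        const target = /\bTarget="([^"]*)"/.exec(tag);
        if (id && target && id[1] === relId) {
            return target[1];
        }
    }
    return undefined;
}

// Returns the extension including the dot, or "" if the target has none.
function extensionOf(target: string): string {
    const dot = target.lastIndexOf(".");
    return dot >= 0 ? target.slice(dot) : "";
}
```

With the XML shown above, findEmbedId gives rId6, findRelationshipTarget resolves that to media/image1.png, and extensionOf yields .png.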

An alternative to using standard XML techniques would be to analyze the Word Open XML using either the standard Microsoft Open XML SDK (C# or VB.NET) or use the Open XML SDK for JavaScript (http://www.ericwhite.com/blog/open-xml-sdk-for-javascript/). In this case, you're not able to read the "rels" directly. Instead, the "Tools" look up the corresponding "package" (in this case, "media/image1.png") and return that information. As you can see, that includes the attribute pkg:contentType, which gives you the file extension.

<pkg:part pkg:name="/word/media/image1.png" pkg:contentType="image/png" pkg:compression="store">
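
If you already have the flat OPC string, the content type can also be read straight from the pkg:part element. The following is a minimal sketch under the same caveats: contentTypeOfPart and extensionForContentType are hypothetical names, the regex matching stands in for proper XML parsing, and the MIME-to-extension table is an assumption covering only a few common image types.

```typescript
// Assumed mapping from image content types to extensions (not exhaustive).
const MIME_EXTENSIONS: { [contentType: string]: string } = {
    "image/png": ".png",
    "image/jpeg": ".jpg",
    "image/gif": ".gif",
    "image/bmp": ".bmp",
    "image/tiff": ".tif",
};

// Hypothetical helper: returns the pkg:contentType of the part with the given pkg:name.
function contentTypeOfPart(ooxml: string, partName: string): string | undefined {
    const tags = ooxml.match(/<pkg:part\b[^>]*>/g) || [];
    for (const tag of tags) {
        const name = /\bpkg:name="([^"]*)"/.exec(tag);
        const type = /\bpkg:contentType="([^"]*)"/.exec(tag);
        if (name && type && name[1] === partName) {
            return type[1];
        }
    }
    return undefined;
}

// Hypothetical helper: returns "" when the content type is not in the table.
function extensionForContentType(contentType: string): string {
    return Object.prototype.hasOwnProperty.call(MIME_EXTENSIONS, contentType)
        ? MIME_EXTENSIONS[contentType]
        : "";
}
```

The part name is the relationship Target prefixed with /word/, so media/image1.png becomes /word/media/image1.png.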