Automated extraction of word docx archive and pretty print conversion of xml files

Question

546 Views Asked by grenix At 17 August 2025 at 08:52

If you rename a file such as document.docx to document.docx.unzipped.zip, it is possible to extract that archive, e.g. to a folder 'document.docx.unzipped'. Unfortunately, the extracted XML files are not very readable, since each file's XML content is on one single line.

I would like to automate the process of extracting a docx archive and converting all XML files in the extraction folder (document.docx.unzipped) to readable, pretty-printed versions (like Notepad++ --> Extensions --> XML Tools --> Pretty Print (XML only with line breaks)).

Any ideas for a quick approach?

EDIT1: Modified version of the idea from https://stackoverflow.com/users/1761490/pawel-jasinski

#!/bin/sh


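# This script unpacks a docx file and reformats its XML parts.
#
# unzip and xmllint must be in your path; the XSLT step (currently
# commented out below) also needs an XSLT processor (Transform), e.g.
# /c/Program Files/Saxonica/SaxonHE9.4N/bin/Transform
#
# Keep remove-rsid.xslt and copy.xslt in the same directory as this script.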
if [ "$1" = "-r" ]; then
    remove_rsid=1
    shift
fi

if [ "$1" = "" ]; then
    echo expected name of the word document to be exploded
    exit 1
fi
suffix=${1##*.}
name="$1"

if [ "$suffix" = "xml" ]; then
    suffix=docx
    # POSIX form of ${1/%.xml/.docx}, which plain sh (e.g. dash) rejects
    name="${1%.xml}.docx"
fi

if [ "$suffix" = "$1" ]; then
    suffix=docx
    name=$1.docx
fi


corename=$(basename "$name" ".$suffix")
if [ -z "$corename" ]; then
    echo can not work with empty name
    exit 1
fi

DIR="$( cd "$( dirname "$0" )" && pwd )"
DOSDIR=$(cygpath -m "$DIR")
FLAT=$PWD/$corename.tmp/flat.$$
FLATOUT=$PWD/$corename.tmp/flat.$$.out


if [ "$remove_rsid" = "1" ]; then
    transform=$DOSDIR/remove-rsid.xslt
else
    transform=$DOSDIR/copy.xslt
fi

# $1 - file name
#
# Pretty-prints the XML file in place (via a temporary .new file)
_reformat_xml() {
    echo "reformat $1"
    #read pause
    xmllint --format "$1" -o "$1.new" &&
        mv "$1.new" "$1"
}

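# Moves every XML part (and every .rels part, renamed to .xml) into $FLAT,
# encoding the relative path in the file name ('/' becomes '@'), then
# pretty-prints it. The moved names are remembered in $xmls and $rels.
flaten() {
    # xml
    xmls=""
    pwd
    #read pause
    for f in $(find . -name '*.xml'); do
        ff=$(echo "${f#./}" | tr '/' '@')
        echo mv "$f" "$FLAT/$ff"
        mv "$f" "$FLAT/$ff"
        _reformat_xml "$FLAT/$ff"
        xmls="$xmls $ff"
    done

    # for rels, rename into .xml
    rels=""
    for f in $(find . -name '*.rels'); do
        ff=$(echo "${f#./}" | tr '/' '@')
        rels="$rels $ff"
        mv "$f" "$FLAT/$ff.xml"
        _reformat_xml "$FLAT/$ff.xml"
        #read pause
    done
}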

# Moves the files from $FLATOUT back to their original paths
# ('@' in the flat name becomes '/' again)
expand_dirs() {
    target_dir=$(pwd)
    cd "$FLATOUT"

    echo "PWD: $PWD"
    #read pause

    for f in $rels ; do
        ff=$(echo "$f" | tr '@' '/')
        mv "$f.xml" "$target_dir/$ff"
    done

    for f in $xmls ; do
        echo "PWD: $PWD"
        #read pause
        ff=$(echo "$f" | tr '@' '/')
        mv "$f" "$target_dir/$ff"
    done
    cd "$target_dir"
}

echo "corename: $corename"
read pause  # debugging pause: press Enter to continue
if [ -e "$corename" ]; then
    if [ -e "$corename.bak" ];then
        # echo removing $corename.bak
        rm -rf "$corename.bak"
    fi
    # echo backing up $corename
    mv "$corename" "$corename.bak"
fi 


mkdir "$corename"
cd "$corename"
unzip -q "../$name"

if [ -e "$FLAT" ]; then
    rm -rf "$FLAT"
fi
mkdir -p "$FLAT"

flaten

if [ -e "$FLATOUT" ]; then
    rm -rf "$FLATOUT"
fi
mkdir -p "$FLATOUT"
#exit

# XSLT step is disabled here; the reformatted files are copied unchanged
#dosflat=$(cygpath -m $FLAT)
#Transform -xsl:$transform -s:$dosflat -o:$dosflat.out
cp -R "$FLAT"/* "$FLATOUT"

expand_dirs

read pause  # debugging pause: press Enter before the temporary directories are removed
rm -rf "$FLAT" "$FLATOUT"

Original Q&A

**Pawel Jasinski** · Answer 1

If you have ever used Cygwin: it includes xmllint, which has a --format option. That was my original approach. However, xmllint did not format attributes the way I liked, so I developed my own script. Since Word documents contain a lot of rsid noise, the script has an option to remove it.
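
For the plain xmllint route, a minimal sketch could look like the following. It is not the script from this answer. It assumes unzip and xmllint are on the PATH, and the function name pretty_print_docx is made up for illustration. The default output folder follows the 'document.docx.unzipped' naming from the question.

```shell
#!/bin/sh
# Sketch only: unzip a .docx and pretty-print every XML part in place.
# Assumptions: unzip and xmllint are installed; pretty_print_docx is a
# hypothetical name, not part of any existing tool.
pretty_print_docx() {
    docx=$1
    out=${2:-$docx.unzipped}   # default: <file>.unzipped next to the docx

    # -o overwrites files from an earlier run, -d creates the target folder
    unzip -q -o "$docx" -d "$out" || return 1

    # .rels parts are XML as well, so format them together with *.xml
    find "$out" -type f \( -name '*.xml' -o -name '*.rels' \) |
    while IFS= read -r f; do
        # write to a temp file first, so a parse error leaves the part untouched
        if xmllint --format "$f" > "$f.tmp"; then
            mv "$f.tmp" "$f"
        else
            rm -f "$f.tmp"
            echo "skipped: $f" >&2
        fi
    done
}
```

Calling `pretty_print_docx document.docx` should leave the formatted parts in document.docx.unzipped. No renaming to .zip is needed, because unzip does not care about the file extension.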

I use the following workflow:

1. get a Word document, let's say foo.docx
2. explode-docx -r foo.docx
3. edit foo.docx - make a small change
4. explode-docx -r foo.docx
5. kdiff3 foo foo.bak
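
If no XSLT processor is at hand, the rsid noise can be approximated away with sed before diffing. This is only a sketch and a stand-in for the idea, not the remove-rsid.xslt used by the script (which is not shown here). It assumes the noise consists of attributes named w:rsid... (such as w:rsidR or w:rsidRDefault) and works on text, not on parsed XML. The names strip_rsid and strip_rsid_tree are made up for illustration.

```shell
#!/bin/sh
# Sketch only: text-based removal of w:rsid* attributes.
# Assumption: attributes look like  w:rsidR="00AB12CD"  (name starts with
# w:rsid, value in double quotes). Not an XML-aware transform.

# Filter: reads WordprocessingML on stdin, writes it without rsid attributes.
strip_rsid() {
    sed -E 's/ w:rsid[A-Za-z]*="[^"]*"//g'
}

# Applies strip_rsid in place to every *.xml file below directory $1.
strip_rsid_tree() {
    find "$1" -type f -name '*.xml' |
    while IFS= read -r f; do
        strip_rsid < "$f" > "$f.tmp" && mv "$f.tmp" "$f"
    done
}
```

With the workflow above, that would be `strip_rsid_tree foo; strip_rsid_tree foo.bak` followed by `diff -ru foo.bak foo` (or kdiff3). The <w:rsids> list in settings.xml is left alone by this sketch.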
