OpenOffice odt document, regex, and arrays

I am trying to work with a ~300-page odt document. I know how to load documents in Python, at least in a basic way, but that didn't work for odt (it isn't a plain-text file). I researched this and installed the odfpy library, although it doesn't seem well documented. I've got as far as loading the document's paragraphs into an array, but I don't know how to run a regex across multiple array entries. So I tried converting the array to a string with str(), and all I got was a long list of object addresses.

I want to load an odt document and run a regex to remove a certain pattern from it, while keeping the structure of the odt intact. How do I go about doing this? What I've tried so far doesn't work. I'm more used to txt.

import sys
import re
from odf.opendocument import load
from odf import text, teletype
infile = load(r'C:\Users\Iainc\Documents\Blah Blah.odt')
allparas = infile.getElementsByType(text.P)
stringallparas = str(allparas)

This is what I have so far that, I believe, works. But certain things that would work with .txt aren't working here.

There is 1 solution below

Something like the following might work. Replace 'Your pattern here' with the regex pattern you want to remove.

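import re
from odf.opendocument import load
from odf import text, teletype

infile = load(r'C:\Users\Iainc\Documents\Blah Blah.odt')
# getElementsByType returns a list collected up front, so nodes can be swapped inside the loop
for item in infile.getElementsByType(text.P):
    # Flatten the paragraph (including any nested spans) to a plain string
    s = teletype.extractText(item)
    m = re.sub(r'Your pattern here', '', s)
    if m != s:
        # Build a replacement paragraph. Only the paragraph style is carried over;
        # inline formatting inside the original paragraph is lost.
        new_item = text.P()
        style = item.getAttribute('stylename')
        if style is not None:
            new_item.setAttribute('stylename', style)
        new_item.addText(m)
        item.parentNode.insertBefore(new_item, item)
        item.parentNode.removeChild(item)

infile.save('result.odt')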

The for loop in this code was taken from ReplaceOneTextToAnother on the odfpy wiki and slightly modified to use re.sub instead of str.replace and text.P instead of text.Span.
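
On the "regex across multiple array entries" part of the question: as far as I can tell, str() on the list only prints the repr of each element object, which is where the addresses come from. The regex has to be applied to each paragraph's extracted text, one entry at a time. Here is a minimal sketch of just that step using plain strings, so it runs without odfpy. The helper name strip_pattern, the sample paragraphs, and the "[note N]" pattern are all made up for illustration. In the real script the strings would come from teletype.extractText(item).

```python
import re

def strip_pattern(paragraph_texts, pattern):
    """Apply re.sub to each paragraph string separately.

    Returns a list of (index, new_text) pairs for the paragraphs that changed.
    """
    regex = re.compile(pattern)
    changed = []
    for index, original in enumerate(paragraph_texts):
        cleaned = regex.sub('', original)
        if cleaned != original:
            changed.append((index, cleaned))
    return changed

# Stand-in for the text extracted from each text.P element (hypothetical data)
paragraphs = [
    'Chapter 1 [note 12]',
    'Nothing to remove here.',
    'Some text [note 3] in the middle.',
]

# Hypothetical pattern: a bracketed note such as "[note 12]" plus any space before it
changes = strip_pattern(paragraphs, r'\s*\[note \d+\]')
print(changes)
```

Only the entries that actually changed come back, which mirrors the `if m != s` check above: untouched paragraphs keep their original nodes and formatting.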