OpenOffice odt document, regex, and arrays

352 Views Asked by Iain Curtis-Shanley At 07 June 2025 at 09:32

I am trying to work with a ~300-page odt document. I know how to load documents in Python, at least in a basic way, but that didn't work for odt (it isn't a txt file). I researched this and installed the odfpy library, although it doesn't seem well documented. I can get as far as having an array (list) of the document's paragraphs, but I don't know how a regex would work across multiple array entries. So I tried converting the array to a string with str(), and all I got was a long list of addresses.

I want to load an odt document and run a regex to remove a certain pattern from it. How do I go about doing this? What I've tried so far doesn't work. I'd like to keep the structure of the odt intact. I'm more used to txt.

import sys
import re
from odf.opendocument import load
from odf import text, teletype
infile = load(r'C:\Users\Iainc\Documents\Blah Blah.odt')
allparas = infile.getElementsByType(text.P)
stringallparas = str(allparas)

This is what I have so far that, I believe, works. But certain things that would work with a .txt file aren't working here.

There is 1 best solution below

Nathan Mills On 07 January 2022 at 23:30

Something like the following might work. Replace 'Your pattern here' with the regex pattern you want to remove.

import re
from odf.opendocument import load
from odf import text, teletype

infile = load(r'C:\Users\Iainc\Documents\Blah Blah.odt')
for item in infile.getElementsByType(text.P):
    # Flatten the paragraph, including any nested spans, to a plain string
    s = teletype.extractText(item)
    m = re.sub(r'Your pattern here', '', s)
    if m != s:
        # Build a replacement paragraph that reuses the original paragraph style
        new_item = text.P()
        new_item.setAttribute('stylename', item.getAttribute('stylename'))
        new_item.addText(m)
        # Put the new paragraph where the old one was, then drop the old one
        item.parentNode.insertBefore(new_item, item)
        item.parentNode.removeChild(item)

infile.save('result.odt')

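The for loop in this code was taken from ReplaceOneTextToAnother on the odfpy wiki and slightly modified to use re.sub instead of str.replace and text.P instead of text.Span. Note that a paragraph that changes is rebuilt from plain text, so inline formatting inside it (bold or italic spans, for example) is not carried over; paragraphs the pattern does not match are left untouched.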
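
On the "regex across multiple array entries" part of the question: the pattern is applied to one paragraph string at a time, not to the list as a whole. Here is a minimal sketch of that idea using only the standard library, so it can be tried without an odt file. The helper name strip_pattern and the sample strings are made up for illustration; the strings stand in for what teletype.extractText would return for each paragraph.

```python
import re


def strip_pattern(paragraphs, pattern):
    """Remove `pattern` from each paragraph string.

    Returns the cleaned strings and the indices of the paragraphs
    that actually changed (the ones that would need rebuilding).
    """
    regex = re.compile(pattern)
    cleaned = []
    changed = []
    for index, para in enumerate(paragraphs):
        new_para = regex.sub('', para)
        if new_para != para:
            changed.append(index)
        cleaned.append(new_para)
    return cleaned, changed


# Hypothetical sample data standing in for extracted paragraph text
sample = ['Chapter 1 [draft]', 'No marker here.', 'End [draft] of text']
cleaned, changed = strip_pattern(sample, r'\s*\[draft\]')
print(cleaned)
print(changed)
```

Tracking which indices changed mirrors the `if m != s` check in the loop above: only those paragraphs need to be replaced in the document. One consequence of working per paragraph is that a pattern spanning a paragraph boundary will not match.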
