Regular expression search in LibreOffice Writer documents using pyuno appears extremely greedy

I have a LibreOffice Writer document that contains text snippets of the form prefix<...>. In Writer I can easily locate them by searching with a regular expression.

Now I would like to build a Python list of all these occurrences using pyuno in a standalone Python script that runs outside LibreOffice.

The code that I have collected from a variety of sources looks like this and seems to work so far:

import os
import time

import uno

SOCKET = 'socket,host=localhost,port=2002;urp;'
file = '/home/jochen/Dokumente/regexp_find_test.odt'

# Start LibreOffice with the document and let it listen on the socket.
office_proc = os.popen('/usr/lib/libreoffice/program/soffice ' + file + ' --accept="' + SOCKET + 'StarOffice.ServiceManager"')
time.sleep(3)  # give the office process time to start up

# Connect to the running office instance.
localContext = uno.getComponentContext()
resolver = localContext.ServiceManager.createInstanceWithContext("com.sun.star.bridge.UnoUrlResolver", localContext)

try:
    context = resolver.resolve('uno:' + SOCKET + 'StarOffice.ComponentContext')
except Exception as exc:
    raise Exception("failed to connect to LibreOffice with socket {}".format(SOCKET)) from exc

# Search the current document with a regular expression.
loffice_desktop = context.ServiceManager.createInstanceWithContext("com.sun.star.frame.Desktop", context)
comp = loffice_desktop.getCurrentComponent()
search_descr = comp.createSearchDescriptor()
search_descr.SearchRegularExpression = True
search_descr.setSearchString('prefix<[a-z_]+>')
res = comp.findAll(search_descr)

# Print the number of matches, then each match.
print(len(res))
for n in range(len(res)):
    print(40*'-')
    print(res[n].Text.getText().getString())

The output that I am getting surprises me, since I use the same expression as in Writer:

12
----------------------------------------
prefix<vorname> prefix<name>
prefix<ort> prefix<strasse> prefix<haus_nummer>

Lorem ipsum dolor sit amet, consectetur adipisici elit, sed eiusmod tempor incidunt ut labore et dolore magna aliqua. prefix<name> Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquid ex ea commodi consequat. Quis aute iure reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint obcaecat cupiditat non proident, sunt in prefix<ort> culpa qui officia deserunt mollit anim id est laborum.

Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat, vel illum dolore eu feugiat nulla facilisis at vero eros et accumsan et iusto odio dignissim qui blandit praesent luptatum zzril delenit augue duis prefix<vorname> dolore te feugait nulla facilisi. Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat.

Ut wisi enim ad minim veniam, quis nostrud exerci tation ullamcorper suscipit lobortis nisl ut aliquip ex ea commodo prefix<vorname> consequat. Duis autem vel prefix<name> eum iriure dolor in hendrerit in vulputate velit esse molestie consequat, vel illum dolore eu feugiat nulla facilisis at vero eros et accumsan et iusto odio dignissim qui blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla facilisi.

prefix<name> prefix<unterschrift>
----------------------------------------
prefix<vorname> prefix<name>
prefix<ort> prefix<strasse> prefix<haus_nummer>

I expected something like this:

12
----------------------------------------
prefix<vorname>
----------------------------------------
prefix<name>
----------------------------------------
prefix<ort>
[...]

The expression seems to behave extremely greedily. Are there any suggestions to overcome this, or am I doing something completely wrong?

There is 1 solution below

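This is not greediness; the search results are simply being processed incorrectly. Each element of res is a text range covering exactly one match, but .Text returns the text object that contains that range (here the whole document body), so calling getString() on it yields the entire document text instead of the match.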
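The match count of 12 in the output already suggests that the pattern itself finds the individual snippets. As a rough cross-check, the same pattern can be tried with Python's own re module on the first two lines of the document. This is only an illustration: re is not the regex engine LibreOffice uses, and the sample string is copied from the output shown in the question.

```python
import re

# Sample text taken from the first two lines of the document in the question.
sample = (
    "prefix<vorname> prefix<name>\n"
    "prefix<ort> prefix<strasse> prefix<haus_nummer>"
)

# The character class [a-z_] cannot match '>' or a space,
# so a single match can never run past the closing '>'.
pattern = re.compile(r"prefix<[a-z_]+>")
hits = pattern.findall(sample)

print(len(hits))
for hit in hits:
    print(hit)
```

Each hit is a single prefix<...> snippet, which is consistent with the pattern not being the cause of the oversized output.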

The line

print(res[n].Text.getText().getString())

must change to

print(res[n].String)
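Since the goal was a Python list of all occurrences, the corrected access can be wrapped in a small helper. The following is a minimal sketch, not tested against a live office instance: collect_matches, FakeRange and FakeResult are hypothetical names introduced here, and the two fake classes only stand in for the findAll() result so that the snippet runs without a LibreOffice connection. It assumes the result offers getCount() and getByIndex(), and that each element offers getString().

```python
def collect_matches(found):
    """Return the matched text of every range in a findAll() result."""
    # Each element is a text range; getString() on the range itself
    # returns only the matched text, not the surrounding document text.
    return [found.getByIndex(i).getString() for i in range(found.getCount())]


# Hypothetical stand-ins that mimic the parts of the result used above.
class FakeRange:
    def __init__(self, text):
        self._text = text

    def getString(self):
        return self._text


class FakeResult:
    def __init__(self, strings):
        self._ranges = [FakeRange(s) for s in strings]

    def getCount(self):
        return len(self._ranges)

    def getByIndex(self, index):
        return self._ranges[index]


matches = collect_matches(
    FakeResult(["prefix<vorname>", "prefix<name>", "prefix<ort>"])
)
print(matches)
```

In the script from the question, the call would presumably be collect_matches(comp.findAll(search_descr)).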