Problem with softPageBreaks when reading an ODF file with lxml to create a LaTeX file

I have a Google Doc that I want to convert into a LaTeX file while stripping out some of its content. Specifically, I want to remove all headlines. The document also contains four-line paragraphs in which each line ends with a newline character, so Google Docs treats them as four paragraphs rather than one. I want each of those groups to become a single paragraph in my LaTeX file.

First I extracted content.xml from the ODF file and parsed it:

import os
import zipfile

from lxml import etree

# tempFolder, fileName and contentXml are set earlier (not shown)
os.makedirs(tempFolder, exist_ok=True)

# Extract the contents of the ODF file to the "temp" folder
with zipfile.ZipFile(fileName, "r") as archive:
    archive.extractall(path=tempFolder)

# Open the "content.xml" file inside the "temp" folder and parse it into an XML tree
doc = etree.parse(contentXml)
removeHeadlines(doc)
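
As an aside, an ODF file is a ZIP archive, so content.xml can also be read straight from it without unpacking to a temporary folder. A minimal sketch, assuming the standard library's xml.etree.ElementTree in place of lxml so that it runs without extra dependencies; read_content_xml and the in-memory sample archive are hypothetical and only for illustration:

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def read_content_xml(odf_file):
    """Parse content.xml from an ODF archive given as a path or file-like object."""
    with zipfile.ZipFile(odf_file, "r") as archive:
        return ET.fromstring(archive.read("content.xml"))


# Build a tiny stand-in archive in memory; a real ODF file has more members.
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as archive:
    archive.writestr("content.xml", "<doc><p>Hello</p></doc>")
sample.seek(0)

root = read_content_xml(sample)
print(root.find("p").text)
```

With lxml, etree.fromstring should accept the same bytes, so the helper would probably carry over with only the import changed.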

I remove the headlines via:

def removeHeadlines(doc):
    # In my document the headlines are the paragraphs with style "P2"
    nsmap = doc.getroot().nsmap
    headlines = doc.findall('.//text:p[@text:style-name="P2"]', namespaces=nsmap)

    for p in headlines:
        p.getparent().remove(p)

Then I create the LaTeX file via:

import pylatex

elementList = oldDoc.findall(
    './/text:p[@text:style-name="Standard"]',
    namespaces=oldDoc.getroot().nsmap)
textList = [element.text for element in elementList]

latexDoc = pylatex.Document()

section = pylatex.section.Section(fileName)
latexDoc.append(section)

paragraph = pylatex.section.Paragraph('')
paragraphIsNew = True

for string in textList:
    if string:
        # If the string is not empty, add it to the paragraph
        paragraph.append(string + "\n")
        paragraphIsNew = False
    else:
        if not paragraphIsNew:
            latexDoc.append(paragraph)
            paragraph = pylatex.section.Paragraph('')
            paragraphIsNew = True

latexDoc.append(paragraph)

latexDoc.generate_pdf(newFileName)
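
The grouping step can be separated from pylatex and checked on its own. A minimal sketch, assuming empty or None entries mark paragraph boundaries as in the loop above; group_lines is a hypothetical helper name, not part of my script:

```python
def group_lines(lines):
    """Join consecutive non-empty lines into paragraphs.

    Empty strings and None entries act as separators.
    """
    paragraphs = []
    current = []
    for line in lines:
        if line:
            current.append(line)
        elif current:
            # A separator closes the paragraph collected so far
            paragraphs.append("\n".join(current))
            current = []
    if current:
        paragraphs.append("\n".join(current))
    return paragraphs


lines = ["Line one,", "line two,", None, "", "Second paragraph."]
print(group_lines(lines))
```

Unlike the loop above, this version does not emit a trailing empty paragraph when the input ends with a separator.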

Unfortunately, some lines don't make it into the generated PDF. In content.xml, the affected lines look like this:

<text:p text:style-name="Standard"><text:soft-page-break/>Sad line that doesn't make it,</text:p>
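
One thing worth checking for such a line is where the parser stores its text. In the ElementTree data model, which lxml follows, text that comes after a child element is stored in that child's tail, and the parent's text covers only what precedes the first child. A small sketch, assuming the standard library parser (chosen so it runs without lxml) and the ODF text namespace URI; the two-paragraph sample is made up for illustration:

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
NS = {"text": TEXT_NS}

xml = (
    f'<root xmlns:text="{TEXT_NS}">'
    '<text:p text:style-name="Standard">Happy line,</text:p>'
    '<text:p text:style-name="Standard"><text:soft-page-break/>'
    "Sad line that doesn't make it,</text:p>"
    "</root>"
)
root = ET.fromstring(xml)
happy, sad = root.findall("text:p", NS)
break_element = sad.find("text:soft-page-break", NS)

print(repr(happy.text))          # text directly inside the paragraph
print(repr(sad.text))            # nothing precedes the first child
print(repr(break_element.tail))  # the text sits after the child element
```

If the same holds in my lxml tree, then element.text would be empty for these paragraphs even though the line is present in the file.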

I tried, unsuccessfully, to remove the soft-page-break elements like this:

softPageBreaks = doc.findall('.//text:soft-page-break', namespaces=doc.getroot().nsmap)

# Remove the element, but keep its parent and the text content
for element in softPageBreaks:
    parent = element.getparent()
    text = element.tail
    parent.remove(element)
    parent.text = text    
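
For comparison, here are two ways of keeping the text when a paragraph contains a soft-page-break, sketched with the standard library's ElementTree rather than lxml and with made-up sample XML. paragraph_text and drop_keeping_tail are hypothetical helpers: the first reads all text with itertext(), which both libraries provide; the second removes the element and moves its tail onto the previous sibling or the parent, so that text before the break is not overwritten.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
BREAK_TAG = f"{{{TEXT_NS}}}soft-page-break"


def paragraph_text(paragraph):
    """Return all text inside the paragraph, including text after child elements."""
    return "".join(paragraph.itertext())


def drop_keeping_tail(parent, child):
    """Remove child from parent, re-attaching the child's tail text."""
    tail = child.tail or ""
    siblings = list(parent)
    index = siblings.index(child)
    if index == 0:
        # No earlier sibling: the tail continues the parent's own text
        parent.text = (parent.text or "") + tail
    else:
        previous = siblings[index - 1]
        previous.tail = (previous.tail or "") + tail
    parent.remove(child)


xml = (
    f'<text:p xmlns:text="{TEXT_NS}">First half, '
    "<text:soft-page-break/>second half.</text:p>"
)
paragraph = ET.fromstring(xml)
print(paragraph_text(paragraph))

for element in paragraph.findall(BREAK_TAG):
    drop_keeping_tail(paragraph, element)
print(paragraph.text)
```

With lxml, element.getparent() and element.getprevious() give the parent and the previous sibling directly, so the helper would not need the sibling list. Reading with itertext() leaves the tree untouched, which may be enough if the text is only needed for the LaTeX output.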