I have a Google Doc that I want to convert into a LaTeX file while stripping out some of its content. Specifically, I want to remove all headlines. The document also contains four-line paragraphs in which every line ends with a newline character, so Google Docs treats each of them as four separate paragraphs rather than one. I want each of those to become a single paragraph in my LaTeX file.
First I extract the exported ODF file and parse its content.xml:
import os
import zipfile
from lxml import etree

# Create the "temp" folder if it does not exist yet
os.makedirs(tempFolder, exist_ok=True)
# Extract the contents of the ODF file to the "temp" folder
with zipfile.ZipFile(fileName, "r") as archive:
    archive.extractall(path=tempFolder)
# Open the "content.xml" file inside the "temp" folder and parse it into an XML tree
contentXml = os.path.join(tempFolder, "content.xml")
doc = etree.parse(contentXml)
removeHeadlines(doc)
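As an aside, the extraction step could probably be skipped by reading content.xml straight out of the archive. The following is only a sketch under that assumption: parse_content_xml is a hypothetical helper name, and it uses the standard library's xml.etree.ElementTree so that it runs without extra packages (lxml's etree.parse accepts a file object in the same way).

```python
import zipfile
import xml.etree.ElementTree as ET


def parse_content_xml(odf_file):
    """Parse content.xml from an ODF archive without unpacking it to disk.

    odf_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(odf_file) as archive:
        with archive.open("content.xml") as handle:
            return ET.parse(handle)
```

With this, no temporary folder is needed, at the cost of not having the other archive members on disk for inspection.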
I remove the headlines via:
def removeHeadlines(doc):
    p2 = doc.findall('.//text:p[@text:style-name="P2"]', namespaces=doc.getroot().nsmap)
    for p in p2:
        p.getparent().remove(p)
Then I create the LaTeX file via:
import pylatex

elementList = oldDoc.findall(
    './/text:p[@text:style-name="Standard"]',
    namespaces=oldDoc.getroot().nsmap)
textList = [element.text for element in elementList]
latexDoc = pylatex.Document()
section = pylatex.section.Section(fileName)
latexDoc.append(section)
paragraph = pylatex.section.Paragraph('')
paragraphIsNew = True
for string in textList:
    if string:
        # If the string is not empty, add it to the paragraph
        paragraph.append(string + "\n")
        paragraphIsNew = False
    else:
        # An empty line closes the current paragraph, unless it is still empty
        if not paragraphIsNew:
            latexDoc.append(paragraph)
            paragraph = pylatex.section.Paragraph('')
            paragraphIsNew = True
latexDoc.append(paragraph)
latexDoc.generate_pdf(newFileName)
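To make the intended grouping explicit, here is a small pylatex-free sketch of the same idea: consecutive non-empty lines are joined, and empty or missing lines act as paragraph separators. group_paragraphs is a hypothetical helper name, not part of my actual script.

```python
from itertools import groupby


def group_paragraphs(lines):
    """Join runs of non-empty lines; empty strings and None separate paragraphs."""
    return [
        "\n".join(chunk)
        for has_text, chunk in groupby(lines, key=bool)
        if has_text
    ]


paragraphs = group_paragraphs(["line 1", "line 2", None, "", "line 3"])
# paragraphs == ["line 1\nline 2", "line 3"]
```

Each resulting string would then correspond to one paragraph in the LaTeX document.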
Unfortunately, some lines don't make it into the generated PDF. In the XML, those lines look like:
<text:p text:style-name="Standard"><text:soft-page-break/>Sad line that doesn't make it,</text:p>
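A minimal reproduction of what seems to happen with such a line, assuming the usual ElementTree text/tail model: the text sits in the tail of the soft-page-break child, so the paragraph's own .text is None. The sketch uses the standard library's xml.etree.ElementTree, which shares this model with lxml, and declares the ODF text namespace inline so that the snippet parses on its own.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
snippet = (
    f'<text:p xmlns:text="{TEXT_NS}" text:style-name="Standard">'
    "<text:soft-page-break/>Sad line that doesn't make it,</text:p>"
)

p = ET.fromstring(snippet)
first_text = p.text                # None: nothing precedes the first child
tail_text = p[0].tail              # the line is stored as the child's tail
full_text = "".join(p.itertext())  # collects text and tails in document order
```

If that is the cause, building the list with "".join(element.itertext()) instead of element.text should pick these lines up as well.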
I tried, unsuccessfully, to remove the soft-page-break elements with:
softPageBreaks = doc.findall('.//text:soft-page-break', namespaces=doc.getroot().nsmap)
# Remove the element, but keep its parent and the text content
for element in softPageBreaks:
    parent = element.getparent()
    text = element.tail
    parent.remove(element)
    parent.text = text
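For comparison, a more defensive way to drop an element while keeping its tail might look like the sketch below. It merges the tail into the preceding sibling's tail, or into the parent's text when the element is the first child, instead of overwriting parent.text. drop_keep_tail is a hypothetical helper, written against the standard library's ElementTree so that it runs without lxml. In lxml itself, etree.strip_elements with with_tail=False appears to cover the same case.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def drop_keep_tail(root, tag):
    """Remove every `tag` element under root, preserving the text in its tail."""
    # Snapshot the elements first so the tree is not modified while iterating
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag != tag:
                continue
            index = list(parent).index(child)
            tail = child.tail or ""
            if index == 0:
                # No earlier sibling: the tail becomes part of the parent's text
                parent.text = (parent.text or "") + tail
            else:
                # Otherwise it is appended to the previous sibling's tail
                previous = parent[index - 1]
                previous.tail = (previous.tail or "") + tail
            parent.remove(child)


p = ET.fromstring(
    f'<text:p xmlns:text="{TEXT_NS}" text:style-name="Standard">'
    "<text:soft-page-break/>Sad line that doesn't make it,</text:p>"
)
drop_keep_tail(p, f"{{{TEXT_NS}}}soft-page-break")
```

After the call, p has no children left and p.text holds the full line.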