I want to generate HTML from a PDF or Word document. The document contains bulleted lists, and some of those lists contain other, nested bulleted lists. I want to convert these lists to HTML, but when I extract the content of the document, I only get raw text without the original structure or the bullets. I need a way to identify the bullet items in the document and their depth.

Thanks for your help.

There is 1 best solution below

Take a look at the python-docx library for working with Word documents:

https://python-docx.readthedocs.io/en/latest/

There is some discussion of nested bullet points in "Bullet Lists in python-docx". That thread is about creating lists rather than reading them, but it should be possible to parse an existing document using the same principles.
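To make the reading side concrete, here is a minimal sketch of the idea. It is not python-docx code: it reads the document's XML directly with the standard library, relying on the fact that a .docx file is a zip archive and that list paragraphs carry a `w:numPr` element whose `w:ilvl` child holds the zero-based nesting level. The function names `read_list_items` and `to_html` are my own, and the sketch makes some simplifying assumptions: it ignores lists whose numbering comes only from a paragraph style, it does not tell bulleted lists from numbered ones (that information lives in `word/numbering.xml`), and it renders every list as `<ul>`.

```python
import zipfile
import xml.etree.ElementTree as ET
from html import escape

# WordprocessingML namespace, in ElementTree's "{uri}tag" form.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_list_items(docx_path):
    """Return (depth, text) per paragraph; depth is None for non-list text."""
    with zipfile.ZipFile(docx_path) as z:
        root = ET.fromstring(z.read("word/document.xml"))
    items = []
    for p in root.iter(W + "p"):
        text = "".join(t.text or "" for t in p.iter(W + "t"))
        num_pr = p.find(f"{W}pPr/{W}numPr")
        if num_pr is None:
            items.append((None, text))
            continue
        ilvl = num_pr.find(W + "ilvl")
        # A missing w:ilvl is treated as the top level (assumption).
        depth = int(ilvl.get(W + "val")) if ilvl is not None else 0
        items.append((depth, text))
    return items


def to_html(items):
    """Render (depth, text) pairs as nested <ul> lists and <p> paragraphs."""
    out = []
    open_depth = -1  # -1 means no list is currently open
    for depth, text in items:
        target = -1 if depth is None else depth
        while open_depth > target:  # close lists deeper than this item
            out.append("</li></ul>")
            open_depth -= 1
        if target == -1:
            if text.strip():
                out.append(f"<p>{escape(text)}</p>")
            continue
        if open_depth == target:  # sibling item: close the previous <li>
            out.append("</li>")
        while open_depth < target:  # open lists until we reach this depth
            out.append("<ul>")
            open_depth += 1
            if open_depth < target:
                out.append("<li>")  # wrapper for a skipped level
        out.append(f"<li>{escape(text)}")
    while open_depth > -1:  # close whatever is still open at the end
        out.append("</li></ul>")
        open_depth -= 1
    return "".join(out)
```

Something like `print(to_html(read_list_items("report.docx")))` would then print the HTML, where `report.docx` is a placeholder file name. As far as I know python-docx gives access to the same elements through each paragraph's underlying XML, so the depth logic should carry over if you prefer to stay within that library.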

There are various libraries out there for working with PDFs but I’ve heard good things about borb: https://github.com/jorisschellekens/borb