Is there a way to read the contents of a PDF or Word document in Python while keeping its structure (level and depth of bulleted lists)?

432 Views Asked by Guiffou Joel At 28 June 2025 at 23:03

I want to generate HTML from a PDF or Word document. The document contains bulleted lists, and some of those lists contain other, nested bulleted lists. I want to convert these lists to HTML, but when I extract the content of the document I only get raw text, without the original structure or the bullets. I need a way to identify the bullet items in the document and their depth.

Thanks for your help.


There is 1 best solution below

ljdyer On 07 December 2021 at 04:44

Take a look at the python-docx library for working with Word documents:

https://python-docx.readthedocs.io/en/latest/

There is some discussion of nested bullet points in "Bullet Lists in python-docx". It is about creating lists rather than reading them, but it should be possible to parse an existing document using the same principles.
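To make the "same principles" a bit more concrete, here is a rough sketch of the reading side. It does not use python-docx itself: it reads the underlying `word/document.xml` with the standard library and looks for the `w:numPr` / `w:ilvl` elements that Word normally writes for list paragraphs. The function names (`paragraph_levels`, `read_docx`, `to_html`) are made up for this example, and there are some simplifying assumptions: list formatting inherited only from a paragraph style is not detected, bulleted and numbered lists are not distinguished (that would need `word/numbering.xml`), and every list is emitted as `<ul>`.

```python
import html
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used throughout word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def paragraph_levels(document_xml):
    """Return a list of (level, text) pairs, one per paragraph.

    level is the 0-based list depth, or None for a non-list paragraph.
    Assumption: only numbering set directly on the paragraph is detected.
    """
    root = ET.fromstring(document_xml)
    items = []
    for p in root.iter(f"{{{W}}}p"):
        # Concatenate all text runs of the paragraph
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        num_pr = p.find("w:pPr/w:numPr", NS)
        if num_pr is None:
            level = None
        else:
            ilvl = num_pr.find("w:ilvl", NS)
            # A missing w:ilvl is treated as the top level
            level = int(ilvl.get(f"{{{W}}}val")) if ilvl is not None else 0
        items.append((level, text))
    return items


def read_docx(path):
    """Read (level, text) pairs from a .docx file on disk."""
    with zipfile.ZipFile(path) as archive:
        return paragraph_levels(archive.read("word/document.xml"))


def to_html(items):
    """Turn (level, text) pairs into nested <ul>/<li> markup."""
    out = []
    depth = 0  # number of <ul> elements currently open
    for level, text in items:
        target = 0 if level is None else level + 1
        if target > depth:
            # Going deeper: open new lists inside the current item
            while depth < target:
                out.append("<ul>")
                depth += 1
                if depth < target:
                    out.append("<li>")  # wrapper item for a skipped level
        else:
            # Going back up: close deeper lists, then the previous sibling
            while depth > target:
                out.append("</li></ul>")
                depth -= 1
            if depth > 0:
                out.append("</li>")
        if level is None:
            out.append(f"<p>{html.escape(text)}</p>")
        else:
            out.append(f"<li>{html.escape(text)}")
    # Close whatever is still open at the end of the document
    while depth > 0:
        out.append("</li></ul>")
        depth -= 1
    return "".join(out)
```

With a real file you would call something like `to_html(read_docx("example.docx"))`, where `example.docx` is just a placeholder path. Keeping the extraction and the HTML step separate means the `(level, text)` pairs could also come from another source, for example a PDF parser that infers depth from indentation.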

There are various libraries for working with PDFs, but I've heard good things about borb: https://github.com/jorisschellekens/borb
