I am trying to parse a PDF and categorize its content based on text formatting/decoration. How do you suggest I do that?
For example, I have a PDF in which the following structure is repeated:
S.No. BOLD+UNDERLINED TITLE para
How do I categorize this data into an array of objects based on text decoration:
[
{ sno: "", title: "", desc: "" },
...
]
I went through the documentation for pdf2json and figured that I might have to use the pdfData.formImage.Pages[pageNumber].Texts[wordNumber].R[0] object after parsing the PDF to get hold of the values I need. The property TS of that object is an array; the value at TS[2] indicates whether the text is bold (value = 1) or not (value = 0). I could not find any details on data related to underline text-decoration.
I also needed to initialize the parser as follows:
let pdfParser = new PDFParser(null, 1)
Check this for more details.
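To make the goal concrete, here is a rough sketch of the grouping I have in mind, using only the bold flag at TS[2] described above. Everything else is an assumption on my part: the function name groupByBold and the sample data are made up, I assume each record starts with a non-bold run that looks like a serial number ("1" or "1."), that the bold runs right after it form the title, and that the text in R[0].T is URI-encoded. Underline is not handled at all, since I could not find where pdf2json exposes it.

```javascript
// Sketch: group pdf2json text runs into { sno, title, desc } records.
// Assumptions (hypothetical, not confirmed by the pdf2json docs):
//  - a record starts with a non-bold run such as "1" or "1."
//  - bold runs that follow, before any description text, form the title
//  - R[0].T is URI-encoded, so it is decoded before use
function groupByBold(texts) {
  const records = [];
  let current = null;

  for (const text of texts) {
    const run = text.R[0];
    const value = decodeURIComponent(run.T).trim();
    if (value === "") continue;

    const isBold = run.TS[2] === 1; // TS[2]: 1 = bold, 0 = not bold

    if (!isBold && /^\d+\.?$/.test(value)) {
      // A serial number opens a new record.
      current = { sno: value, title: "", desc: "" };
      records.push(current);
    } else if (current === null) {
      // Ignore anything that appears before the first serial number.
      continue;
    } else if (isBold && current.desc === "") {
      // Bold text before the description belongs to the title.
      current.title = (current.title + " " + value).trim();
    } else {
      // Everything else (including bold words inside the paragraph) is description.
      current.desc = (current.desc + " " + value).trim();
    }
  }
  return records;
}

// Made-up sample shaped like Pages[pageNumber].Texts.
const sampleTexts = [
  { R: [{ T: "1.", TS: [0, 12, 0, 0] }] },
  { R: [{ T: "First%20title", TS: [0, 12, 1, 0] }] },
  { R: [{ T: "Some%20description", TS: [0, 12, 0, 0] }] },
  { R: [{ T: "text.", TS: [0, 12, 0, 0] }] },
  { R: [{ T: "2.", TS: [0, 12, 0, 0] }] },
  { R: [{ T: "Second", TS: [0, 12, 1, 0] }] },
  { R: [{ T: "title", TS: [0, 12, 1, 0] }] },
  { R: [{ T: "More%20text.", TS: [0, 12, 0, 0] }] },
];

console.log(groupByBold(sampleTexts));
```

With a real file, the input would be pdfData.formImage.Pages[pageNumber].Texts for each page. One known weakness of this heuristic: a non-bold run inside a paragraph that consists only of a number would be mistaken for the start of a new record.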