After extracting text from PDF files using pdftotext, I am trying to recover some of their titles and the respective contents.
The files in this batch share a pattern: a new line, followed by a Roman numeral, optionally followed by a dot or hyphen, then the title, followed by a line break.
So I tried this pattern:
^[^\S\n]*([CLXVI]{1,7})\.\s?(.*?)\n([\S\s]*)(?=[CLXVI]{1,7})
But it did not work as expected:
https://regex101.com/r/vX4aB4/1
The expected result was something like:
group title -> Breve Síntese da Demanda
group content -> Lorem ipsum dolor ... faucibus.
group title -> Bla Bla bla
group content -> Lorem ipsum dolor ... faucibus.
group title -> Do Mérito
group content -> Lorem ipsum dolor ... commodo.
group title -> Conclusão
group content -> Lorem ipsum dolor ... .
So how can I improve the pattern to properly recover each title and its respective content?
You can use a negative lookahead to prevent the match from skipping over the next title.
See your updated demo at regex101. Use it in (?m) multiline mode.
The relevant part, (?!(?1)), prevents the match from skipping over anything that matches the first group's pattern.
This is a PCRE regex; it uses a group reference and a possessive quantifier.
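For readers not using PCRE, here is a minimal sketch of the same negative-lookahead idea in Python. It is not the pattern from the demo: Python's built-in re module does not support (?1)-style group references, so the heading pattern is written once as a string and reused inside the lookahead. The names HEADING, NOT_HEADING, SECTION, SAMPLE and extract_sections are made up for this illustration, and the sample text is invented. The sketch also assumes a word boundary after the Roman numeral, so that ordinary lines such as "Lorem ..." are not taken for headings; a content line that begins with a standalone numeral-like word would still be misread as a title.

```python
import re

# Heading prefix: optional indentation, a Roman numeral, then an optional
# dot or hyphen surrounded by optional horizontal whitespace.
# The \b keeps words like "Lorem" or "Vivamus" from counting as numerals.
HEADING = r'^[^\S\n]*[CLXVI]{1,7}\b[^\S\n]*[.\-]?[^\S\n]*'

# Negative lookahead: succeeds only where the current line is NOT a heading.
NOT_HEADING = r'(?!' + HEADING + r')'

SECTION = re.compile(
    HEADING
    + r'(?P<title>[^\n]*)\n'
    # Consume whole lines for as long as they do not start a new heading,
    # plus an optional final line that has no trailing newline.
    + r'(?P<content>(?:' + NOT_HEADING + r'[^\n]*\n)*'
    + r'(?:' + NOT_HEADING + r'[^\n]+)?)',
    re.MULTILINE,
)


def extract_sections(text):
    """Return a list of (title, content) pairs, one per heading found."""
    return [
        (m.group('title').strip(), m.group('content').strip())
        for m in SECTION.finditer(text)
    ]


# Invented sample in the shape described in the question.
SAMPLE = (
    "I. Breve Síntese da Demanda\n"
    "Lorem ipsum dolor sit amet.\n"
    "Vivamus faucibus.\n"
    "II - Bla Bla bla\n"
    "Curabitur commodo.\n"
    "III. Do Mérito\n"
    "Integer nec odio.\n"
    "IV. Conclusão\n"
    "Donec sed."
)

if __name__ == "__main__":
    for title, content in extract_sections(SAMPLE):
        print("group title ->", title)
        print("group content ->", content)
```

Because the lookahead is checked at the start of every line, the content group stops right before the next heading instead of running on to the last one, which is what the greedy [\S\s]* in the original pattern did.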