Regex - How to capture Markdown H2 title and content in two named capture groups

75 Views Asked by N. Slate At 28 February 2024 at 00:15

Hi, I have Markdown text like the sample below, and I want to split it into each H2 title and that H2's content.

## **Intro**


* bla bla
* bla bla bla


## Tortilla

* chico
* chica

### 1. sub-section

* and another bla.

The regex result should be:

title 1:

Intro

content 1:

* bla bla
* bla bla bla

title 2:

Tortilla

content 2:

* chico
* chica

### 1. sub-section

* and another bla.

I tried this regex:

/^## (?<title>.*)(?<content>.*(?:\n(?!##).*)*)/gm

But it doesn't capture the sub-section content.

Can someone help please?


There are 2 best solutions below

Tibrogargan On 28 February 2024 at 00:54

The regex below produces the output desired from the input you have given.

^##\s(?<title>.*)\n(?<content>(?:(?!##\s).*\n?)+)

Explanation of most pieces:

| Piece | Description |
| --- | --- |
| `^##\s` | match lines that begin with H2 |
| `(?<title>.*)\n` | capture everything after the H2 marker and discard the newline |
| `(?:(?!##\s).*\n?)+` | match every line that doesn't start with H2, including any newlines |

The primary issue with your attempt is that you are trying to account for the newline after the title by including it in content, when it just needs to be discarded.

The secondary issue is that your lookahead does not distinguish H2 lines from other lines that start with ## (such as an H3, which starts with ###). For a line to be an H2, there must be whitespace directly after the ##.
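As a rough illustration of that second point, the sketch below runs both patterns over a shortened copy of the question's sample. The variable names (`sample`, `loose`, `strict`) are made up for this example; only the two regexes come from the thread.

```javascript
// Shortened version of the sample input from the question.
const sample = [
  '## **Intro**',
  '',
  '* bla bla',
  '',
  '## Tortilla',
  '',
  '* chico',
  '',
  '### 1. sub-section',
  '',
  '* and another bla.',
].join('\n');

// Regex from the question: (?!##) rejects every line starting with "##",
// and a "###" heading starts with "##" too.
const loose = /^## (?<title>.*)(?<content>.*(?:\n(?!##).*)*)/gm;

// Regex from this answer: (?!##\s) rejects only "##" followed by whitespace.
const strict = /^##\s(?<title>.*)\n(?<content>(?:(?!##\s).*\n?)+)/gm;

const looseContents = [...sample.matchAll(loose)].map((m) => m.groups.content);
const strictContents = [...sample.matchAll(strict)].map((m) => m.groups.content);

// Does the content of the second H2 ("Tortilla") include the H3 sub-section?
console.log(looseContents[1].includes('sub-section'));  // false
console.log(strictContents[1].includes('sub-section')); // true
```

Both patterns find two H2 sections here; they differ only in where the second section's content stops.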

Note: the optional new line (\n?) at the end of content is required when the input does not end with a new line
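To see what that note means in practice, here is a small sketch. The input string and the `withMandatory` variant are hypothetical and exist only for comparison; `withOptional` is the regex from this answer.

```javascript
// Hypothetical input whose last line has no trailing newline.
const input = '## Tortilla\n* chico\n* chica';

// Regex from the answer: the newline after each content line is optional.
const withOptional = /^##\s(?<title>.*)\n(?<content>(?:(?!##\s).*\n?)+)/gm;

// Comparison-only variant: every content line must end with "\n".
const withMandatory = /^##\s(?<title>.*)\n(?<content>(?:(?!##\s).*\n)+)/gm;

// Return the content group of the first match.
const contentOf = (regex) => [...input.matchAll(regex)][0].groups.content;

console.log(JSON.stringify(contentOf(withOptional)));  // "* chico\n* chica"
console.log(JSON.stringify(contentOf(withMandatory))); // "* chico\n"
```

With the mandatory newline, the final line is silently dropped from the capture because it is not followed by `\n`.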

Peter Seliger On 28 February 2024 at 16:51

Since you need to build a data structure from the matched group values anyway, you can also choose an approach based on a simple regex like /^##\s+(.*)/gm, used to split the Markdown string. The resulting array is then reduced into the final result: an array of all the H2-related items, each consisting of a sanitized title value and a sanitized content value ...

const markdown = `

# Main Topic

maintopic content

maintopic content


## **Intro**


* bla bla
* bla bla bla


## Tortilla

* chico
* chica

### 1. sub-section

* and another bla.`;


const result = markdown

  // see ... [https://regex101.com/r/WQrkqS/1]
  .split(/^##\s+(.*)/gm)
  .slice(1) // drop everything that precedes the first H2

  .reduce((result, title, idx, arr) => {
    if (idx % 2 === 0) {
      // see ... [https://regex101.com/r/WQrkqS/2]
      const regXTrimNonNumbersLetters = /^[^\p{L}\p{N}]+|[^\p{L}\p{N}]+$/gu;

      result
        .push({
          title: title.replace(regXTrimNonNumbersLetters, '').trim(),
          content: arr[idx + 1].trim(),
        });
    }
    return result;
  }, []);

console.log({ result });
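The approach relies on how `String.prototype.split` treats a pattern that contains a capture group: the captured text is kept in the output array, between the pieces it separated. A minimal sketch with a made-up two-section input (`text`, `parts`, and `pairs` are names chosen for this example):

```javascript
// Hypothetical two-section input, kept short so the array is easy to read.
const text = 'preamble\n## One\nfirst body\n## Two\nsecond body';

// Because the pattern has a capture group, split() keeps each captured
// title in the result, right before the text that followed that heading.
const parts = text.split(/^##\s+(.*)/gm);

console.log(parts);
// [ 'preamble\n', 'One', '\nfirst body\n', 'Two', '\nsecond body' ]

// Dropping the first element leaves [title, content, title, content, ...],
// so titles sit at even indexes and their content at the following odd index.
const pairs = parts.slice(1);
console.log(pairs.length); // 4
```

That is why the code above discards the first array element and builds an item only at even indexes.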


Edit

An implementation that uses Tibrogargan's regex could look similar to the following one ...

const markdown = `
# Main Topic

maintopic content

maintopic content


## **Intro**


* bla bla
* bla bla bla


## Tortilla

* chico
* chica

### 1. sub-section

* and another bla.`;

const result = [
  ...markdown
    .matchAll(/^##\s(?<title>.*)\n(?<content>(?:(?!##\s).*\n?)+)/gm)
]
.map(({ groups: { title, content } }) => ({

  title: title.replace(/^[^\p{L}\p{N}]+|[^\p{L}\p{N}]+$/gu, '').trim(),
  content: content.trim(),

}));

console.log({ result });

