Regex - How to capture Markdown H2 title and content in two named capture groups


Hi, I have Markdown text like the one below, and I want to slice it into H2 title and H2 content:

## **Intro**


* bla bla
* bla bla bla


## Tortilla

* chico
* chica

### 1. sub-section

* and another bla.

The regex result should be:

title 1:

Intro

content 1:

* bla bla
* bla bla bla

title 2:

Tortilla

content 2:

* chico
* chica

### 1. sub-section

* and another bla.

I tried this regex:

/^## (?<title>.*)(?<content>.*(?:\n(?!##).*)*)/gm

But it doesn't capture the sub-section content.

Can someone help, please?


There are 2 best solutions below

1
Tibrogargan On

The regex below produces the output desired from the input you have given.

^##\s(?<title>.*)\n(?<content>(?:(?!##\s).*\n?)+)

Explanation of most pieces:

| Piece | Description |
| --- | --- |
| `^##\s` | match lines that begin with an H2 marker |
| `(?<title>.*)\n` | capture everything after the H2 marker and discard the newline |
| `(?:(?!##\s).*\n?)+` | match every line that doesn't start with an H2 marker, including any newlines |

The primary issue with your attempt is that it accounts for the newline after the title by including it in content, when that newline just needs to be discarded.

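The secondary issue is that your lookahead doesn't distinguish H2 lines from other lines that start with ##. (?!##) also rejects lines starting with ###, so content stops at the sub-section. Requiring whitespace after the ## (i.e. ##\s) restricts the check to H2.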
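
To make the lookahead difference concrete, here is a minimal sketch that runs both regexes over a shortened version of your input. The names `original`, `fixed` and `contents` are only illustrative.

```javascript
// Shortened sample input, based on the question.
const markdown = [
  '## **Intro**',
  '',
  '* bla bla',
  '',
  '## Tortilla',
  '',
  '* chico',
  '',
  '### 1. sub-section',
  '',
  '* and another bla.',
].join('\n');

// Original attempt: (?!##) also rejects lines that start with "###".
const original = /^## (?<title>.*)(?<content>.*(?:\n(?!##).*)*)/gm;
// Adjusted lookahead: only "##" followed by whitespace ends the content.
const fixed = /^##\s(?<title>.*)\n(?<content>(?:(?!##\s).*\n?)+)/gm;

// Collect the trimmed content group of every match.
const contents = (regex) =>
  [...markdown.matchAll(regex)].map((match) => match.groups.content.trim());

const withOriginal = contents(original);
const withFixed = contents(fixed);

console.log(withOriginal[1].includes('sub-section')); // false
console.log(withFixed[1].includes('sub-section'));    // true
```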

Note: the newline at the end of each content line is optional (\n?) so that the last line still matches when the input does not end with a newline.
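
A small sketch of why the `?` matters, using a made-up two-line input without a trailing newline. The `strict` variant is hypothetical and only exists for comparison.

```javascript
// Input that does NOT end with a newline.
const markdown = '## Tortilla\n* chico\n* chica';

// Hypothetical variant: every content line must end with "\n".
const strict = /^##\s(?<title>.*)\n(?<content>(?:(?!##\s).*\n)+)/gm;
// Regex from this answer: the trailing "\n" of a line is optional.
const lenient = /^##\s(?<title>.*)\n(?<content>(?:(?!##\s).*\n?)+)/gm;

// Return the content group of the first match.
const firstContent = (regex) =>
  [...markdown.matchAll(regex)][0].groups.content;

console.log(JSON.stringify(firstContent(strict)));  // "* chico\n"
console.log(JSON.stringify(firstContent(lenient))); // "* chico\n* chica"
```

With the strict variant the last line is silently dropped, which is easy to miss when testing only with input that ends in a newline.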

0
Peter Seliger On

Since one needs to build a data structure from the matched group values anyway, one can also choose an approach based on a simple regex like /^##\s+(.*)/gm, which is used for splitting the markdown string. Because the regex contains a capture group, the resulting array alternates between titles and contents. This array then gets reduced into the final result, an array of H2 items, each consisting of a sanitized title value and a sanitized content value:

const markdown = `

# Main Topic

maintopic content

maintopic content


## **Intro**


* bla bla
* bla bla bla


## Tortilla

* chico
* chica

### 1. sub-section

* and another bla.`;


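const result = markdown
  // split at every H2 line; the capture group keeps each title in the array
  // see ... [https://regex101.com/r/WQrkqS/1]
  .split(/^##\s+(.*)/gm)
  // drop everything that precedes the first H2
  .slice(1)
  // the items now alternate: title, content, title, content, ...
  .reduce((sections, title, idx, arr) => {
    if (idx % 2 === 0) {
      // trims leading/trailing characters that are neither letters nor numbers
      // see ... [https://regex101.com/r/WQrkqS/2]
      const regXTrimNonNumbersLetters = /^[^\p{L}\p{N}]+|[^\p{L}\p{N}]+$/gu;

      sections.push({
        title: title.replace(regXTrimNonNumbersLetters, '').trim(),
        content: arr[idx + 1].trim(),
      });
    }
    return sections;
  }, []);

console.log({ result });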
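
The split-based approach relies on `split` keeping captured text in its result. A minimal sketch with a made-up input (the names `parts`, `rest` and `sections` are only illustrative) shows the intermediate array:

```javascript
const markdown = 'preamble\n## One\nfirst body\n## Two\nsecond body';

// A capture group in the separator makes split() keep the captured text.
const parts = markdown.split(/^##\s+(.*)/gm);
console.log(parts);
// [ 'preamble\n', 'One', '\nfirst body\n', 'Two', '\nsecond body' ]

// Drop the text before the first H2, then pair each title with the
// content item that follows it.
const rest = parts.slice(1);
const sections = [];
for (let i = 0; i < rest.length; i += 2) {
  sections.push({ title: rest[i], content: rest[i + 1].trim() });
}
console.log(sections);
```

A plain loop is used here instead of `reduce` only to keep the pairing step easy to follow.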

Edit

An implementation that uses Tibrogargan's regex could look similar to the following:

const markdown = `
# Main Topic

maintopic content

maintopic content


## **Intro**


* bla bla
* bla bla bla


## Tortilla

* chico
* chica

### 1. sub-section

* and another bla.`;

const result = [
  ...markdown
    .matchAll(/^##\s(?<title>.*)\n(?<content>(?:(?!##\s).*\n?)+)/gm)
]
  .map(({ groups: { title, content } }) => ({
    title: title.replace(/^[^\p{L}\p{N}]+|[^\p{L}\p{N}]+$/gu, '').trim(),
    content: content.trim(),
  }));

console.log({ result });
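
Both implementations sanitize the title with the same Unicode-aware regex. The following sketch isolates that step on a few made-up titles. `cleanTitle` is a hypothetical helper name, not part of either answer.

```javascript
// Strip leading and trailing characters that are neither letters nor numbers.
const regXTrimNonNumbersLetters = /^[^\p{L}\p{N}]+|[^\p{L}\p{N}]+$/gu;
const cleanTitle = (raw) => raw.replace(regXTrimNonNumbersLetters, '').trim();

console.log(cleanTitle('**Intro**'));      // Intro
console.log(cleanTitle('Tortilla'));       // Tortilla
console.log(cleanTitle('«Überblick»'));    // Überblick
console.log(cleanTitle('1. sub-section')); // 1. sub-section
console.log(cleanTitle('Why?'));           // Why
```

As the last call shows, trailing punctuation that belongs to the title is stripped as well, so this sanitizing only fits when such characters are not needed in the result.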