Extract the first p tag and the first h1 tag with commonmark


Suppose I have the following markdown:

# Comman mark is **just great**
You can try CommonMark here.  This dingus is powered by
[commonmark.js](https://github.com/commonmark/commonmark.js), the
JavaScript reference implementation.
## Try CommonMark
1. item one
2. item two
   - sublist
   - sublist

I want to get the first h1 tag and the first p tag to use as the post's title and description, respectively.

I cannot use browser APIs, because this runs on a Node server.

To get the first h1 tag, I used commonmark.js.

document.getElementById('btn').addEventListener('click', function (e) {
  let parsed = reader.parse(md);
  let result = writer.render(parsed);

  let walker = parsed.walker();
  let event, node;

  while ((event = walker.next())) {
    node = event.node;

    // h1 tags
    if (event.entering && node.type === 'heading' && node.level == 1) {
      console.log('h1', '--', node?.firstChild?.literal);
    }

    // p tags
    if (event.entering && node.type === 'text') {
      console.log('p', '--', node?.literal);
    }
  }
});

For the above markdown, I got the following output on the console (the original screenshot is not preserved here).

You can see that the first h1 returned is `Comman mark is`, but it should actually be `# Comman mark is **just great**`.

The same thing happens for the p tag. How can I solve this problem?

See live - https://stackblitz.com/edit/js-vegggl?file=index.js


There are 2 best solutions below


Since you are already in the Node.js world, I suggest you check out the unified collective's remark and rehype processors. remark parses markdown and rehype parses HTML, each to and from a syntax tree. Processors in the unified collective support custom and third-party plugins that let you inspect and manipulate the intermediate syntax trees. It is powerful, though there is a bit of a learning curve. Regular expressions eventually break down on non-regular languages like markdown, and syntax trees handle those cases.
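To make the syntax-tree idea concrete, here is a minimal, dependency-free sketch. It does not use remark itself. It assumes nodes shaped like the ones commonmark.js produces (`type`, `literal`, `firstChild`, `next`), and the helper names `textOf`, `findFirst`, and `makeNode` are made up for illustration. It shows why reading only `firstChild.literal` of a heading drops the bold part: the bold text lives in a nested `strong` node, so all descendants have to be collected.

```javascript
// Assumption: nodes expose `type`, `literal`, `firstChild`, and `next`,
// the shape commonmark.js nodes have. All helper names are hypothetical.

// Concatenate the text of every inline descendant of a node.
function textOf(node) {
  let out = "";
  for (let child = node.firstChild; child; child = child.next) {
    if (child.type === "softbreak" || child.type === "linebreak") out += " ";
    else if (child.literal != null) out += child.literal;
    else out += textOf(child); // containers such as strong, emph, link
  }
  return out;
}

// Depth-first search for the first node that satisfies `matches`.
function findFirst(node, matches) {
  if (matches(node)) return node;
  for (let child = node.firstChild; child; child = child.next) {
    const hit = findFirst(child, matches);
    if (hit) return hit;
  }
  return null;
}

// Hand-built stand-in for a parsed tree, so the sketch runs on its own.
function makeNode(type, props = {}, children = []) {
  const node = { type, literal: null, firstChild: null, next: null, ...props };
  children.forEach((child, i) => {
    if (i === 0) node.firstChild = child;
    else children[i - 1].next = child;
  });
  return node;
}

const doc = makeNode("document", {}, [
  makeNode("heading", { level: 1 }, [
    makeNode("text", { literal: "Comman mark is " }),
    makeNode("strong", {}, [makeNode("text", { literal: "just great" })]),
  ]),
  makeNode("paragraph", {}, [
    makeNode("text", { literal: "You can try CommonMark here." }),
    makeNode("softbreak"),
    makeNode("text", { literal: "JavaScript reference implementation." }),
  ]),
]);

const h1 = findFirst(doc, (n) => n.type === "heading" && n.level === 1);
const p = findFirst(doc, (n) => n.type === "paragraph");

console.log(h1 ? textOf(h1) : ""); // Comman mark is just great
console.log(p ? textOf(p) : ""); // You can try CommonMark here. JavaScript reference implementation.
```

With the real library, the tree returned by `reader.parse(md)` from the question should be usable in place of the hand-built `doc`. Note that this yields plain text, without the `#` and `**` markers.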

Alternatively, a regex-based approach that scans the markdown line by line:

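// Line-level patterns. Block markers are anchored to the start of the line
// so that a ">" or "*" in the middle of a sentence does not count as a match.
const regex = {
  title: /^#\s+.+/,
  heading: /^#+\s+.+/,
  custom: /\$\$\s*\w+/,
  ol: /^\s*\d+\.\s+.*/,
  ul: /^\s*[-*+]\s+.*/,
  task: /^\s*[-*+]\s+\[.]\s+.*/,
  blockQuote: /^\s*>.*/,
  table: /^\s*\|.*/,
  image: /!\[.+\]\(.+\).*/,
  url: /\[.+\]\(.+\).*/,
  codeBlock: /^\s*`{3}\w*.*/,
};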

const isTitle = (str) => regex.title.test(str);
const isHeading = (str) => regex.heading.test(str);
const isCustom = (str) => regex.custom.test(str);
const isOl = (str) => regex.ol.test(str);
const isUl = (str) => regex.ul.test(str);
const isTask = (str) => regex.task.test(str);
const isBlockQuote = (str) => regex.blockQuote.test(str);
const isImage = (str) => regex.image.test(str);
const isUrl = (str) => regex.url.test(str);
const isCodeBlock = (str) => regex.codeBlock.test(str);

export function getMdTitle(md) {
  if (!md) return "";
  let tokens = md.split("\n");
  for (let i = 0; i < tokens.length; i++) {
    if (isTitle(tokens[i])) return tokens[i];
  }
  return "";
}

export function getMdDescription(md) {
  if (!md) return "";
  // Line types that cannot serve as the description.
  const skip = [
    isHeading,
    isCustom,
    isOl,
    isUl,
    isTask,
    isBlockQuote,
    isImage,
    isUrl,
    isCodeBlock,
  ];
  for (const line of md.split("\n")) {
    // Skip blank lines too; otherwise a leading empty line is returned.
    if (line.trim() === "" || skip.some((test) => test(line))) continue;
    return line;
  }
  return "";
}
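
One limitation of the line-by-line approach is that a markdown paragraph can span several lines, as the one in the question does, while `getMdDescription` returns a single line. The following is a rough sketch of one way to gather the whole first paragraph. `getMdFirstParagraph` and `startsBlock` are hypothetical names, and the block detection is deliberately simplified: it does not track fenced code block contents, thematic breaks, or other CommonMark details.

```javascript
// Hypothetical, simplified check for lines that start a non-paragraph block:
// headings, block quotes, code fences, list items, and table rows.
const startsBlock = (line) =>
  /^\s*(#{1,6}\s|>|`{3}|~{3}|[-*+]\s|\d+[.)]\s|\|)/.test(line);

// Collect consecutive plain lines until a blank line or a new block begins.
function getMdFirstParagraph(md) {
  if (!md) return "";
  const collected = [];
  for (const line of md.split("\n")) {
    const boundary = line.trim() === "" || startsBlock(line);
    if (collected.length === 0) {
      if (boundary) continue; // still looking for the paragraph start
    } else if (boundary) {
      break; // the paragraph has ended
    }
    collected.push(line.trim());
  }
  return collected.join(" ");
}

const sample = "# Title\nFirst line\nsecond line\n## Next\n1. item";
console.log(getMdFirstParagraph(sample)); // First line second line
```

Unlike `getMdDescription`, this sketch keeps lines that contain links, since a link in the middle of a paragraph is still part of it.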