Skip processing fenced code blocks when processing Markdown files line by line


I'm a very inexperienced Python coder, so it's quite possible that I'm approaching this problem in completely the wrong way, but I'd appreciate any suggestions or help.

I have a Python script that goes through a Markdown file line by line and rewrites [[wikilinks]] as standard Markdown [wikilink](wikilink) style links. I'm doing this using two regexes in one function as shown below:

import logging
import re


def modify_links(file_obj):
    """
    Function will parse file contents (opened in utf-8 mode) and modify standalone [[wikilinks]] and in-line
    [[wikilinks]](wikilinks) into traditional Markdown link syntax.

    :param file_obj: Path to file
    :return: List object containing modified text. Newlines will be returned as '\n' strings.
    """

    file = file_obj
    linelist = []
    linelist_final = []  # Stays empty if the file cannot be read
    logging.debug("Going to open file %s for processing now.", file)
    try:
        with open(file, encoding="utf8") as infile:
            for line in infile:
                linelist.append(re.sub(r"(\[\[)((?<=\[\[).*(?=\]\]))(\]\])(?!\()", r"[\2](\2.md)", line))
                # Finds references that are in style [[foo]] only by excluding links in style [[foo]](bar).
                # Capture group $2 returns just foo
            # Second pass, run once after every line has been through the first substitution
            linelist_final = [re.sub(r"(\[\[)((?<=\[\[)\d+(?=\]\]))(\]\])(\()((?!=\().*(?=\)))(\))",
                                     r"[\2](\2 \5.md)", line) for line in linelist]
            # Finds only references in style [[foo]](bar). Capture group $2 returns foo and capture group $5
            # returns bar
    except EnvironmentError:
        logging.exception("Unable to open file %s for reading", file)
    logging.debug("Finished processing file %s", file)
    return linelist_final

This works fine for most Markdown files. However, I can occasionally get a Markdown file that has [[wikilinks]] within fenced code blocks such as the following:

# Reference

Here is a reference to “the Reactome Project” using smart quotes.

Here is an image: ![](./images/Screenshot.png)


[[201802150808]](Product discovery)

```
[[201802150808 Product Prioritization]]

def foo():
    print("bar")

```

In the above case, I should skip processing the [[201802150808 Product Prioritization]] inside the fenced code block. I have a regex that correctly identifies the fenced code block, namely:

(?<=```)(.*?)(?=```)

However, since the existing function is running line by line, I have not been able to figure out a way to skip the entire section in the for loop. How do I go about doing this?
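For reference, here is a minimal sketch of one way to keep the line-by-line loop and still skip fenced blocks: carry a flag that flips every time a fence delimiter is seen. It assumes that fences open and close with three backticks at the start of a line (ignoring leading whitespace), and it uses a simplified pattern for standalone [[foo]] links rather than the two regexes above. The names `rewrite_outside_fences`, `WIKILINK` and `FENCE` are made up for this illustration.

```python
import re

FENCE = "`" * 3  # three backticks, the fence delimiter assumed here

# Simplified pattern (an assumption, not the regexes from the question):
# a standalone [[foo]] that is not immediately followed by "(".
WIKILINK = re.compile(r"\[\[([^\]]+)\]\](?!\()")


def rewrite_outside_fences(lines):
    """Rewrite [[foo]] as [foo](foo.md), leaving fenced code blocks untouched."""
    in_fence = False
    out = []
    for line in lines:
        if line.lstrip().startswith(FENCE):
            # A fence delimiter toggles the state and is copied through as-is
            in_fence = not in_fence
            out.append(line)
        elif in_fence:
            # Inside a fenced block: copy the line unchanged
            out.append(line)
        else:
            out.append(WIKILINK.sub(r"[\1](\1.md)", line))
    return out
```

The function accepts any iterable of lines, so an open file object can be passed in directly. It does not handle indented code blocks, inline code spans, or fences written with tildes.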


There are 2 best solutions below

3
On

You need to use a full Markdown parser to be able to cover all of the edge cases. Of course, most Markdown parsers convert Markdown directly to HTML. However, a few use a two-step process in which step one converts the raw text to an Abstract Syntax Tree (AST) and step two renders the AST to the output format. It is not uncommon to find a Markdown renderer (one that outputs Markdown) which can replace the default HTML renderer.

You would simply need to modify either the parser step (using a plugin to add support for the wikilink syntax) or modify the AST directly. Then pass the AST to a Markdown renderer, which will give you a nicely formatted and normalized Markdown document. If you are looking for a Python solution, mistune or Pandoc filters might be a good place to start.

But why go through all that when a few well-crafted regular expressions can be run on the source text? Because Markdown parsing is complicated. I know, it seems easy at first. After all, Markdown is easy for a human to read (which was one of its defining design goals). However, parsing is actually very complicated, with parts of the parser reliant on previous steps.

For example, in addition to fenced code blocks, what about indented code blocks? But you can't just check for indentation at the beginning of a line, because a single line of a nested list could look identical to an indented code block. You want to skip the code block, but not the paragraph nested in a list. And what if your wikilink is broken across two lines? Generally when parsing inline markup, Markdown parsers will treat a single line break no different than a space. The point of all of this is that before you can start parsing inline elements, the entire document needs to first be parsed into its various block-level elements. Only then can you step through those and parse inline elements like links.
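To make the nested-list ambiguity concrete, here is a small sketch. The helper `looks_like_indented_code` is a made-up name for the naive "four spaces or a tab" check, and the sample lines are invented for the illustration.

```python
def looks_like_indented_code(line):
    """Naive check (hypothetical helper): four leading spaces or a tab means 'code'."""
    return line.startswith("    ") or line.startswith("\t")


# A nested list: the third line is a list item, not code
nested_list = [
    "* Level 1",
    "  * Level 2",
    "    * [[wikilink]] in a nested list item",
]

# A paragraph followed by a real indented code block
code_block = [
    "Some paragraph.",
    "",
    "    [[wikilink]] in an indented code block",
]

# The naive check flags one line in each sample, although only the
# second sample actually contains a code block.
flagged_in_list = [line for line in nested_list if looks_like_indented_code(line)]
flagged_in_code = [line for line in code_block if looks_like_indented_code(line)]
```

Both flagged lines look the same to a per-line check. Telling them apart requires knowing which block each line belongs to, which is the block-level parsing step described above.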

I'm sure there are other edge cases I haven't thought of. The only way to cover them all is to use a full-fledged Markdown parser.

0
On

I was able to create a reasonably complete solution to this problem by making a few changes to my original function, namely:

  • Replace Python's built-in re module with the regex module available on PyPI.
  • Change the function to read the entire file into a single string instead of reading it line by line.

The revised function is as follows:

import logging

import regex


def modify_links(file_obj):
    """
    Function will parse file contents (opened in utf-8 mode) and modify standalone [[wikilinks]] and in-line
    [[wikilinks]](wikilinks) into traditional Markdown link syntax.

    :param file_obj: Path to file
    :return: String containing modified text. Newlines will be returned as '\\n' in the string.
    """

    file = file_obj
    linelist_final = ""  # Returned as-is if the file cannot be read
    try:
        with open(file, encoding="utf8") as infile:
            line = infile.read()
            # Read the entire file as a single string
            linelist = regex.sub(r"(?V1)"
                                 r"(?s)```.*?```(*SKIP)(*FAIL)(?-s)|(?s)`.*?`(*SKIP)(*FAIL)(?-s)"
                                 # Ignore fenced & inline code blocks. V1 engine allows in-line flags so
                                 # we enable newline matching only here.
                                 r"|(\ {4}|\t).*(*SKIP)(*FAIL)"
                                 # Ignore code blocks beginning with 4 spaces/1 tab
                                 r"|(\[\[(.*)\]\](?!\s\(|\())", r"[\3](\3.md)", line)
            # Finds references that are in style [[foo]] only by excluding links in style [[foo]](bar) or
            # [[foo]] (bar). Capture group $3 returns just foo
            linelist_final = regex.sub(r"(?V1)"
                                       r"(?s)```.*?```(*SKIP)(*FAIL)(?-s)|(?s)`.*?`(*SKIP)(*FAIL)(?-s)"
                                       r"|(\ {4}|\t).*(*SKIP)(*FAIL)"
                                       # Refer comments above for this portion.
                                       r"|(\[\[(\d+)\]\](\s\(|\()(.*)(?=\))\))", r"[\3](\3 \5.md)", linelist)
            # Finds only references in style [[123]](bar) or [[123]] (bar). Capture group $3 returns 123 and
            # capture group $5 returns bar
    except EnvironmentError:
        logging.exception("Unable to open file %s for reading", file)
    return linelist_final

The above function handles [[wikilinks]] in inline code blocks, fenced code blocks and code blocks indented with 4 spaces. There is currently one false-positive scenario in which it ignores a valid [[wikilink]]: when the link appears at the 3rd level or deeper of a Markdown list. Those lines are indented by four or more spaces, so the indented-code rule skips them, e.g.:

* Level 1
  * Level 2
    * [[wikilink]] #Not recognized
      * [[wikilink]] #Not recognized.

However, my documents do not have wikilinks nested at that level in lists, so it's not a problem for me.
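For anyone who would rather stay with the standard library, a similar skip-or-replace effect can be sketched with `re` and a replacement function. This swaps a different technique in for the `(*SKIP)(*FAIL)` verbs used above: code spans are listed first in the alternation so they are consumed before a wikilink can match inside them, and the callback returns them unchanged. The sketch covers only fenced blocks, inline code, and standalone [[foo]] links. It does not cover indented code blocks or the [[123]](bar) form, and the names `PATTERN`, `_replace` and `rewrite_wikilinks` are made up for the illustration.

```python
import re

FENCE = "`" * 3  # three backticks

# Alternatives are tried left to right at each position, so code is matched
# (and thereby skipped) before the wikilink alternative gets a chance.
PATTERN = re.compile(
    FENCE + r".*?" + FENCE                # fenced code block, may span lines
    + r"|`[^`\n]*`"                       # inline code span on a single line
    + r"|\[\[([^\]\n]+)\]\](?!\s?\()",    # standalone [[foo]], captured in group 1
    re.DOTALL,
)


def _replace(match):
    """Return code matches unchanged; rewrite wikilink matches."""
    target = match.group(1)
    if target is None:
        # One of the code alternatives matched: leave it exactly as found
        return match.group(0)
    return f"[{target}]({target}.md)"


def rewrite_wikilinks(text):
    """Rewrite standalone [[foo]] as [foo](foo.md) outside fenced and inline code."""
    return PATTERN.sub(_replace, text)
```

Because the whole file is processed as one string, `rewrite_wikilinks(infile.read())` would take the place of the first `regex.sub` call. An unclosed fence would not be matched by the first alternative, so the text after it would still be rewritten.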