Python comment-preserving parsing using only builtin libraries?


I wrote a library using just the ast and inspect libraries to parse and emit internal Python constructs [it uses astor on Python < 3.9].

I've just realised that I really need to preserve comments after all, preferably without resorting to RedBaron or LibCST, since I only need to emit the unaltered commentary. Is there a clean and concise way of comment-preserving parsing/emitting of Python source with just the stdlib?

2 Answers

Best answer

What I ended up doing was writing a simple parser, without a meta-language, in 339 source lines: https://github.com/offscale/cdd-python/blob/master/cdd/cst_utils.py

Implementation of Concrete Syntax Tree [List!]

  1. Reads source character by character;
  2. Once the end of a statement† is detected, append that statement chunk to a 1D list;
    • †end of statement is the end of line if line.lstrip().startswith("#"), or if the line doesn't end with '\\' and balanced_parens(line) holds; otherwise keep munching until that condition is true… plus some edge cases around multiline strings and the like (see the sketch after this list);
  3. Once finished, there is one big (1D) list where each element is a namedtuple with a value property.
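
As a rough illustration (this is not the actual cdd implementation, which also handles multiline strings and other edge cases), a toy version of that end-of-statement rule might look like the following; balanced_parens here is a naive bracket counter with no string awareness:

from collections import namedtuple

CstNode = namedtuple("CstNode", ("value",))  # toy stand-in for the real node types

def balanced_parens(s):
    """True when every opening bracket has a matching close (naive count)."""
    return all(s.count(o) == s.count(c) for o, c in ("()", "[]", "{}"))

def split_statements(source):
    """Split source into a flat (1D) list of statement-sized chunks."""
    cst_list, chunk = [], []
    for line in source.splitlines(keepends=True):
        chunk.append(line)
        joined = "".join(chunk)
        # End of statement: a pure comment line, or a line that neither
        # ends in a backslash continuation nor leaves brackets open.
        if line.lstrip().startswith("#") or (
            not line.rstrip("\n").endswith("\\") and balanced_parens(joined)
        ):
            cst_list.append(CstNode(value=joined))
            chunk = []
    if chunk:  # trailing partial statement
        cst_list.append(CstNode(value="".join(chunk)))
    return cst_list

print([n.value for n in split_statements("x = (1 +\n     2)\n# kept\ny = 3\n")])
# ['x = (1 +\n     2)\n', '# kept\n', 'y = 3\n']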

Integration with builtin Abstract Syntax Tree ast

  1. Limit the ast nodes to modify (never remove) to: the docstring of {ClassDef, AsyncFunctionDef, FunctionDef} (the first body element, a Constant|Str), plus Assign and AnnAssign;
  2. cst_idx, cst_node = find_cst_at_ast(cst_list, _node);
  3. if doc_str node then maybe_replace_doc_str_in_function_or_class(_node, cst_idx, cst_list)
  4. Now cst_list contains changes only to those aforementioned nodes, and only when the change is more than whitespace; it can be joined back into a string with "".join(map(attrgetter("value"), cst_list)) for handing to eval or writing straight out to a source file (e.g., overriding it in place), as in the snippet below.
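
For instance, continuing the toy CstNode sketch above (find_cst_at_ast and maybe_replace_doc_str_in_function_or_class are the library's own helpers, so only the final join step is shown here):

from operator import attrgetter

def emit(cst_list, path="module.py"):  # hypothetical output path
    # cst_list: the flat list of value-carrying namedtuples, with the
    # docstring/Assign/AnnAssign entries already swapped in place.
    source = "".join(map(attrgetter("value"), cst_list))
    with open(path, "w") as f:  # in-place override of the source file
        f.write(source)
    return source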

Quality control

  1. 100% test coverage
  2. 100% doc coverage
  3. Support for the last 6 versions of Python (including the latest alpha)
  4. CI/CD
  5. (Apache-2.0 OR MIT) licensed

Limitations

  1. Lack of a meta-language; specifically, since it doesn't use Python's provided grammar, new syntax elements won't automatically be supported (match/case is supported, but any syntax introduced since then isn't [yet?] supported, at least not automatically);
  2. Not built into the stdlib, so stdlib changes could break compatibility;
  3. Deleting nodes is [probably] not supported;
  4. Nodes can be incorrectly identified if there are shadowed variables or similar issues that linters should point out.
Second answer

Comments can be preserved by capturing them with the tokenizer and merging them back into the generated source code.

Given a toy program in a program variable, we can demonstrate how comments get lost in the AST:

import ast

program = """
# This comment lost
p1v = 4 + 4
p1l = ['a', # Implicit line joining comment for a lost
       'b'] # Ending comment for b lost
def p1f(x):
    "p1f docstring"
    # Comment in function p1f lost
    return x
print(p1f(p1l), p1f(p1v))
"""
tree = ast.parse(program)
print('== Full program code:')
print(ast.unparse(tree))

The output shows all comments gone:

== Full program code:
p1v = 4 + 4
p1l = ['a', 'b']

def p1f(x):
    """p1f docstring"""
    return x
print(p1f(p1l), p1f(p1v))

However, if we scan the comments with the tokenizer, we can use this to merge the comments back in:

from io import StringIO
import tokenize

def scan_comments(source):
    """Scan source code for COMMENT tokens."""
    comtokens = []
    with StringIO(source) as f:
        for token in tokenize.generate_tokens(f.readline):
            if token.type == tokenize.COMMENT:  # builtin comment token type
                comtokens.append(token)
    return comtokens

comtokens = scan_comments(program)
print('== Comment after p1l[0]\n\t', comtokens[1])

Output (edited to split long line):

== Comment after p1l[0]
     TokenInfo(type=60 (COMMENT),
               string='# Implicit line joining comment for a lost',
               start=(4, 12), end=(4, 54),
               line="p1l = ['a', # Implicit line joining comment for a lost\n")

Using a slightly modified version of ast.unparse(), replacing the methods maybe_newline() and traverse() with modified versions, you should be able to merge all comments back in at their approximate locations, using the location info from the comment scanner (the start attribute) combined with the location info from the AST; most nodes have a lineno attribute.
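
As a minimal sketch of that idea, assuming CPython 3.9+ where ast._Unparser is the (private, version-unstable) class behind ast.unparse(): overriding traverse() lets you flush every scanned comment whose line precedes a statement's original lineno. Trailing comments get hoisted onto their own line above the statement, so placement is only approximate:

import ast

class CommentedUnparser(ast._Unparser):
    """Sketch: re-emit COMMENT tokens before the statements they precede."""

    def __init__(self, comments):
        super().__init__()
        # comments: tokenize.TokenInfo COMMENT tokens in source order
        self._comments = list(comments)

    def traverse(self, node):
        if isinstance(node, ast.stmt):
            # Write out every comment that started on or before this
            # statement's original line, at the current indentation.
            while self._comments and self._comments[0].start[0] <= node.lineno:
                self.fill(self._comments.pop(0).string)
        super().traverse(node)

merged = CommentedUnparser(scan_comments(program)).visit(tree)  # continuing the example
print(merged)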

Not exactly at their original locations, though. Take the list variable assignment, for example: the source code is split over two lines, but ast.unparse() generates only one line (see the output in the second code segment).

Also, be sure to update the location info in the AST with ast.increment_lineno() after adding code.
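
For example (ast.increment_lineno() shifts lineno and end_lineno on the node and all its descendants):

import ast

mod = ast.parse("x = 1\ny = 2")
# Suppose two lines were inserted above this code; shift the AST's
# recorded positions so later line comparisons stay aligned:
ast.increment_lineno(mod, n=2)
print(mod.body[0].lineno, mod.body[1].lineno)  # 3 4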

It seems some more calls to maybe_newline() might be needed in the library code (or its replacement).