What's a straightforward way to split a string on 'top level' only, regarding quotes and parentheses?


I want to provide a function which takes a comma-separated string and splits it on separators, similar to str.split(), but keeps (potentially nested) quoted and parenthesized parts intact. Examples include comma-separated key-value pairs (a=b,c=d) but also comma-separated shell commands, which might include the separator or even the quote characters you could otherwise use to write a simple regex.

To avoid writing a fully fledged parser, my first idea was to use the csv module (see How to handle double quotes inside field values with csv module?), but I failed to make it work even for simple cases, e.g.:

>>> import csv
>>> s = "a='[1,2,3]',c=d"
>>> list(csv.reader([s], delimiter=',', quotechar="'"))  # expected: ["a='[1,2,3]'", "c=d"]
[["a='[1", '2', "3]'", 'c=d']]

and I didn't try with more complex stuff like a='[1,"3,14",3]',c="[4,5,6]".

Is there an easy way either to tame csv.reader so that it splits the above string into ['a=..', 'c=..'] or, even better, to use some built-in string processing capability?


AKX On

Well, for your particular data, you can cheat a bit since if you squint, it looks like what kwargs for a Python call would look like, so wrap it in a call expression and parse the AST.

import ast

target = """
a='[1,"3,14",3]',c="[4,5,6]"
"""

tree = ast.parse(f"XXX({target.strip()})", "...", "eval")

print({kw.arg: ast.unparse(kw.value) for kw in tree.body.keywords})

prints out

{'a': '\'[1,"3,14",3]\'', 'c': "'[4,5,6]'"}
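If the actual values are wanted rather than their source text, one possible variation (a sketch, not part of the answer above) is to run ast.literal_eval on each keyword value instead of ast.unparse. The function name parse_kwargs and the placeholder callee _f are made up for illustration. This assumes every value is a Python literal: a bare name such as a=b raises ValueError, and input that is not valid call syntax raises SyntaxError.

```python
import ast


def parse_kwargs(text: str) -> dict[str, object]:
    """Parse 'k1=<literal>,k2=<literal>' by treating it as call keywords.

    Hypothetical helper: only works when `text` is valid Python
    keyword-argument syntax and every value is a literal.
    """
    # Wrap the text in a dummy call so Python's own parser does the splitting.
    call = ast.parse(f"_f({text.strip()})", mode="eval").body
    # literal_eval accepts an AST node and never executes code.
    return {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}


if __name__ == "__main__":
    print(parse_kwargs('a=\'[1,"3,14",3]\',c="[4,5,6]"'))
```

Because the Python parser does the work, nesting and escaping inside the values follow Python's rules, which may or may not match the format you are actually parsing (shell commands, for instance, will usually not be valid Python).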
frans On

Here is a function that works for me and accepts a custom delimiter:

from typing import Iterable


def smart_split(string: str, delimiter: str = ",") -> Iterable[str]:
    """Like str.split but takes quotes and parentheses into account
    >>> list(smart_split("(=,),'a=x',\\"b=y\\""))
    ['(=,)', "'a=x'", '"b=y"']
    """
    splits: list[int] = []   # indices of delimiters found at top level
    closers: list[str] = []  # stack of closing characters still expected
    parenthizers = {'"': '"', "'": "'", "[": "]", "(": ")"}
    for i, char in enumerate(string):
        if char == delimiter and not closers:
            splits.append(i)
        elif closers and char == closers[-1]:
            closers.pop()
        elif char in parenthizers:
            closers.append(parenthizers[char])
    start = 0
    for pos in splits:
        yield string[start:pos]
        start = pos + 1
    yield string[start:]

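Be warned, though: there is no consistency check and no support for escaped quotes or parentheses. Quoted text is also not treated as opaque, so an opening bracket inside quotes (as in "a(b") is pushed onto the stack and never closed, which suppresses every later split.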
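For what it's worth, here is one way those limitations could be addressed; treat it as a sketch under stated assumptions rather than a drop-in replacement. The name split_top_level is made up, the delimiter is assumed to be a single character, a backslash is assumed to escape the next character everywhere, and curly braces are added to the bracket table. Inside quotes, only the matching closing quote is significant.

```python
from typing import Iterator

QUOTES = {'"', "'"}
BRACKETS = {"(": ")", "[": "]", "{": "}"}


def split_top_level(text: str, delimiter: str = ",") -> Iterator[str]:
    """Split on a single-character delimiter outside quotes and brackets."""
    stack: list[str] = []  # closing characters still expected
    start = 0              # start index of the current piece
    escaped = False        # True if the previous character was a backslash
    for i, char in enumerate(text):
        if escaped:
            escaped = False            # escaped character: take it literally
        elif char == "\\":
            escaped = True
        elif stack and stack[-1] in QUOTES:
            if char == stack[-1]:      # inside quotes only the closing quote matters
                stack.pop()
        elif char in QUOTES:
            stack.append(char)
        elif char in BRACKETS:
            stack.append(BRACKETS[char])
        elif stack and char == stack[-1]:
            stack.pop()
        elif char == delimiter and not stack:
            yield text[start:i]
            start = i + 1
    if stack:
        # Consistency check: something was opened but never closed.
        raise ValueError(f"unbalanced input, still expecting {stack[-1]!r}")
    yield text[start:]


if __name__ == "__main__":
    print(list(split_top_level('a=\'[1,"3,14",3]\',c="[4,5,6]"')))
```

Since this is a generator, the ValueError only surfaces once the result is consumed. An unquoted apostrophe (as in don't) still opens a quote, so free-form text may need a different rule.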