Should the PARSE dialect be used on tasks that are fundamentally about modifying the input?


In honor of Rebol 3 going open source any-minute-now (?), I'm back to messing with it. As an exercise I'm trying to write my own JSON parser in the PARSE dialect.

Since Douglas Crockford credits Rebol as an influence on his discovery of JSON, I thought it would be easy. Beyond replacing braces with brackets and getting rid of all those commas, one of the barriers to merely using LOAD on the string is that where JSON wants the equivalent of a SET-WORD!, it uses something that looks like a string to Rebol's tokenizer, followed by an illegal stray colon:

{
    "key one": {
         "summary": "This is the string content for key one's summary",
         "value": 7
    },
    "key two": {
         "summary": "Another actual string, not supposed to be a 'symbol'",
         "value": 100
    }
}

Basically I wanted to find all the cases that were like "foo bar": and turn them into foo-bar: while leaving matching quote pairs that were not followed by colons alone.
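To make the target concrete, the sample above should come out looking something like this (quotes stripped and spaces dashed only in the key positions, so LOAD would see SET-WORD!s):

```
{
    key-one: {
         summary: "This is the string content for key one's summary",
         value: 7
    },
    key-two: {
         summary: "Another actual string, not supposed to be a 'symbol'",
         value: 100
    }
}
```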

When I tackled this in PARSE (which I understand rather well in principle but still haven't used much), a couple of questions came up. Mainly: what are the promised conditions under which you can escape into code and modify the series out from under the parser, specifically in Rebol 3? More generally, is PARSE the "right kind of tool for the job" here?

Here was the rule I tried, that appears to work for this part of the task:

any [
    ; require a matched pair of quotes & capture series positions before
    ; and after the first quote, and before the last quote

    to {"} beforePos: skip startPos: to {"} endPos: skip

    ; optional colon next (if not there the rest of the next rule is skipped)

    opt [
        {:}

        ; if we got to this part of the optional match rule, there was a colon.
        ; we escape to code changing spaces to dashes in the range we captured

        (
            setWordString: copy/part startPos endPos
            replace/all setWordString space "-"
            change startPos setWordString
        )

        ; break back out into the parse dialect, and instead of changing the 
        ; series length out from under the parser we jump it back to the position
        ; before that first quote that we saw

        :beforePos

        ; Now do the removals through a match rule.  We know they are there and
        ; this will not cause this "colon-case" match rule to fail...because we
        ; saw those two quotes on the first time through!

        remove [{"}] to {"} remove [{"}]
    ]
]
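For reference, a sketch of the rule being driven end to end (the sample text and the probe are mine; this assumes R3's PARSE, where remove is a keyword and string parsing does not split on whitespace):

```rebol
json: {{"key one": {"inner key": 10}}}

parse json [
    any [
        to {"} beforePos: skip startPos: to {"} endPos: skip
        opt [
            {:}
            (change startPos replace/all copy/part startPos endPos space "-")
            :beforePos
            remove [{"}] to {"} remove [{"}]
        ]
    ]
]

probe json  ; the series has been modified in place
```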

Is that okay? Is there any chance of the change startPos setWordString in the open code mucking up the outer parse...if not in this case, then in something subtly different?

As always, any didactic "it's cleaner/shorter/better this other way" advice is appreciated.

P.S. Why isn't there a replace/all/part?

There are 3 answers below.

BEST ANSWER

The new keywords like change, insert and remove should facilitate this type of thing. I guess the main downside to this approach is the latency of pushing series contents around (I've seen it mentioned that it is faster to build new strings than to manipulate them in place).

token: [
    and [{"} thru {"} any " " ":"]
    remove {"} copy key to {"} remove {"} remove any " "
    (key: replace/all key " " "-")
]

parse/all json [
    any [
        to {"} [
            and change token key
            ; next rule here, example:
            copy new-key thru ":" (probe new-key)
            | skip
        ]
    ]
]

This is a bit convoluted, as I can't seem to get 'change to work as I'd expect (it behaves like change, not change/part), but in theory you should be able to make it shorter along these lines and have a fairly clean rule. Ideal might be:

token: [
    {"} copy key to {"} skip any " " and ":"
    (key: replace/all key " " "-")
]

parse/all json [
    any [
        to {"} change token key
        | thru {"}
    ]
]

Edit: Another fudge around change -

token: [
    and [{"} key: to {"} key.: skip any " " ":"]
    (key: replace/all copy/part key key. " " "-")
    remove to ":" insert key
]

parse/all json [
    any [to {"} [token | skip]]
]
ANSWER

Another way is to think about parse as a compiler-compiler with EBNF. If I recall the R2 syntax correctly:

copy token [rule] (append output token)

Assuming correct syntax, and no {"} in strings:

thru {"} copy key to {"} skip
; we know ":" must be there, no check
thru {"} copy content to {"} skip
(append output rejoin[ {"} your-magic-with key {":"} content {"} ])

More precisely, going char by char instead of using to:

any space  {"} copy key some [ string-char | "\" skip ] {"} 
any space ":" any space {"} copy content any [ string-char  | "\" skip ] {"} 
(append output rejoin[ {"} your-magic-with key {":"} content {"} ])
; content can be empty -> any, key not -> some
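The your-magic-with function is left undefined above; a minimal stand-in for the space-to-dash substitution the question asks about might be (name taken from the snippet, behavior assumed):

```rebol
; hypothetical helper: turn {key one} into {key-one}
your-magic-with: func [key [string!]] [
    replace/all copy key " " "-"
]
```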

string-char would be a charset matching anything except {\} and {"}; what's the syntax?
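As a sketch of that charset (assuming standard charset and complement, which exist in both R2 and R3):

```rebol
; every character except backslash and double quote
string-char: complement charset {\"}
```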

Don't know if R3 still works like this... :-/

ANSWER

Since others have answered the parse question, I'll answer the P.S.:

There are a few proposed options that have never been added to replace, and the main reason is that processing options has overhead, and this function has already needed some interesting optimizations just to handle the options it has. We were going to try to replace the function with a native once we improved its API a little. It's basically a similar situation to the reword function, where we didn't decide on a final API until recently. For replace we haven't even had that discussion yet.

In the case of the /part option, it simply hasn't been suggested by anyone before, and might be a little conceptually awkward to unify with the existing internal length calculations. It would be possible to have a limited /part option, just the integer instead of the offset reference. It would probably be best if the /part length took priority over the internally calculated length. Still, if we end up going with an adjusted API there might be no need for a /part option.