Why does my Prolog S-expression tokenizer fail on its base case?

161 Views Asked by At

To learn some Prolog (I'm using GNU Prolog) and grok its parsing abilities, I am starting by writing a Lisp (or S-expression, if I'm being exact) tokenizer, which given a set of tokens like ['(', 'f', 'o', 'o', ')'] should produce ['(', 'foo', ')']. It's not working as expected, which is why I'm here! I thought my thought process shined through in my pseudocode:

tokenize([current | rest], buffer, tokens):
    if current is '(' or ')',
        Tokenize the rest,
        And the output will be the current token buffer,
        Plus the parenthesis and the rest.

    if current is ' ',
        Tokenize the rest with a clean buffer,
        And the output will be the buffer plus the rest.
    
    if the tail is empty,
        The output will be a one-element list containing the buffer.
    
    otherwise,
        Add the current character to the buffer,
        And the output will be the rest tokenized, with a bigger buffer.

I translated that to Prolog like this:

tokenize([Char | Chars], Buffer, Tokens) :-
    ((Char = '(' ; Char = ')') ->
        tokenize(Chars, '', Tail_Tokens),
        Tokens is [Buffer, Char | Tail_Tokens];
    Char = ' ' ->
        tokenize(Chars, '', Tail_Tokens),
        Tokens is [Buffer | Tail_Tokens];

    Chars = [] -> Tokens is [Buffer];

    atom_concat(Buffer, Char, New_Buffer),
    tokenize(Chars, New_Buffer, Tokens)).

print_tokens([]) :- write('.').
print_tokens([T | N]) :- write(T), write(', '), print_tokens(N).

main :-
    % tokenize(['(', 'f', 'o', 'o', '(', 'b', 'a', 'r', ')', 'b', 'a', 'z', ')'], '', Tokens),
    tokenize(['(', 'f', 'o', 'o', ')'], '', Tokens),
    print_tokens(Tokens).

When running the result, below, like this: gprolog --consult-file lisp_parser.pl it just tells me no. I traced main, and it gave me the stack trace below. I do not understand why tokenize fails for the empty case. I see that the buffer is empty since it was cleared with the previous ')', but even if Tokens is empty at that point in time, wouldn't Tokens accumulate a larger result recursively? Can someone who is good with Prolog give me a few tips here?

| ?- main.

no
| ?- trace.
The debugger will first creep -- showing everything (trace)

(1 ms) yes
{trace}
| ?- main.
      1    1  Call: main ? 
      2    2  Call: tokenize(['(',f,o,o,')'],'',_353) ? 
      3    3  Call: tokenize([f,o,o,')'],'',_378) ? 
      4    4  Call: atom_concat('',f,_403) ? 
      4    4  Exit: atom_concat('',f,f) ? 
      5    4  Call: tokenize([o,o,')'],f,_429) ? 
      6    5  Call: atom_concat(f,o,_454) ? 
      6    5  Exit: atom_concat(f,o,fo) ? 
      7    5  Call: tokenize([o,')'],fo,_480) ? 
      8    6  Call: atom_concat(fo,o,_505) ? 
      8    6  Exit: atom_concat(fo,o,foo) ? 
      9    6  Call: tokenize([')'],foo,_531) ? 
     10    7  Call: tokenize([],'',_556) ? 
     10    7  Fail: tokenize([],'',_544) ? 
      9    6  Fail: tokenize([')'],foo,_519) ? 
      7    5  Fail: tokenize([o,')'],fo,_468) ? 
      5    4  Fail: tokenize([o,o,')'],f,_417) ? 
      3    3  Fail: tokenize([f,o,o,')'],'',_366) ? 
      2    2  Fail: tokenize(['(',f,o,o,')'],'',_341) ? 
      1    1  Fail: main ? 

(1 ms) no
{trace}
| ?- 
2

There are 2 best solutions below

4
On BEST ANSWER

How about this. I think that's what you want to do, but let's use Definite Clause Grammars (which are just horn clauses with :- replaced by --> and two elided arguments holding the input character list and remaining character list. An example DCG rule:

rule(X) --> [c], another_rule(X), {predicate(X)}.

List processing rule rule//1 says: When you find character c in the input list, then continue list processing with another_rule//1, and when that worked out, call predicate(X) as normal.

Then:

% If we encounter a separator symbol '(' or ')', we commit to the
% clause using '!' (no point trying anything else, in particular
% not the clause for "other characters", tokenize the rest of the list,
% and when we have done that decide whether 'MaybeToken', which is 
% "part of the leftmost token after '(' or ')'", should be retained.
% it is dropped if it is empty. The caller is then given an empty
% "part of the leftmost token" and the list of tokens, with '(' or ')'
% prepended: "tokenize('', [ '(' | MoreTokens] )  -->"
 
tokenize('', [ '(' | MoreTokens] ) -->
   ['('],
   !,
   tokenize(MaybeToken,Tokens),
   {drop_empty(MaybeToken,Tokens,MoreTokens)}.
   
tokenize('',[')'|MoreTokens]) --> 
   [')'],
   !,
   tokenize(MaybeToken,Tokens),
   {drop_empty(MaybeToken,Tokens,MoreTokens)}.
   
% No more characters in the input list (that's what '--> []' says).
% We succeed, with an empty token list and an empty buffer fro the
% leftmost token.

tokenize('',[]) --> [].

% If we find a 'Ch' that is not '(' or ')', then tokenize
% more of the list via 'tokenize(MaybeToken,Tokens)'. On
% returns 'MaybeToken' is a piece of the leftmost token found
% in that list, so we have to stick 'Ch' onto its start.

tokenize(LargerMaybeToken,Tokens) --> 
   [Ch],
   tokenize(MaybeToken,Tokens),
   {atom_concat(Ch,MaybeToken,LargerMaybeToken)}.

% ---
% This drops an empty "MaybeToken". If "MaybeToken" is 
% *not* empty, it is actually a token and prepended to the list "Tokens"
% ---

drop_empty('',Tokens,Tokens) :- !.
drop_empty(MaybeToken,Tokens,[MaybeToken|Tokens]).

% -----------------
% Call the DCG using phrase/2
% -----------------

tokenize(Text,Result) :-
   phrase( tokenize(MaybeToken,Tokens), Text ),
   drop_empty(MaybeToken,Tokens,Result),!.

And so:

?- tokenize([h,e,l,l,o],R).
R = [hello].

?- tokenize([h,e,l,'(',l,')',o],R).
R = [hel,(,l,),o].

?- tokenize([h,e,l,'(',l,l,')',o],R).
R = [hel,(,ll,),o].

I think in GNU Prolog, the notation `hello` generates [h,e,l,l,o] directly.

0
On

I do not understand why tokenize fails for the empty case.

The reason anything fails in Prolog is because there is no clause that makes it true. If your only clause for tokenize is of the form tokenize([Char | Chars], ...), then no call of the form tokenize([], ...) will ever be able to match this clause, and since there are no other clauses, the call will fail.

So you need to add such a clause. But first:

:- set_prolog_flag(double_quotes, chars).

This allows you to write ['(', f, o, o, ')'] as "foo".

Also, you must plan for the case where the input is completely empty, or other cases where you must maybe emit a token for the buffer, but only if it is not '' (since there should be no '' tokens littering the result).

finish_buffer(Tokens, Buffer, TokensMaybeWithBuffer) :-
    (   Buffer = ''
    ->  TokensMaybeWithBuffer = Tokens
    ;   TokensMaybeWithBuffer = [Buffer | Tokens] ).

For example:

?- finish_buffer(MyTokens, '', TokensMaybeWithBuffer).
MyTokens = TokensMaybeWithBuffer.

?- finish_buffer(MyTokens, 'foo', TokensMaybeWithBuffer).
TokensMaybeWithBuffer = [foo|MyTokens].

Note that you can prepend the buffer to the list of tokens, even if you don't yet know what that list of tokens is! This is the power of logical variables. The rest of the code uses this technique as well.

So, the case for the empty input:

tokenize([], Buffer, Tokens) :-
    finish_buffer([], Buffer, Tokens).

For example:

?- tokenize([], '', Tokens).
Tokens = [].

?- tokenize([], 'foo', Tokens).
Tokens = [foo].

And the remaining cases:

tokenize([Parenthesis | Chars], Buffer, TokensWithParenthesis) :-
    (   Parenthesis = '('
    ;   Parenthesis = ')' ),
    finish_buffer([Parenthesis | Tokens], Buffer, TokensWithParenthesis),
    tokenize(Chars, '', Tokens).
tokenize([' ' | Chars], Buffer, TokensWithBuffer) :-
    finish_buffer(Tokens, Buffer, TokensWithBuffer),
    tokenize(Chars, '', Tokens).
tokenize([Char | Chars], Buffer, Tokens) :-
    Char \= '(',
    Char \= ')',
    Char \= ' ',
    atom_concat(Buffer, Char, NewBuffer),
    tokenize(Chars, NewBuffer, Tokens).

Note how I used separate clauses for the separate cases. This makes the code more readable, but it does have the drawback compared to (... -> ... ; ...) that the last clause must exclude characters handled by previous clauses. Once you have your code in this shape, and you're happy that it works, you can transform it into a form using (... -> ... ; ...) if you really want to.

Examples:

?- tokenize("(foo)", '', Tokens).
Tokens = ['(', foo, ')'] ;
false.

?- tokenize(" (foo)", '', Tokens).
Tokens = ['(', foo, ')'] ;
false.

?- tokenize("(foo(bar)baz)", '', Tokens).
Tokens = ['(', foo, '(', bar, ')', baz, ')'] ;
false.

Finally, and very importantly, the is operator is meant only for evaluation of arithmetic expressions. It will throw an exception when you apply it to anything that is not arithmetic. Unification is different from the evaluation of arithmetic expression. Unification is written as =.

?- X is 2 + 2.
X = 4.

?- X = 2 + 2.
X = 2+2.

?- X is [a, b, c].
ERROR: Arithmetic: `[a,b,c]' is not a function
ERROR: In:
ERROR:   [20] throw(error(type_error(evaluable,...),_3362))
ERROR:   [17] arithmetic:expand_function([a,b|...],_3400,_3402) at /usr/lib/swi-prolog/library/arithmetic.pl:175
ERROR:   [16] arithmetic:math_goal_expansion(_3450 is [a|...],_3446) at /usr/lib/swi-prolog/library/arithmetic.pl:147
ERROR:   [14] '$expand':call_goal_expansion([system- ...],_3512 is [a|...],_3492,_3494,_3496) at /usr/lib/swi-prolog/boot/expand.pl:863
ERROR:   [13] '$expand':expand_goal(_3566 is [a|...],_3552,_3554,_3556,user,[system- ...],_3562) at /usr/lib/swi-prolog/boot/expand.pl:524
ERROR:   [12] setup_call_catcher_cleanup('$expand':'$set_source_module'(user,user),'$expand':expand_goal(...,_3640,_3642,_3644,user,...,_3650),_3614,'$expand':'$set_source_module'(user)) at /usr/lib/swi-prolog/boot/init.pl:443
ERROR:    [8] '$expand':expand_goal(user:(_3706 is ...),_3692,user:_3714,_3696) at /usr/lib/swi-prolog/boot/expand.pl:458
ERROR:    [6] setup_call_catcher_cleanup('$toplevel':'$set_source_module'(user,user),'$toplevel':expand_goal(...,...),_3742,'$toplevel':'$set_source_module'(user)) at /usr/lib/swi-prolog/boot/init.pl:443
ERROR: 
ERROR: Note: some frames are missing due to last-call optimization.
ERROR: Re-run your program in debug mode (:- debug.) to get more detail.
^  Call: (14) call('$expand':'$set_source_module'(user)) ? abort
% Execution Aborted
?- X = [a, b, c].
X = [a, b, c].