Why does my lexical analyser behave as if there are no line-endings?


I have written a Java lexical analyzer, shown below.

Token.java looks like this

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public enum Token {

    TK_MINUS ("-"), 
    TK_PLUS ("\\+"), 
    TK_MUL ("\\*"), 
    TK_DIV ("/"), 
    TK_NOT ("~"), 
    TK_AND ("&"),  
    TK_OR ("\\|"),  
    TK_LESS ("<"),
    TK_LEG ("<="),
    TK_GT (">"),
    TK_GEQ (">="), 
    TK_EQ ("=="),
    TK_ASSIGN ("="),
    TK_OPEN ("\\("),
    TK_CLOSE ("\\)"), 
    TK_SEMI (";"), 
    TK_COMMA (","), 
    TK_KEY_DEFINE ("define"), 
    TK_KEY_AS ("as"),
    TK_KEY_IS ("is"),
    TK_KEY_IF ("if"), 
    TK_KEY_THEN ("then"), 
    TK_KEY_ELSE ("else"), 
    TK_KEY_ENDIF ("endif"),
    OPEN_BRACKET ("\\{"),
    CLOSE_BRACKET ("\\}"),

    STRING ("\"[^\"]+\""),
    TK_FLOAT ("[+-]?([0-9]*[.])?[0-9]+"),
    TK_DECIMAL("(?:0|[1-9](?:_*[0-9])*)[lL]?"),
    TK_OCTAL("0[0-7](?:_*[0-7])*[lL]?"),
    TK_HEXADECIMAL("0x[a-fA-F0-9](?:_*[a-fA-F0-9])*[lL]?"),
    TK_BINARY("0[bB][01](?:_*[01])*[lL]?"),
    IDENTIFIER ("\\w+");

    private final Pattern pattern;

    Token(String regex) {
        pattern = Pattern.compile("^" + regex);
    }

    int endOfMatch(String s) {
        Matcher m = pattern.matcher(s);

        if (m.find()) {
            return m.end();
        }
        return -1;
    }
}

The Lexer class, Lexer.java, looks like this

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Stream;

public class Lexer {
    private StringBuilder input = new StringBuilder();
    private Token token;
    private String lexema;
    private boolean exausthed = false;
    private String errorMessage = "";
    private Set<Character> blankChars = new HashSet<Character>();

    public Lexer(String filePath) {
        try (Stream<String> st = Files.lines(Paths.get(filePath))) {
            st.forEach(input::append);
        } catch (IOException ex) {
            exausthed = true;
            errorMessage = "Could not read file: " + filePath;
            return;
        }

        blankChars.add('\r');
        blankChars.add('\n');
        blankChars.add((char) 8);
        blankChars.add((char) 9);
        blankChars.add((char) 11);
        blankChars.add((char) 12);
        blankChars.add((char) 32);

        moveAhead();
    }

    public void moveAhead() {
        if (exausthed) {
            return;
        }

        if (input.length() == 0) {
            exausthed = true;
            return;
        }

        ignoreWhiteSpaces();

        if (findNextToken()) {
            return;
        }

        exausthed = true;
        

        if (input.length() > 0) {
            errorMessage = "Unexpected symbol: '" + input.charAt(0) + "'";
        }
    }

    private void ignoreWhiteSpaces() {
        int charsToDelete = 0;

        while (blankChars.contains(input.charAt(charsToDelete))) {
            charsToDelete++;
        }

        if (charsToDelete > 0) {
            input.delete(0, charsToDelete);
        }
    }

    private boolean findNextToken() {
        for (Token t : Token.values()) {
            int end = t.endOfMatch(input.toString());

            if (end != -1) {
                token = t;
                lexema = input.substring(0, end);
                input.delete(0, end);
                return true;
            }
        }

        return false;
    }

    public Token currentToken() {
        return token;
    }

    public String currentLexema() {
        return lexema;
    }

    public boolean isSuccessful() {
        return errorMessage.isEmpty();
    }

    public String errorMessage() {
        return errorMessage;
    }

    public boolean isExausthed() {
        return exausthed;
    }
}

And I created a class to test this lexical analyzer, named Try.java

package draft;

public class Try {

    public static void main(String[] args) {
        Lexer lexer = new Lexer("C:/Users/eimom/Documents/Input.txt");

        System.out.println("Lexical Analysis");
        System.out.println("-----------------");
        while (!lexer.isExausthed()) {
            System.out.printf("%-18s :  %s%n", lexer.currentLexema(), lexer.currentToken());
            lexer.moveAhead();
        }

        if (lexer.isSuccessful()) {
            System.out.println("Ok! :D");
        } else {
            System.out.println(lexer.errorMessage());
        }
    }
}

So, let's say the Input.txt file contains

>= 
 0x10
 ()
11001100
 -433
 0125
 0x3B

Then the output I expect is

>=        TK_GEQ
0x10      TK_HEXADECIMAL
(         TK_OPEN
)         TK_CLOSE
11001100  TK_BINARY
-433      TK_DECIMAL
0125      TK_OCTAL
0x3B      TK_HEXADECIMAL

But instead I get

Lexical Analysis
------------------

>                   :  TK_GT
=                   :  TK_ASSIGN
0                   :  TK_FLOAT
x10                 :  IDENTIFIER
(                   :  TK_OPEN
)                   :  TK_CLOSE
11001100            :  TK_FLOAT
-                   :  TK_MINUS
43301250            :  TK_FLOAT
x3B                 :  IDENTIFIER

What can I do to correct this? It seems the lexer does not stop at the end of a line; instead it continues and consumes the first character of the next line.


Answer by Mark Rotteveel:

This is your own doing: the stream returned by Files.lines(Path) contains the content of each line *without* its line terminator, so when you concatenate all the lines back into input, you end up with the file content without any line breaks.
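One minimal fix, if you want to keep the Stream approach, is to re-append a separator for every line when building input. A sketch of the effect (the LinesDemo class name and the sample lines here are just for illustration, standing in for Files.lines on your Input.txt):

```java
import java.util.stream.Stream;

public class LinesDemo {
    public static void main(String[] args) {
        // Appending the lines directly fuses them, which is what the
        // lexer currently sees: "-433", "0125", "0x3B" become one blob.
        StringBuilder fused = new StringBuilder();
        Stream.of("-433", "0125", "0x3B").forEach(fused::append);
        System.out.println(fused); // -43301250x3B

        // Re-appending a newline per line preserves the token boundaries,
        // so the lexer's whitespace handling can do its job.
        StringBuilder input = new StringBuilder();
        Stream.of("-433", "0125", "0x3B")
              .forEach(line -> input.append(line).append('\n'));
        System.out.println(input.toString().replace("\n", "\\n"));
    }
}
```

In your constructor this would mean replacing `st.forEach(input::append)` with `st.forEach(line -> input.append(line).append('\n'))`.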

Maybe you want to use Files.readString(Path) instead. I also wonder why you don't use a Reader to read character by character; that is usually much more memory-efficient than reading the entire file into memory (although that only becomes important if you want to analyse very large files).
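A sketch of the Files.readString approach (available since Java 11); the temporary file here merely stands in for the real Input.txt:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReadStringDemo {
    public static void main(String[] args) throws IOException {
        // Stand-in for the real input file.
        Path tmp = Files.createTempFile("input", ".txt");
        Files.writeString(tmp, ">=\n 0x10\n ()\n");

        // Unlike Files.lines, Files.readString keeps the line terminators.
        String content = Files.readString(tmp);
        System.out.println(content.contains("\n")); // true

        Files.delete(tmp);
    }
}
```

With the content read this way, the '\n' and '\r' entries you already put in blankChars will actually be exercised, and tokens from different lines can no longer run into each other.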