Java-based Syntax Check / Rule-based Toolset

We have been given a project in which we have to accept a set of large text files and check them against very specific requirements, roughly 150-200 rules. Each rule can pass, fail, or be not applicable. A pass or fail can be determined by the presence or absence of a matching regex. Some rules are multi-line (e.g., if "X" exists, then the following three lines must also exist and must contain 1, 2, and 3).

Although the entire thing could be written as a pile of very hard-to-read regexes, with the whole file re-read for each rule, I figured I would ask the community whether there is another choice.

I have looked at OpenRules, Drools, etc., and none of them seems to make it any easier than writing a huge list of regexes and applying each one to the text file.
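
For reference, the brute-force approach described above would look roughly like the sketch below. The Rule record, the example patterns and the full re-scan of the file per rule are only illustrations of the problem, not a recommendation.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.regex.Pattern;

// Sketch of the "one regex per rule" approach: every rule scans the whole file again.
public class NaiveRegexChecker {

    enum Result { PASS, FAIL, NOT_APPLICABLE } // deciding NOT_APPLICABLE is domain-specific, omitted here

    // A rule is a name plus a pattern that must (or must not) match somewhere in the file.
    record Rule(String name, Pattern pattern, boolean mustMatch) {}

    static Result check(Rule rule, List<String> lines) {
        boolean matched = lines.stream().anyMatch(l -> rule.pattern().matcher(l).find());
        return matched == rule.mustMatch() ? Result.PASS : Result.FAIL;
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Path.of(args[0]));
        List<Rule> rules = List.of(
            new Rule("hostname present", Pattern.compile("^hostname\\s+\\S{5,}"), true),
            new Rule("telnet disabled",  Pattern.compile("^service telnet"), false)
        );
        for (Rule rule : rules) { // each rule walks the whole file again
            System.out.println(rule.name() + ": " + check(rule, lines));
        }
    }
}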

There are 4 answers below.

BEST ANSWER

I don't see any way you can entirely avoid writing regular expressions and applying them to the lines of those text files. (There is no indication of an overall grammar defining the configuration file data; writing a parser according to such a grammar would probably be a cinch. No chance?)

I see two problems you have to solve. One is the recognition of certain keywords (such as 'hostname'), the other one is the presence or absence of certain patterns depending on one or more previous lines.

To solve the first problem, I would (use Java code to) break each line into whitespace-separated tokens, so that each line becomes a List<String>.
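
A minimal sketch of what those facts could look like in Java follows; the class and property names (Line with a number and tokens, plus Correct and Error result facts used by the rules further down) are assumptions, not anything Drools prescribes.

import java.util.Arrays;
import java.util.List;

// Illustrative fact classes; each would normally live in its own file.
class Line {
    private final int number;          // 1-based line number in the input file
    private final List<String> tokens; // whitespace-separated tokens of the line

    Line(int number, String text) {
        this.number = number;
        this.tokens = Arrays.asList(text.trim().split("\\s+"));
    }
    public int getNumber() { return number; }
    public List<String> getTokens() { return tokens; }
}

class Correct {
    private final int number;
    private final String rule;

    Correct(int number, String rule) { this.number = number; this.rule = rule; }
    public int getNumber() { return number; }
    public String getRule() { return rule; }
}

// Name shadows java.lang.Error within this package; kept only to match the rules below.
class Error {
    private final int number;
    private final String message;

    Error(int number, String message) { this.number = number; this.message = message; }
    public int getNumber() { return number; }
    public String getMessage() { return message; }
}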

The second problem can be attacked using rules.

rule "hostname"
when
  Line( $n: number, $tok: tokens contains "hostname" )
  eval( $tok.get( $tok.indexOf( "hostname" ) + 1 ).length() > 4 ) // incomplete
then
  insert( new Correct( $n, "hostname" ) );
end

(Note that the eval guards against $tok ending with "hostname" before it looks at the following token.) Inserting facts for correct data is easier than writing rules for all the failing situations. At the end there will be another set of rules that check that all required Correct facts are in Working Memory. It may also be necessary to check against duplicate "hostname" definitions, which can be done easily using the Correct fact.
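
Driving these rules from Java could look roughly like the sketch below. It assumes the fact classes sketched earlier and a kmodule.xml defining a session named "rulesSession" (that name is an assumption); the kie-api calls themselves are the standard Drools 6+ ones.

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

import org.kie.api.KieServices;
import org.kie.api.runtime.KieContainer;
import org.kie.api.runtime.KieSession;

public class RuleRunner {
    public static void main(String[] args) throws Exception {
        // "rulesSession" must match a <ksession> name in kmodule.xml (an assumption here).
        KieContainer container = KieServices.Factory.get().getKieClasspathContainer();
        KieSession session = container.newKieSession("rulesSession");

        // Insert one Line fact per input line, numbered from 1.
        List<String> lines = Files.readAllLines(Path.of(args[0]));
        for (int i = 0; i < lines.size(); i++) {
            session.insert(new Line(i + 1, lines.get(i)));
        }
        session.fireAllRules();

        // Collect whatever the rules inserted (Correct and Error facts).
        for (Object fact : session.getObjects()) {
            if (fact instanceof Correct c) {
                System.out.println("OK   line " + c.getNumber() + ": " + c.getRule());
            } else if (fact instanceof Error e) {
                System.out.println("FAIL line " + e.getNumber() + ": " + e.getMessage());
            }
        }
        session.dispose();
    }
}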

Let's look at the other example as well.

rule "interface"
when
    Line( $n1: number, $tok: tokens contains "interface" )
    Line( number == $n1 + 1, tokens not contains "disabled" )
    Line( number == $n1 + 2,
       tokens not contains "parameter" ||
       tokens contains "parameter" && $tok.indexOf( "parameter" ) < $tok.size() - 1 )
then
    insert( new Error( $n1, "interface configuration error" ) );
end

It could be that $tok2.indexOf( "parameter" ) == 1 and $tok2.size() == 2 is required ($tok2 being the tokens of the line at $n1 + 2), but not knowing the exact nature of those requirements... Here I am inserting a negative result, again so it can be collected at the end, sorted by line number, and so on.

A final note: I have the feeling that the wording of these validation requirements is much too hazy, unless you are confident that a more stringent processor downstream is capable of dealing with poor syntax, or the specs actually tolerate weird phrasing such as "hostname saturn without his rings ;-)". Would that be a correct line? It passes the test according to your rule...

ANSWER

Another consideration is to have the source of the data send it to you as an XML file with a schema that you can validate against. If it passes validation, you can then parse the XML file and apply any further regex rules to the data within the tags that the schema can't specify. If the validator doesn't like the data, it will tell you which line it doesn't like (you can get a validator off the shelf; you don't have to code your own). That way, the source of the data can ensure their program generates valid data before they send it out to who knows how many customers, and your program can verify you aren't accepting bad data. I know it's probably not applicable to your current situation (having to accept a text file rather than an XML file), but it's something to consider for future designs.
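
For what it is worth, the off-the-shelf validation mentioned above is already in the JDK (javax.xml.validation). A rough sketch, with placeholder file names:

import java.io.File;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXParseException;

public class XmlCheck {
    public static void main(String[] args) throws Exception {
        // Build a validator from the agreed-upon schema ("rules.xsd" and "data.xml" are placeholders).
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new File("rules.xsd"));
        Validator validator = schema.newValidator();
        try {
            validator.validate(new StreamSource(new File("data.xml")));
            System.out.println("Document is valid against the schema.");
        } catch (SAXParseException e) {
            // The validator reports the offending line and column for you.
            System.out.println("Invalid at line " + e.getLineNumber()
                    + ", column " + e.getColumnNumber() + ": " + e.getMessage());
        }
    }
}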

ANSWER

It is not very clear to me what kind of syntax you are dealing with, but if you think it is too complicated for a set of regular expressions, then perhaps you should write a proper parser for the syntax using, for example, the excellent ANTLR.
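
For example, with ANTLR 4 the generated parser for a hypothetical Config.g4 grammar would be driven roughly as follows; ConfigLexer, ConfigParser and the file entry rule are names the generated code would have under that assumption, and the grammar itself still has to be written.

import org.antlr.v4.runtime.BaseErrorListener;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.RecognitionException;
import org.antlr.v4.runtime.Recognizer;

public class ParseCheck {
    public static void main(String[] args) throws Exception {
        // ConfigLexer and ConfigParser are generated by ANTLR from the (hypothetical) Config.g4 grammar.
        ConfigLexer lexer = new ConfigLexer(CharStreams.fromFileName(args[0]));
        ConfigParser parser = new ConfigParser(new CommonTokenStream(lexer));

        // Report syntax errors with line and column instead of the default stderr output.
        parser.removeErrorListeners();
        parser.addErrorListener(new BaseErrorListener() {
            @Override
            public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol,
                                    int line, int charPositionInLine,
                                    String msg, RecognitionException e) {
                System.out.println("line " + line + ":" + charPositionInLine + " " + msg);
            }
        });

        parser.file(); // entry rule of the hypothetical grammar
    }
}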

ANSWER

If the "text" has any regularity, you should be able to define a grammar for that structure, and then use classic parsing techniques to check that the text adheres to the structure, and classic semantic analysis techniques to verify that structured text has additional desired properties (e.g., your "has 3 lines containing values 1 2 3").

Using a grammar will also let you easily express constraints that regular expressions simply cannot (e.g., "all left parentheses have corresponding right parentheses", "structure A is contained inside structure B inside C", ...). This is the point of "context-free" grammars vs. "regular" languages (which is what regexes can recognize).
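
To make the contrast concrete, here is the kind of check that no single regular expression can express, but that a stack (or, equivalently, a grammar rule) handles trivially. A rough sketch:

import java.util.ArrayDeque;
import java.util.Deque;

public class BalanceCheck {
    // True if every '(' has a matching ')': a classic non-regular property.
    static boolean balanced(String text) {
        Deque<Character> stack = new ArrayDeque<>();
        for (char c : text.toCharArray()) {
            if (c == '(') {
                stack.push(c);
            } else if (c == ')') {
                if (stack.isEmpty()) return false; // closing with nothing open
                stack.pop();
            }
        }
        return stack.isEmpty(); // nothing left open
    }

    public static void main(String[] args) {
        System.out.println(balanced("(a (b) c)")); // true
        System.out.println(balanced("(a (b c)"));  // false
    }
}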

Finally, using a grammar and a parser means you don't have to run the regexps individually over the file. Good parser generators will combine the grammar rules into an efficient engine that will pick out the patterns in a single pass, no matter how many grammar rules you have.