I want to compare two .java files and only check whether they are identical.
For example, I consider the following two code blocks identical:
public class A extends B {
private int i;
private int j;
}
public class A extends B {
private int i;
private int j;
}
Since I don't care whether the "compressed" code still compiles, I thought of removing all whitespace and line breaks and then comparing the files. Would that lead to false positives? That is, is there any circumstance I can't think of in which a line break changes how the code works?
Another method I haven't looked into yet is parsing the files with JavaParser, but I have no experience with comparing compilation units, and it's possibly slower than the first approach.
Of course it would.
are semantically speaking entirely different beasts, but equal if whitespace is removed. However:
are semantically speaking entirely identical. So are:
These are not just semantically identical; most ASTs (the tree-like representation of the source code as emitted by the parser phase of ecj or javac) cannot actually differentiate between these two lines; even syntax-preserving pretty printers will always emit the first of the above 2 even if you write it in the second (admittedly, not stylistically preferred) way.
Basic text analysis is never going to get you there. Java's syntax is not the kind that a few regexes and replace operations will turn into something you can reason about. You need a full parse job.
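To make the point concrete, here is a minimal sketch of the naive "strip all whitespace" comparison. The snippet pairs are illustrative assumptions picked for this example: one pair differs only inside a string literal (different programs that compare equal once stripped), the other differs only in where the array brackets sit (the same declaration that compares unequal).

```java
public class StripWhitespaceDemo {
    // Naive normalization: delete every whitespace character, including
    // whitespace inside string literals.
    static String stripAll(String source) {
        return source.replaceAll("\\s+", "");
    }

    public static void main(String[] args) {
        // Different programs: the string literals are not the same.
        String a = "String s = \"a b\";";
        String b = "String s = \"ab\";";

        // Same declaration, written with prefix and postfix array brackets.
        String c = "int[] xs;";
        String d = "int xs[];";

        // true: a false positive, the two sources are reported as identical
        System.out.println(stripAll(a).equals(stripAll(b)));
        // false: a missed equivalence, the two sources are reported as different
        System.out.println(stripAll(c).equals(stripAll(d)));
    }
}
```

So the text-only approach fails in both directions, which is why a real parse is needed.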
I see 2 options:
1. Compile the source files to class files, and compare those. Not just byte for byte: you'd want to ensure the class files contain the info you do want (such as param names) but omit the info you don't (such as line numbers; presumably you don't care if someone tosses a blank line in a file, but that would modify the line number table). Still, class files are A LOT simpler to analyse than source files are.
2. Use ecj or the Java grammar of one of the various parser libraries out there and compare the ASTs. This is rather involved, but it is the only truly correct answer, in that it is by far the most flexible: you can define precisely what is and isn't relevant, for any imaginable syntax variation.
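A rough sketch of the first option, assuming a JDK (not just a JRE) is available so that javax.tools can find the system compiler. The class name ClassFileCompare, the helper names and the sample sources are made up for illustration. The flags are one possible choice: -g:vars keeps local variable names but leaves out the line number table, and -parameters records parameter names. With those flags a plain byte comparison is used here as a shortcut; a real tool would more likely inspect the class file structure.

```java
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class ClassFileCompare {
    // Compiles one source text in a fresh temp directory and returns the
    // bytes of the resulting top-level class file. Temp files are not cleaned up.
    static byte[] compile(String className, String source) throws IOException {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("no system compiler; run on a JDK");
        }
        Path dir = Files.createTempDirectory("classcmp");
        Path file = dir.resolve(className + ".java");
        Files.writeString(file, source);
        // -g:vars: local variable info only, no line numbers, no source file name
        int status = compiler.run(null, null, null,
                "-g:vars", "-parameters", "-d", dir.toString(), file.toString());
        if (status != 0) {
            throw new IllegalArgumentException("compilation failed for " + className);
        }
        return Files.readAllBytes(dir.resolve(className + ".class"));
    }

    // Two sources count as "the same" if they compile to identical bytes.
    static boolean sameClassFile(String className, String a, String b) {
        try {
            return Arrays.equals(compile(className, a), compile(className, b));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String a = "public class A {\n  private int i;\n  int twice(int x) { return 2 * x; }\n}\n";
        // Same code as a, with different layout, a blank line and a comment.
        String b = "public class A { private int i;\n\n  // comment\n  int twice(int x)\n  {\n    return 2*x;\n  }\n}\n";
        // Genuinely different code.
        String c = "public class A {\n  private int i;\n  int twice(int x) { return 3 * x; }\n}\n";

        System.out.println(sameClassFile("A", a, b)); // expected to match
        System.out.println(sameClassFile("A", a, c)); // expected to differ
    }
}
```

Note that this only handles a single top-level class without dependencies; sources that reference other classes would need those on the classpath.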
A major problem with #1 is that there are syntactic differences that do not end up being significant in class files, so you wouldn't be able to tell them apart. That might be more a 'feature' than a 'bug', but you haven't explained why you want to compare Java code, so I can't tell. It certainly closes that door: if you go down this path, you won't ever be able to detect syntactic differences that do not end up in class files without a complete rewrite of the project. One obvious candidate for 'code that just does not affect the class file': comments. Also, any annotations with RetentionPolicy.SOURCE retention. They just... disappear, so any class-file-based comparison system will not be able to tell.
NB: Reducing any whitespace whose bordering characters are both Java-identifier-legal to a single space, and reducing any whitespace where one or both bordering characters aren't (so: start/end of file, a parenthesis, a bracket, a dash, a dot, etcetera) to nothing, would at least be a better approach than a straight-up 'strip all whitespace'. But it wouldn't be enough for e.g. the postfix syntax of array brackets [] on method signatures, blank and otherwise effect-free semicolons in between method signatures, comments, \u escapes in strings, and a ton more things that result in different source code but which are, in pretty much every way I can imagine being relevant, 100% equivalent.
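The whitespace rule from the NB can be sketched as follows. This is an illustration only, with a made-up class name, and it uses Character.isJavaIdentifierPart as the test for 'identifier-legal'. It is still blind to string literals and comments, so a run of spaces inside a string gets collapsed as well.

```java
public class WhitespaceNormalizer {
    // Collapses each whitespace run to a single space when both neighbours
    // are identifier characters, and drops it otherwise.
    static String normalize(String src) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        int n = src.length();
        while (i < n) {
            char c = src.charAt(i);
            if (!Character.isWhitespace(c)) {
                out.append(c);
                i++;
                continue;
            }
            // Find the end of this whitespace run.
            int j = i;
            while (j < n && Character.isWhitespace(src.charAt(j))) {
                j++;
            }
            boolean prevIsId = out.length() > 0
                    && Character.isJavaIdentifierPart(out.charAt(out.length() - 1));
            boolean nextIsId = j < n
                    && Character.isJavaIdentifierPart(src.charAt(j));
            if (prevIsId && nextIsId) {
                out.append(' '); // keep one separator, e.g. between "int" and "a"
            }
            i = j;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("int   a ;"));      // int a;
        System.out.println(normalize("foo ( x , y )"));  // foo(x,y)
        // Postfix array brackets still defeat it: the two results differ.
        System.out.println(normalize("int[] xs;").equals(normalize("int xs[];"))); // false
    }
}
```

Unlike stripping everything, this keeps "int a" distinct from "inta", but the equivalent array declarations still compare unequal.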