Minimal java file


I want to compare two .java files and only check if they are identical.

For example, I consider the following two code blocks identical:

public class A extends B {
private int i;
private int j;
}

public class A extends   B {


private int i;

private int j;

}

Since I don't care whether the "compressed" code still compiles, I thought of removing all whitespace and line breaks and then comparing the files. Would that lead to false positives? Is there any circumstance in which a line break could change how the code works that I can't think of?

Another method I haven't looked into yet is parsing the files with JavaParser, but I have no experience comparing compilation units, and it is possibly slower than the first approach.


There are 3 best solutions below

2
On

I thought of removing all whitespaces and linebreaks and then comparing the files. - Would that lead to false positives?

Of course it would.

public int foo() {}
public intf oo() {}

are semantically speaking entirely different beasts, but equal if whitespace is removed. However:

public int foo() {;}
public int foo() {}

are semantically speaking entirely identical. So are:

public int[] foo() {}
public int foo() [] {} // yeah this is legal java syntax.

These are not just semantically identical; most ASTs (the tree-like representation of the source code as emitted by the parser phase of ecj or javac) cannot actually differentiate between these two lines; even syntax-preserving pretty printers will always emit the first of the above 2 even if you write it in the second (admittedly, not stylistically preferred) way.

Basic text analysis is never going to get you there. Java syntax is not the kind of syntax that a few regexes and replace operations can turn into something you can reason about. You need a full parse job.

I see 2 options:

  1. Compile the source files to class files, and compare those. Not just byte for byte: you'd want to ensure the class files contain the info you do want (such as param names) but omit the info you don't (such as line numbers; presumably you don't care if someone tosses a blank line in a file, but that would modify the line number table). But class files are A LOT simpler to analyse than source files are.

  2. Use ecj or the java grammar of various parser libraries out there and compare the ASTs. This is rather involved, but the only truly correct answer, in that it is by far the most flexible: You can define precisely what is and isn't relevant, for any imaginable syntax variation.

A major problem with #1 is that some syntactic differences never end up in class files, so you wouldn't be able to tell them apart. That might be more a 'feature' than a 'bug', but you haven't explained why you want to compare java code, so I can't tell. It certainly closes that door: if you go down this path, you won't ever be able to detect syntactical differences that do not end up in class files without a complete rewrite of the project. One obvious candidate for 'code that just does not affect the class file': comments. Also, any annotations with RetentionPolicy.SOURCE. They just.. disappear, so any class file based comparison system will not be able to tell.
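Option 1 can be sketched with the JDK's own javax.tools API. This is a minimal sketch under assumptions, not a definitive tool: it assumes the program runs on a JDK (not a bare JRE), that each source holds one top-level class in the default package, and that the javac flags -g:none (omit debug info such as line numbers) and -parameters (keep parameter names) give the class file content you care about. The names ClassFileCompare, compile and sameClassFile are made up for illustration.

```java
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

class ClassFileCompare {

    // Compiles one source text (top-level class `className`, default package)
    // into a fresh temp directory and returns the bytes of the class file.
    static byte[] compile(String className, String source) {
        JavaCompiler javac = ToolProvider.getSystemJavaCompiler();
        if (javac == null) {
            throw new IllegalStateException("No system compiler: run on a JDK");
        }
        try {
            Path dir = Files.createTempDirectory("classcmp");
            Path src = dir.resolve(className + ".java");
            Files.write(src, source.getBytes(StandardCharsets.UTF_8));
            // -g:none drops line numbers and other debug info,
            // -parameters keeps formal parameter names.
            int rc = javac.run(null, null, null,
                    "-g:none", "-parameters", "-d", dir.toString(), src.toString());
            if (rc != 0) {
                throw new IllegalArgumentException("Compilation failed for " + className);
            }
            // Temp files are left behind; a real tool would clean them up.
            return Files.readAllBytes(dir.resolve(className + ".class"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // True when both sources compile to byte-identical class files.
    static boolean sameClassFile(String className, String a, String b) {
        return Arrays.equals(compile(className, a), compile(className, b));
    }
}
```

As the answer notes, anything that never reaches the class file (comments, source-retention annotations) is invisible to this kind of comparison.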

NB: Reducing any whitespace whose bordering characters are both java-identifier-legal to a single space, and reducing any whitespace where one or both bordering characters aren't (so, start/end of file, a parenthesis, or bracket, or dash, or dot, etcetera) to nothing would at least be a better approach than a straight up 'strip all whitespace'. It still wouldn't be enough for e.g. postfix syntax of array brackets [] on method signatures, blank and otherwise effect-free semicolons in between method signatures, comments, \u escapes in strings, and a ton more things that result in different source code but which are, in pretty much all ways that I can imagine are relevant, 100% equivalent.
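The whitespace rule from the NB above could look roughly like this. It is a sketch of that rule only, with a hypothetical class name (WhitespaceNormalizer), and it deliberately shares the limitations already listed: it knows nothing about string literals, comments, or any other syntax.

```java
class WhitespaceNormalizer {

    // Collapses each whitespace run to a single space when both neighbours are
    // identifier characters, and removes the run entirely otherwise.
    static String normalize(String src) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        int n = src.length();
        while (i < n) {
            char c = src.charAt(i);
            if (!Character.isWhitespace(c)) {
                out.append(c);
                i++;
                continue;
            }
            int j = i;
            while (j < n && Character.isWhitespace(src.charAt(j))) {
                j++;
            }
            boolean leftIsIdent = out.length() > 0
                    && Character.isJavaIdentifierPart(out.charAt(out.length() - 1));
            boolean rightIsIdent = j < n
                    && Character.isJavaIdentifierPart(src.charAt(j));
            if (leftIsIdent && rightIsIdent) {
                out.append(' ');
            }
            i = j;
        }
        return out.toString();
    }
}
```

With this rule the two class A snippets from the question normalize to the same string, while "public int foo() {}" and "public intf oo() {}" stay different.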

1
On

I can think of the two approaches below:

  1. Remove all line breaks and whitespace, then generate an MD5 hash of each resulting string and compare the hashes.

  2. If you don't want to lose the formatting, read both files character by character and compare them, skipping over any whitespace on either side before matching the next character.
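Both approaches can be sketched in a few lines of plain Java. The class and method names here (WhitespaceBlindCompare, sameMd5, equalIgnoringWhitespace) are invented for the example. Note that both ignore all whitespace, so they also treat "int foo" and "intf oo" as equal.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class WhitespaceBlindCompare {

    // Approach 1: strip all whitespace, then compare MD5 digests.
    static boolean sameMd5(String a, String b) {
        return MessageDigest.isEqual(md5(a.replaceAll("\\s+", "")),
                                     md5(b.replaceAll("\\s+", "")));
    }

    private static byte[] md5(String s) {
        try {
            return MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Approach 2: walk both inputs in parallel, skipping whitespace on each
    // side, without building stripped copies.
    static boolean equalIgnoringWhitespace(CharSequence a, CharSequence b) {
        int i = 0;
        int j = 0;
        while (true) {
            while (i < a.length() && Character.isWhitespace(a.charAt(i))) i++;
            while (j < b.length() && Character.isWhitespace(b.charAt(j))) j++;
            if (i == a.length() || j == b.length()) {
                // Equal only if both inputs are exhausted at the same time.
                return i == a.length() && j == b.length();
            }
            if (a.charAt(i++) != b.charAt(j++)) {
                return false;
            }
        }
    }
}
```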

0
On

What does identical mean? If you clarify this requirement, the rest of the problem becomes easier.

Identical means the same non-whitespace characters are in the same order.

Strip all whitespace and then compare. Note that this makes significant string literals like " " match "" and "     ".

Identical means all non-significant whitespace is ignored.

Stripping the whitespace won't work; you need to parse the file to know which whitespace is replaceable. Then you can remove or compress the replaceable whitespace with a specific pattern and compare the files. Under this definition, int getMaxX() { int x = 3; return x; } will still compare as different from int getMaxX() { return 3; }
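One way to approximate "ignore only non-significant whitespace" without a full parser is to compare token sequences. The following is a rough sketch with a hand-written, simplified lexer, not a complete Java lexer: the class name TokenCompare is invented, comments are dropped (an assumption about what counts as identical), and text blocks, \u escapes and the special handling of >> in generics are not covered.

```java
import java.util.ArrayList;
import java.util.List;

class TokenCompare {

    // Multi-character operators, longest first, so "++ +" and "+ ++" differ.
    private static final String[] OPERATORS = {
        ">>>=", "<<=", ">>=", ">>>", "...", "->", "::", "++", "--", "&&", "||",
        "==", "!=", "<=", ">=", "+=", "-=", "*=", "/=", "&=", "|=", "^=", "%=",
        "<<", ">>"
    };

    static List<String> tokens(String src) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int n = src.length();
        while (i < n) {
            char c = src.charAt(i);
            int start = i;
            if (Character.isWhitespace(c)) { i++; continue; }
            if (src.startsWith("//", i)) {            // line comment: skip
                while (i < n && src.charAt(i) != '\n') i++;
                continue;
            }
            if (src.startsWith("/*", i)) {            // block comment: skip
                int end = src.indexOf("*/", i + 2);
                i = end < 0 ? n : end + 2;
                continue;
            }
            if (c == '"' || c == '\'') {              // literal: keep verbatim
                i++;
                while (i < n && src.charAt(i) != c) {
                    if (src.charAt(i) == '\\') i++;   // skip the escaped char
                    i++;
                }
                i = Math.min(i + 1, n);
            } else if (Character.isJavaIdentifierPart(c)) {
                while (i < n && Character.isJavaIdentifierPart(src.charAt(i))) i++;
            } else {
                i += operatorLength(src, i);
            }
            out.add(src.substring(start, i));
        }
        return out;
    }

    private static int operatorLength(String src, int i) {
        for (String op : OPERATORS) {
            if (src.startsWith(op, i)) return op.length();
        }
        return 1; // any other punctuation is a one-character token
    }

    static boolean sameTokens(String a, String b) {
        return tokens(a).equals(tokens(b));
    }
}
```

Because string literals are kept verbatim, " " and "" no longer match, and because identifiers are whole tokens, "int foo" and "intf oo" differ.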

Identical means the code flow is the same.

You can compile the two files and compare their .class files. This makes it easier to compare logical flows, but you still need to take care in cases where the variable names differ. This will fail to show if (x) { doTrue(); } else { doFalse(); } as identical to if (!x) { doFalse(); } else { doTrue(); } because even though the logic is identical, the code flows are different.

Identical means the result of the output is the same.

You can write a set of unit tests that exercise the code under certain conditions, verifying that both versions produce the same results. This is not a perfect test of "identical": non-trivial methods cannot be tested exhaustively in a finite amount of time.
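As a tiny illustration of this output-based definition, the two if/else variants mentioned earlier can be compared by running them on the same inputs. Everything here (BehaviourCompare, branchA, branchB, agreeOn) is a made-up example, and the check is only as strong as the inputs you feed it; for a boolean parameter the input space happens to be small enough to cover completely.

```java
import java.util.List;
import java.util.Objects;
import java.util.function.Function;

class BehaviourCompare {

    // Variant 1: if (x) ... else ...
    static String branchA(boolean x) {
        if (x) { return "true"; } else { return "false"; }
    }

    // Variant 2: the same logic with the condition inverted.
    static String branchB(boolean x) {
        if (!x) { return "false"; } else { return "true"; }
    }

    // True when f and g return equal results for every sample input.
    static <T, R> boolean agreeOn(Function<T, R> f, Function<T, R> g, List<T> inputs) {
        for (T in : inputs) {
            if (!Objects.equals(f.apply(in), g.apply(in))) {
                return false;
            }
        }
        return true;
    }
}
```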

If you are building a code duplication finder, the following algorithm works best.

  1. Start with logic that properly parses the file into an Abstract Syntax Tree.
  2. For each node in the tree, create a hash value based on the node's "equality" characteristics. (From above, you know that equality is not always absolute, as you may or may not wish identical methods with variable renames to match.)
  3. Compare the two root nodes (for the two inputs) for hash equality. If they are different, no additional work is necessary: the files differ. If they are identical, still review the files for false positive matches, as there is a small chance of a hash collision (or that you've used a bad hash algorithm).

Deciding what is identical can get tricky with the above approach, and one needs to carefully craft the hashing algorithm to make identical hashes of any two items that are considered identical. For example:

hash( conditional (x > 0) ) = 3523
hash( conditional (0 < x) ) = 3523

might be valid for one kind of "identical" but not for another.

The reason this approach works well is because it is relatively easy to specify fine-grained control over your definition of identical; and, the use of hashes makes it easy to search for identical sub-elements.
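The three steps can be sketched on a toy tree. This is an illustration only: Node is a hypothetical stand-in for whatever AST your parser produces, and the canonical step encodes one possible definition of identical (treating a < b as b > a), chosen here to mirror the conditional example above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class AstHash {

    // Hypothetical minimal AST node: a label plus ordered children.
    static final class Node {
        final String label;
        final List<Node> children;

        Node(String label, Node... children) {
            this.label = label;
            this.children = Arrays.asList(children);
        }
    }

    // Normalizes the tree so that things considered identical look alike.
    // Here: rewrite "a < b" as "b > a".
    static Node canonical(Node n) {
        List<Node> kids = new ArrayList<>();
        for (Node c : n.children) {
            kids.add(canonical(c));
        }
        if (n.label.equals("<") && kids.size() == 2) {
            return new Node(">", kids.get(1), kids.get(0));
        }
        return new Node(n.label, kids.toArray(new Node[0]));
    }

    // Step 2: a node's hash combines its label with its children's hashes.
    static int hash(Node n) {
        int h = n.label.hashCode();
        for (Node c : n.children) {
            h = 31 * h + hash(c);
        }
        return h;
    }

    static boolean sameTree(Node a, Node b) {
        if (!a.label.equals(b.label) || a.children.size() != b.children.size()) {
            return false;
        }
        for (int i = 0; i < a.children.size(); i++) {
            if (!sameTree(a.children.get(i), b.children.get(i))) {
                return false;
            }
        }
        return true;
    }

    // Step 3: compare root hashes first; on a match, verify structurally
    // to rule out a hash collision.
    static boolean identical(Node a, Node b) {
        Node ca = canonical(a);
        Node cb = canonical(b);
        return hash(ca) == hash(cb) && sameTree(ca, cb);
    }
}
```

Storing the per-node hashes in a map would be the natural next step for finding identical sub-elements, which is where this approach pays off for duplicate detection.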

Now, if you don't want to write this, you can use someone else's definition of identical by using the Copy/Paste Detector (CPD) for Java: https://pmd.github.io/latest/pmd_userdocs_cpd.html