Why Diff-match-patch broken linediff beyond 65K lines

1k Views Asked by At

I try using The google diff-match-path library for line diffs: https://github.com/google/diff-match-patch/wiki/Line-or-Word-Diffs. I get wrong patches when in sum the lines of both inputs goes beyond 65,536 (2^16) lines.

Is that a bug (in my code or diff-match-patch), or am I hitting a known limitation of javascript/nodejs? Anything I can do to use d-m-p with larger files?

Using node version v6.3.1, diff-match-patch 1.0.4

This script reproduces the problem

var diff_match_patch = require("diff-match-patch")

// function copied from google wiki 
// https://github.com/google/diff-match-patch/wiki/Line-or-Word-Diffs
function diff_lineMode(text1, text2) {
  var dmp = new diff_match_patch();
  var a = dmp.diff_linesToChars_(text1, text2);
  var lineText1 = a.chars1;
  var lineText2 = a.chars2;
  var lineArray = a.lineArray;
  var diffs = dmp.diff_main(lineText1, lineText2, false);
  dmp.diff_charsToLines_(diffs, lineArray);
  return diffs;
}

// reproduce problem by diffing string with many lines to "abcd"
for (let size = 65534; size < 65538; size += 1) {
  let text1 = "";
  for (let i = 0; i < size; i++) {
    text1 += i + "\n";
  }

  var patches = diff_lineMode(text1, "abcb")
  console.log("######## Size: " + size + ": patches " + patches.length)
  for (let i = 0; i < patches.length; i++) {
    // patch[0] is action, patch[1] is value
    var action = patches[i][0] < 0 ? "remove" : (patches[i][0] > 0 ? "add" : "keep")
    console.log("patch" + i + ": " + action + "\n" + patches[i][1].substring(0, 10))
  }
}

Giving these outputs:

######## Size: 65534: patches 2
patch0: remove
0
1
2
3
4

patch1: add
abcb
######## Size: 65535: patches 2
patch0: remove
0
1
2
3
4

patch1: add

######## Size: 65536: patches 2
patch0: keep
0

patch1: remove
1
2
3
4
5

######## Size: 65537: patches 3
patch0: remove
0

patch1: keep
1

patch2: remove
2
3
4
5
6
1

There are 1 best solutions below

0
On

It's a limitation from ES5 and the algorithm mapping lines to 16bit unicode characters. On ES6, it can be extended to 2^21 bit instead, covering longer files.

To speed up line-diffing, the algorithm does not compare the whole texts, but replaces each line with a single unicode character. So each character in the replacement maps to one unique line in a hashmap. The number of unicode characters however is limited, and the current implementation merely overflows.

This will not cause false positives (same lines will still be considered same), but it may miss some line differences at a low probability of 1/65K per line for natural diffs.

And it prevents the patches to be mapped back to the original text lines reliably, because different lines were mapped to the same character, so the inverse process maps all such chars to the first mapped line.

It should be possible to scale correct diffing to much larger inputs by using a larger target space of symbols, such as by using 2 or 3 characters to represent unique lines.