Optimization for Comparing HashMap Elements while Iterating


I have a CSV file with nearly 845k lines of names.

I want to do fuzzy string matching on the names. I used the Java implementation of the well-known Python fuzzywuzzy algorithm.

The code below works correctly for me, but the processing time is far too long. Comparing one line against all the others takes nearly 15 seconds, which is about 240 lines per hour, or nearly 6,000 rows per day. At that rate the whole process would take months to finish, which is an unacceptable running time.

I need an optimization technique or method. I'm looking for suggestions rather than a complete solution.

What would you suggest for the code below?

BufferedReader br = new BufferedReader(new FileReader("data/names.csv"));
BufferedWriter bw = new BufferedWriter(new FileWriter("data/similars.csv"));
ConcurrentHashMap<Integer,String> map = new ConcurrentHashMap<Integer,String>();

String lines;
while( (lines = br.readLine()) != null ){
    String[] line = lines.split("\\t",-1);
    Integer nameId = Integer.parseInt(line[0]);
    String name = line[1];
    map.put(nameId, name);
}

for (Map.Entry<Integer, String> entry1 : map.entrySet()) {
    Integer nameId1 = entry1.getKey();
    String name1 = entry1.getValue();

    for (Map.Entry<Integer, String> entry2 : map.entrySet()) {
        Integer nameId2 = entry2.getKey();
        if (nameId1 == nameId2) {
            continue;
        }
        String name2 = entry2.getValue();
        int ratio = FuzzySearch.ratio(name1,name2);
        if(ratio > 95){
            bw.write(nameId1 + "," + nameId2 + "\n");
        }
    }
    // To avoid matching the same pair twice
    map.remove(nameId1);
}
1 Answer

  1. You can try the Levenshtein distance algorithm instead; maybe it will give you better performance. Or try any other algorithm (link to an algorithm implementation).
  2. It's better not to compare `Integer` values with `==`; use `nameId1.intValue() == nameId2` instead.
  3. Create N threads, where N is the number of cores. Put all your lines into a `ConcurrentLinkedQueue` and let your threads poll the queue, take a word, do the comparison, and, once finished, write to the file inside a synchronized section. This lets you use all the cores on your PC instead of just one.
  4. As for why it takes so much time: maybe you have some memory restriction that forces the GC to eat your CPU cycles and hurts performance.
  5. You can apply a minor optimization: if the difference between the word lengths is more than 50%, you can never get a 95% match, so you can skip that pair without computing the ratio.
  6. Take a look at this implementation; it uses a threshold value. I believe that will give you the biggest boost, because the algorithm can return early once the distance exceeds the threshold. Also check this question.
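To make suggestions 5 and 6 concrete, here is a rough sketch (not the implementation the answer links to). It combines a Levenshtein distance that exits early once a bound is exceeded with a length pre-filter, and parallelizes the outer loop per suggestion 3. The `SimilarNames` class and its method names are made up for illustration, and the `ratio` here is a simple Levenshtein-based percentage, not fuzzywuzzy's exact formula; for this ratio the tight pre-filter bound for a >95% match is a 5% length difference, tighter than the conservative 50% heuristic above.

```java
import java.util.List;
import java.util.stream.IntStream;

public class SimilarNames {

    // Levenshtein distance with an early-exit bound (suggestion 6):
    // returns bound + 1 as soon as the true distance is guaranteed
    // to exceed `bound`, instead of filling the whole DP table.
    static int boundedLevenshtein(String a, String b, int bound) {
        if (Math.abs(a.length() - b.length()) > bound) return bound + 1;
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            int rowMin = curr[0];
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
                rowMin = Math.min(rowMin, curr[j]);
            }
            // Distances only grow by at most 1 per row, but never shrink
            // below the row minimum, so once rowMin > bound we can stop.
            if (rowMin > bound) return bound + 1;
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Simple Levenshtein-based similarity (0..100). NOT fuzzywuzzy's
    // exact ratio formula, just a stand-in for this sketch.
    static int ratio(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) return 100;
        int d = boundedLevenshtein(a, b, maxLen);
        return (int) Math.round(100.0 * (maxLen - d) / maxLen);
    }

    public static void main(String[] args) {
        List<String> names = List.of("jonathan smith", "johnathan smith", "jane doe");
        // Compare each unordered pair once (j > i) so no pair is matched
        // twice, and run the outer loop on all cores (suggestion 3).
        IntStream.range(0, names.size()).parallel().forEach(i -> {
            for (int j = i + 1; j < names.size(); j++) {
                String a = names.get(i), b = names.get(j);
                int maxLen = Math.max(a.length(), b.length());
                // Pre-filter (suggestion 5): d >= |lenA - lenB|, so a
                // ratio above 95 needs lengths within 5% of each other.
                if (Math.abs(a.length() - b.length()) * 100 > 5 * maxLen) continue;
                if (ratio(a, b) > 95) System.out.println(a + " ~ " + b);
            }
        });
    }
}
```

Iterating `j > i` replaces the `map.remove(nameId1)` trick in the question: each pair is visited exactly once without mutating the map mid-iteration, which also makes the work trivially splittable across threads.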