I have about 13,000 strings in a CSV file. My program reads this file, computes the Levenshtein distance between those strings, and writes the results to another file.
Currently this is all done in a single-threaded function, which is very slow. I need to make this function faster, and my best guess is to start by running it in parallel across multiple threads.
I don't know much about parallelism and have a hard time understanding it, in particular how I need to modify my code to work that way.
I am trying to do it in Groovy, but if you know a better language for this, please tell me. This is the function (still single-threaded) that I am trying to convert to a parallel one:
// records, df (a number formatter) and StringUtils are defined elsewhere (not shown).
long startTime = System.currentTimeMillis()
long calcTime = 0
def outputList = []
for (int i = 0; i < records.size(); i++) {
    long currTime = System.currentTimeMillis()
    int matchIndex = -1
    int matchDistance = -1
    // Progress report every 50 rows.
    if (i % 50 == 0) {
        println("Status: ${df.format(i / (records.size() - 1) * 100)} % done. [${i}/${records.size() - 1}] (Last calcTime: ${df.format(calcTime / 1000.0)}s // ${calcTime}ms) (outputList.size(): ${outputList.size()})")
    }
    // The string of the current row does not change inside the inner loop.
    String s1 = "" + records[i][2]
    // Compare the current row against every row processed so far.
    for (int j = 0; j < outputList.size(); j++) {
        String s2 = "" + outputList[j][2]
        int distance = StringUtils.getLevenshteinDistance(s1, s2)
        // Keeps the largest distance that is still <= 10.
        if (distance <= 10 && distance > matchDistance) {
            matchIndex = j
            matchDistance = distance
        }
    }
    calcTime = System.currentTimeMillis() - currTime
    // Append the row together with the index and distance of its match.
    outputList << (records[i] + [matchIndex, matchDistance])
}
StringUtils is from the Apache Commons library. records is a two-dimensional array extracted from a CSV file. (Index [x][2] is the string I want to compare; the same goes for outputList.)
I've read about the GPars library and am trying to do it that way, but as I said, I have a really hard time understanding how it works.
I would really appreciate it if you could explain how you would solve this problem, or link me to resources that would help me understand it.
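For reference, here is a minimal sketch of what a parallel version might look like. It rests on one observation about the loop above: outputList[j][2] is the same string as records[j][2], and at step i the list holds exactly rows 0 to i-1, so the result for each row depends only on records and can be computed independently. The sketch uses Java parallel streams instead of GPars, and a hand-written levenshtein function in place of StringUtils.getLevenshteinDistance so that it runs without extra dependencies. The names levenshtein, findMatch and matchAll are hypothetical, and it assumes records is a list of lists. The progress output and timing are left out.

```groovy
import java.util.stream.Collectors
import java.util.stream.IntStream

// Plain two-row Levenshtein distance. Stand-in for
// StringUtils.getLevenshteinDistance so the sketch has no dependencies.
int levenshtein(String a, String b) {
    int[] prev = new int[b.length() + 1]
    int[] curr = new int[b.length() + 1]
    for (int j = 0; j <= b.length(); j++) prev[j] = j
    for (int i = 1; i <= a.length(); i++) {
        curr[0] = i
        for (int j = 1; j <= b.length(); j++) {
            int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1
            curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost)
        }
        // Swap the two rows so prev always holds the last completed row.
        int[] tmp = prev; prev = curr; curr = tmp
    }
    return prev[b.length()]
}

// Hypothetical helper: for row i, scan rows 0..i-1 and
// return [matchIndex, matchDistance], using the same condition as the loop above.
List<Integer> findMatch(List<List> records, int i, int maxDistance) {
    String s1 = "" + records[i][2]
    int matchIndex = -1
    int matchDistance = -1
    for (int j = 0; j < i; j++) {
        int distance = levenshtein(s1, "" + records[j][2])
        if (distance <= maxDistance && distance > matchDistance) {
            matchIndex = j
            matchDistance = distance
        }
    }
    return [matchIndex, matchDistance]
}

// Computes every row on the common fork/join pool. The collected list
// keeps the original row order even though rows finish out of order.
List<List> matchAll(List<List> records) {
    def indices = IntStream.range(0, records.size()).parallel()
    def rows = indices.mapToObj { int i -> records[i] + findMatch(records, i, 10) }
    return rows.collect(Collectors.toList())
}
```

With this shape, the sequential loop would be replaced by a single call such as def outputList = matchAll(records). No shared list is modified from several threads, because each row only reads records and returns its own result.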
EDIT: Here are five lines from the input CSV file:
server.log.2021.10.29|139712|2021-10-29 15:23:34,672 WARN [ xxx.groovy] [xxx:nimh:pipe--] NIMH HTML Tags in Mail with MIME Type text/plain detected. Skipping HTML Link creation⦀
server.log.2021.10.29|139713|2021-10-29 15:23:49,546 WARN [ xxx.groovy] [xxx:nimh:pipe--] NIMH Incoming Mail Routing> Admintool Template xxx.csv contains wrong line: 16⦀2021-10-29 15:23:49,546 WARN [ xxx.groovy] [xxx:nimh:pipe--] NIMH Incoming Mail Routing> Non Standard Pattern Template lines must contain ###number### pattern and at least 3 more non blank signs.⦀2021-10-29 15:23:49,546 WARN [ xxx.groovy] [xxx:nimh:pipe--] NIMH Incoming Mail Routing> Admintool Template xxx.csv contains wrong line: 17⦀2021-10-29 15:23:49,546 WARN [ xxx.groovy] [xxx:nimh:pipe--] NIMH Incoming Mail Routing> Non Standard Pattern Template lines must contain ###number### pattern and at least 3 more non blank signs.⦀2021-10-29 15:23:49,546 WARN [ xxx.groovy] [xxx:nimh:pipe--] NIMH Incoming Mail Routing> Admintool Template xxx.csv contains wrong line: 25⦀2021-10-29 15:23:49,546 WARN [ xxx.groovy] [xxx:nimh:pipe--] NIMH Incoming Mail Routing> Non Standard Pattern Template lines must contain ###number### pattern and at least 3 more non blank signs.⦀
server.log.2021.10.29|139841|2021-10-29 15:23:50,018 WARN [ xxx.groovy] [xxx:nimh:pipe--] NIMH HTML Tags in Mail with MIME Type text/plain detected. Skipping HTML Link creation⦀
server.log.2021.10.29|139855|2021-10-29 15:24:04,701 WARN [ xxx.groovy] [xxx:nimh:pipe--] NIMH HTML Tags in Mail with MIME Type text/plain detected. Skipping HTML Link creation⦀
server.log.2021.10.29|140031|2021-10-29 15:24:08,435 WARN [ice.aspect.ScriptMetricsAspect] [xxx:nimh:pipe--] Execution of script: xxx.groovy took 3 seconds⦀