Longest common substring is a well-known problem. I'm interested in identifying common substrings in a set of strings that is growing over time.
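For concreteness, by "longest common substring" I mean the contiguous variant, which for two fixed strings has the classic dynamic-programming solution. A minimal sketch in Rust (the name lcs_len is just a placeholder of mine):

/// Length of the longest common substring of `a` and `b`,
/// via the classic O(|a| * |b|) dynamic program over bytes.
/// e.g. lcs_len("Failed to process abc000000.txt.",
///              "Failed to process abc000001.txt.") == 26
fn lcs_len(a: &str, b: &str) -> usize {
    let (a, b) = (a.as_bytes(), b.as_bytes());
    let mut prev = vec![0usize; b.len() + 1]; // previous DP row
    let mut best = 0;
    for &ca in a {
        let mut cur = vec![0usize; b.len() + 1];
        for (j, &cb) in b.iter().enumerate() {
            if ca == cb {
                // Extend the common substring ending at the previous character pair.
                cur[j + 1] = prev[j] + 1;
                best = best.max(cur[j + 1]);
            }
        }
        prev = cur;
    }
    best
}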
Use case: suppose I'm emitting log messages inside a tight loop. Many logging libraries offer a flag for suppressing duplicate messages, but what about similar messages? Contrived example:
for file in files {
    if !process_file(file) {
        warn!("Failed to process {}.", file);
    }
}
If files has, say, O(1) elements, this is all well & good. If it has O(10^6) elements, I'd like to see something like this:
Failed to process abc000000.txt.
Failed to process abc000001.txt.
Failed to process abc000002.txt.
[Repeated 912345 times]
in my log file.
I'm picturing a logging implementation that keeps track of the longest common substring between the current & last log messages. If its length spikes, it begins tracking all subsequent log messages. If more than n consecutive log messages share similarly long common substrings, it stops emitting them & starts a count instead.
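To make that concrete, here's a rough sketch of the kind of filter I'm picturing, reusing lcs_len from above (the two sketches compile together as one file; the struct name, the 0.6 similarity cutoff, and n = 3 are arbitrary placeholders, not taken from any existing library):

/// Hypothetical filter: suppresses a run of messages once enough consecutive
/// messages share a long common substring with their predecessor.
struct SimilarityFilter {
    last: Option<String>, // previous message seen
    similar_run: usize,   // consecutive messages judged "similar"
    suppressed: usize,    // messages swallowed since we went quiet
}

impl SimilarityFilter {
    const RUN_BEFORE_SUPPRESSING: usize = 3; // the "n" above; arbitrary
    const SIMILARITY: f64 = 0.6;             // fraction of the shorter message; arbitrary

    fn new() -> Self {
        Self { last: None, similar_run: 0, suppressed: 0 }
    }

    /// Returns the line to emit, or None if the message should be suppressed.
    fn check(&mut self, msg: &str) -> Option<String> {
        let similar = self.last.as_deref().map_or(false, |prev| {
            let shorter = prev.len().min(msg.len()).max(1);
            lcs_len(prev, msg) as f64 / shorter as f64 >= Self::SIMILARITY
        });
        self.last = Some(msg.to_string());

        if similar {
            self.similar_run += 1;
            if self.similar_run >= Self::RUN_BEFORE_SUPPRESSING {
                self.suppressed += 1; // inside a run of near-duplicates: stay quiet
                return None;
            }
        } else {
            self.similar_run = 0;
            if self.suppressed > 0 {
                // The run just ended: report what was swallowed, then the new message.
                let report = format!("[Repeated {} times]\n{}", self.suppressed, msg);
                self.suppressed = 0;
                return Some(report);
            }
        }
        Some(msg.to_string())
    }

    /// Call once at shutdown so a still-open run gets reported.
    fn flush(&mut self) -> Option<String> {
        if self.suppressed == 0 {
            return None;
        }
        let n = self.suppressed;
        self.suppressed = 0;
        Some(format!("[Repeated {} times]", n))
    }
}

fn main() {
    let mut filter = SimilarityFilter::new();
    for i in 0..1_000_000 {
        let msg = format!("Failed to process abc{:06}.txt.", i);
        if let Some(line) = filter.check(&msg) {
            println!("{line}");
        }
    }
    if let Some(line) = filter.flush() {
        println!("{line}");
    }
    // Prints the first three messages, then "[Repeated 999997 times]".
}

Obvious gaps: it only compares adjacent messages, the quadratic lcs_len gets expensive for long lines, and a fixed ratio cutoff is a blunt way to decide "similar".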
Any prior art here?