So I am trying to find a common identifier for journals using dois. For example, I have a list of dois for a journal: ['10.1001/jamacardio.2016.5501', '10.1001/jamacardio.2017.3145', '10.1001/jamacardio.2018.3029', '10.1001/jamacardio.2020.5573', '10.1001/jamacardio.2020.0647']
(The list is much longer than this)
I want to find the longest common substring in my list. I have tried SequenceMatcher but can only look for similarity between 2 strings.
journal_list
def longestSubstring(str1,str2):
#initialize SequenceMatcher object with
#input string
seqMatch = SequenceMatcher(None,str1,str2)
#find match of longest sub-string
#output will be like Match(a=0, b=0, size=5)
match = seqMatch.find_longest_match(0, len(str1), 0, len(str2))
if (match.size!=0):
print (str1[match.a: match.a + match.size])
else:
print ('No longest common sub-string found')
for journal in journal_list:
str1 = journal_list[1]
print(longestSubstring(str1,journal))
Expected output:
'10.1001/jamacardio.20'
According to https://en.wikipedia.org/wiki/Longest_common_substring#Suffix_tree, you can use a suffix tree to solve the problem.
The idea is as follows:
If we can try every combination of suffixes, the substring(s) will be the longest. However, implementing this algorithm using loops has a poor performance. That's why suffix tree (a trie storing suffixes) is important. If you are unfamiliar with trie, here is its wiki: https://en.wikipedia.org/wiki/Trie. In short, trie is famous for proceeding prefixes, and by inserting suffixes, the prefixes becomes general substrings.
Suppose original string list is called
list
. When we insert all suffixes oflist[i]
, we attach informationi
into visited nodes. When a node has alli
from0
tolen(list) - 1
, we can say the node is full. Of course, the information can be implemented with sets. Now the task becomes finding the longest full sequence(s) of the suffix tree.Back to your problem: finding a common identifier for journals using dois. A doi is not that long, so generating substrings can be accomplished in expected time (although it's relatively long). And the following example doesn't consider Unicode characters, but I doubt they'll appear in dois.
Python code (I'm not an expert in Python, but you can regroup these functions and statements into a class for your usages):
There is a pruning step in
insert()
, which essentially converts inserting all suffixes of dois into inserting only the suffixes of the first doi. So the size of the suffix tree doesn't grow in proportion tolen(dois)
.To visualize the suffix tree, you can use Graphviz:
Note that visualization only works when
list
(dois
in the code) is small and strings are not too long.dot
can freeze if the input is too large. And even if it completes, you cannot really analyze such a large graph easily. When I changeddois
to["ABAB", "BABA", "ABBA"]
(and without the pruing step), the visualized suffix tree was like:All full sequences are marked as red, and
["AB", "BA"]
is the result. Your original example can be visualized as well, but the image is way too large to display here.BTW, there is another question similar to this one, and the answers there mostly involves dynamic programming: Longest common substring from more than two strings. I'd admit DP is more precise.
Edit: The DP solution in the link runs much faster than this suffix tree solution. I think that's the real answer.