I am trying to find a common identifier for journals using their DOIs. For example, I have a list of DOIs for one journal: ['10.1001/jamacardio.2016.5501', '10.1001/jamacardio.2017.3145', '10.1001/jamacardio.2018.3029', '10.1001/jamacardio.2020.5573', '10.1001/jamacardio.2020.0647']
(The real list is much longer than this.)
I want to find the longest common substring across the whole list. I have tried SequenceMatcher, but it can only compare two strings at a time:

    from difflib import SequenceMatcher

    # journal_list is the list of DOIs shown above

    def longestSubstring(str1, str2):
        # initialize SequenceMatcher object with the two input strings
        seqMatch = SequenceMatcher(None, str1, str2)
        # find match of longest sub-string
        # output will be like Match(a=0, b=0, size=5)
        match = seqMatch.find_longest_match(0, len(str1), 0, len(str2))
        if match.size != 0:
            return str1[match.a: match.a + match.size]
        return 'No longest common sub-string found'

    # compare one reference DOI against every entry in the list
    str1 = journal_list[1]
    for journal in journal_list:
        print(longestSubstring(str1, journal))

Expected output:
'10.1001/jamacardio.20'
![suffix tree of ["ABAB", "BABA", "ABBA"]](https://i.stack.imgur.com/tJ2NU.png)
I think it's overkill to use any fancy matching library for this and would start with a function that works with two strings:
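
The two-string function itself is not shown here, so the following is only a minimal sketch of what it might look like. It assumes `common_2` takes two strings and returns their longest common prefix, which matches the note further down that this approach only finds prefixes.

```python
def common_2(s1, s2):
    """Return the longest common prefix of two strings (assumed behaviour)."""
    length = 0
    # Walk both strings in step and stop at the first mismatch.
    for c1, c2 in zip(s1, s2):
        if c1 != c2:
            break
        length += 1
    return s1[:length]
```

Because `zip` stops at the shorter string, no explicit length check is needed.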
Then just apply this repeatedly to all the strings:
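
The code for this step is also missing, so here is one possible way to do it, sketched with `functools.reduce`. The name `common_prefix` is hypothetical, and `common_2` is an assumed prefix-only helper, defined inside the snippet so that it runs on its own.

```python
from functools import reduce


def common_2(s1, s2):
    # Assumed two-string helper: longest common prefix of s1 and s2.
    length = 0
    for c1, c2 in zip(s1, s2):
        if c1 != c2:
            break
        length += 1
    return s1[:length]


def common_prefix(strings):
    """Fold common_2 over a non-empty list of strings (hypothetical name)."""
    # reduce raises TypeError on an empty list, so the list must be non-empty.
    return reduce(common_2, strings)


journal_list = [
    '10.1001/jamacardio.2016.5501',
    '10.1001/jamacardio.2017.3145',
    '10.1001/jamacardio.2018.3029',
    '10.1001/jamacardio.2020.5573',
    '10.1001/jamacardio.2020.0647',
]
print(common_prefix(journal_list))  # 10.1001/jamacardio.20
```

Folding works for prefixes because the common prefix of the whole list is the common prefix of the running result and the next string.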
This only finds the common prefix; if you want general substrings, just modify `common_2`.
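
One caveat for the general-substring case: folding a pairwise longest-substring function over the list is not guaranteed to give the longest substring shared by all strings, because the best match between the first two strings need not occur in the third. A brute-force sketch that avoids this, assuming the strings are short (as DOIs are), checks every candidate against the whole list. The name `longest_common_substring` is hypothetical.

```python
def longest_common_substring(strings):
    """Return the longest substring present in every string, or '' if none."""
    if not strings:
        return ""
    # Any common substring must fit inside the shortest string.
    shortest = min(strings, key=len)
    # Try candidate lengths from longest to shortest, so the first hit wins.
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in s for s in strings):
                return candidate
    return ""


print(longest_common_substring(['xxabcyy', 'zabcw', 'abcq']))  # abc
```

This is cubic in the length of the shortest string in the worst case, so it is only reasonable for short inputs; a suffix tree or suffix automaton would be the usual choice for long strings.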