So I am trying to find a common identifier for journals using DOIs. For example, I have a list of DOIs for one journal: ['10.1001/jamacardio.2016.5501', '10.1001/jamacardio.2017.3145', '10.1001/jamacardio.2018.3029', '10.1001/jamacardio.2020.5573', '10.1001/jamacardio.2020.0647']

(The list is much longer than this)

I want to find the longest common substring across my whole list. I have tried SequenceMatcher, but it only compares two strings.

from difflib import SequenceMatcher

journal_list = ['10.1001/jamacardio.2016.5501', '10.1001/jamacardio.2017.3145',
                '10.1001/jamacardio.2018.3029', '10.1001/jamacardio.2020.5573',
                '10.1001/jamacardio.2020.0647']

def longestSubstring(str1, str2):
    # initialize SequenceMatcher object with the two input strings
    seqMatch = SequenceMatcher(None, str1, str2)
    # find the longest matching block; the result looks like Match(a=0, b=0, size=5)
    match = seqMatch.find_longest_match(0, len(str1), 0, len(str2))
    if match.size != 0:
        return str1[match.a: match.a + match.size]
    return ''

# fold the pairwise match across the whole list
common = journal_list[0]
for journal in journal_list[1:]:
    common = longestSubstring(common, journal)

print(common)

Expected output:

'10.1001/jamacardio.20'

There are 3 best solutions below


According to https://en.wikipedia.org/wiki/Longest_common_substring#Suffix_tree, you can use a suffix tree to solve the problem.

The idea is as follows:

  1. Choose one suffix from each string.
  2. Find the longest common prefix of these suffixes.
  3. If the prefix is longer than the recorded best, replace the records with it. Or:
  4. If the prefix has the same length as the recorded best, append it to the records.

If we try every combination of suffixes, the recorded substring(s) will be the longest. However, implementing this algorithm with plain loops performs poorly. That's why the suffix tree (a trie storing suffixes) is important. If you are unfamiliar with tries, here is the wiki: https://en.wikipedia.org/wiki/Trie. In short, a trie is good at handling prefixes, and by inserting suffixes, those prefixes become general substrings.
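For reference, steps 1-4 can be written directly with loops (a sketch; the helper names `common_prefix` and `lcs_naive` are mine). It also shows why the naive approach is slow: every suffix of the first string is compared against every suffix of every other string.

```python
def common_prefix(a, b):
    # longest common prefix of two strings
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return a[:n]

def lcs_naive(strings):
    # Steps 1-4: for each suffix of the first string, find the longest
    # prefix it shares with some suffix of every other string, and keep
    # the longest result(s) seen so far.
    best = ['']
    first, rest = strings[0], strings[1:]
    for i in range(len(first)):
        suffix = first[i:]
        for s in rest:
            # best overlap of `suffix` with any suffix of s
            suffix = max((common_prefix(suffix, s[j:]) for j in range(len(s))),
                         key=len, default='')
        if len(suffix) > len(best[0]):
            best = [suffix]          # step 3: a longer prefix replaces the records
        elif suffix and len(suffix) == len(best[0]) and suffix not in best:
            best.append(suffix)      # step 4: equal length is appended
    return best
```

With `["ABAB", "BABA", "ABBA"]` this returns `["AB", "BA"]`, matching the visualization example later in this answer.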

Suppose the original string list is called list. When we insert all suffixes of list[i], we attach the information i to each visited node. When a node has collected every i from 0 to len(list) - 1, we say the node is full. This information can be implemented with sets. The task now becomes finding the longest full sequence(s) in the suffix tree.

Back to your problem: finding a common identifier for journals using DOIs. A DOI is not that long, so generating the substrings can be done in acceptable time (although it is the slow part). The following example doesn't consider Unicode characters, but I doubt they'll appear in DOIs.

Python code (I'm not an expert in Python, but you can regroup these functions and statements into a class for your own use):

from dataclasses import dataclass, field


@dataclass
class Node:
    char: str
    layer: int
    parent: "Node"
    # sources: set = field(default_factory=set)
    # By using an integer counter we can save ~40% time
    sources: int = 0
    children: dict = field(default_factory=dict)


dois = [
    '10.1001/jamacardio.2016.5501', '10.1001/jamacardio.2017.3145',
    '10.1001/jamacardio.2018.3029', '10.1001/jamacardio.2020.5573',
    '10.1001/jamacardio.2020.0647'
]

# Sort the input so the first doi has fewer substrings
dois.sort(key=len)

def full(node):
    global dois
    # return len(node.sources) == len(dois)
    return node.sources == len(dois)


def find_nodes(root):
    """
    Find ending nodes of full-node sequences.
    Since nodes have the property `parent`, it would be easy to trace back,
    and saving nodes is more efficient than building strings.
    """
    results = []
    maxh = 0

    def dfs(node):
        nonlocal results, maxh
        '''
        We can expect the full sequences to start from the top of the suffix 
        tree. If not, since an abnormal sequence is a suffix of other suffixes, 
        it conflicts with the fact that all possible suffixes have been inserted.
        '''
        if not full(node) and node.layer > 0:
            return
        if node.layer > maxh:
            maxh = node.layer
            results = []
        if node.layer == maxh and maxh > 0:
            results.append(node)

        for child in node.children.values():
            dfs(child)

    dfs(root)
    return results


def build_string(node):
    '''
    Get expected strings from ending nodes.
    '''
    s = ''
    cur = node
    while cur is not None and full(cur):
        s += cur.char
        cur = cur.parent
    # Reverse s. `str(reversed(s))` would stringify the iterator object itself,
    # so join the reversed characters instead.
    return ''.join(reversed(s))


def insert(root, s, source):
    cur = root
    for i in range(len(s)):
        ch = s[i]
        if ch not in cur.children:
            cur.children[ch] = Node(ch, i + 1, cur)
        cur = cur.children[ch]
        # cur.sources.add(source)
        if cur.sources == source:
            cur.sources += 1
        # All following nodes won't be full.
        # This early return saves another 55% time.
        elif cur.sources < source:
            return


root = Node('', 0, None)  # the root carries no character

# Insert all suffixes of dois into the tree.
for i in range(len(dois)):
    doi = dois[i]
    for j in range(len(doi)):
        insert(root, doi[j:], i)
        
results = find_nodes(root)

# Transform nodes to strings.
for i in range(len(results)):
    results[i] = build_string(results[i])

print(results)

There is a pruning step in insert(), which essentially converts inserting all suffixes of dois into inserting only the suffixes of the first doi. So the size of the suffix tree doesn't grow in proportion to len(dois).

To visualize the suffix tree, you can use Graphviz:

def toGraphviz(root):
    s = 'digraph G {\n  overlap=scale\n  node [style=filled shape=circle]\n'
    stack = [root]
    while stack:
        node: Node = stack.pop()
        s += f'  "{id(node)}" [label=""]\n'
        for k, v in node.children.items():
            s += f'  "{id(node)}" -> "{id(v)}" [label="{k}" {"color=red penwidth=2.0" if full(v) else ""}]\n'
            stack.append(v)
    s += '}'
    return s

# Graphviz should be installed first: https://graphviz.org/download/
# Use `dot tree.dot -Tsvg -o tree.svg` to render a svg file.
with open('tree.dot', 'w') as f:
    f.write(toGraphviz(root))

Note that visualization only works when the list (dois in the code) is small and the strings are not too long. dot can freeze if the input is too large, and even if it completes, such a large graph is hard to analyze. When I changed dois to ["ABAB", "BABA", "ABBA"] (and removed the pruning step), the visualized suffix tree looked like this:

(image: suffix tree of ["ABAB", "BABA", "ABBA"])

All full sequences are marked in red, and ["AB", "BA"] is the result. Your original example can be visualized as well, but the image is far too large to display here.

BTW, there is another question similar to this one, and the answers there mostly involve dynamic programming: Longest common substring from more than two strings. I'd admit DP is more precise.

Edit: The DP solution in the link runs much faster than this suffix tree solution. I think that's the real answer.
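For comparison, a sketch of the classic O(m*n) two-string DP kernel that the linked answers build on (the function name `lcs2_dp` is mine). Folding it over a list keeps only one candidate per step, which can miss the multi-string optimum when several equally long pairwise matches exist.

```python
def lcs2_dp(a, b):
    # prev[j] / cur[j] = length of the common substring ending
    # exactly at a[i-1] and b[j-1]
    best, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best:
                    best, best_end = cur[j], i
        prev = cur
    return a[best_end - best:best_end]
```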


I think it's overkill to use any fancy matching library for this and would start with a function that works with two strings:

def common_2(s1, s2):
    # walk both strings in lockstep and stop at the first mismatch,
    # i.e. return the longest common prefix
    longest = ""
    for i in range(min(len(s1), len(s2))):
        if s1[i] == s2[i]:
            longest += s1[i]
        else:
            break
    return longest

Then just apply this repeatedly to all the strings:

def common(ss):
    if len(ss) < 1:
        return ""
    if len(ss) == 1:
        return ss[0]
    part = common_2(ss[0], ss[1])
    for i in range(2, len(ss)):
        part = common_2(part, ss[i])
    return part

>>> journals = ['10.1001/jamacardio.2016.5501', '10.1001/jamacardio.2017.3145', '10.1001/jamacardio.2018.3029', '10.1001/jamacardio.2020.5573', '10.1001/jamacardio.2020.0647']
>>> common(journals)
'10.1001/jamacardio.20'

This only finds the common prefix; if you want general substrings, just modify common_2.
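One way to sketch that modification with difflib (the name `common_2_sub` is mine): swap the prefix scan for SequenceMatcher's longest matching block. Caveat: folding a single pairwise result across the list is a heuristic, because the longest substring common to all strings need not be a substring of one particular intermediate result.

```python
from difflib import SequenceMatcher

def common_2_sub(s1, s2):
    # longest common substring (not just prefix) of two strings
    m = SequenceMatcher(None, s1, s2).find_longest_match(0, len(s1), 0, len(s2))
    return s1[m.a:m.a + m.size]
```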


A simple brute-force one; with your list multiplied by 1000 (so 5000 strings) it still takes less than a second:

ss = ['10.1001/jamacardio.2016.5501', '10.1001/jamacardio.2017.3145', '10.1001/jamacardio.2018.3029', '10.1001/jamacardio.2020.5573', '10.1001/jamacardio.2020.0647']

def substrings(s):
    return {s[i:j]
            for j in range(len(s)+1)
            for i in range(j+1)}

common = set.intersection(*map(substrings, ss))

print(max(common, key=len))
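A variant of the same brute-force idea that skips building full substring sets (a sketch; the function name is mine): only substrings of the shortest string can be common to all, so test those candidates, longest first, and stop at the first hit.

```python
def longest_common_substring(strings):
    shortest = min(strings, key=len)
    others = [s for s in strings if s is not shortest]
    # try candidate lengths from longest to shortest; the first hit wins
    for length in range(len(shortest), 0, -1):
        for i in range(len(shortest) - length + 1):
            cand = shortest[i:i + length]
            if all(cand in s for s in others):
                return cand
    return ''
```

For the `ss` list above, `longest_common_substring(ss)` gives `'10.1001/jamacardio.20'`.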