Efficiently check if a string is an approximate substring of (approximately contrained in) another string, up to a given error threshold?

Question

Efficiently check if a string is an approximate substring of (approximately contrained in) another string, up to a given error threshold?

1k Views Asked by 太晚了 At 17 February 2021 at 05:58

Take two character strings in C or C++, s1 and s2. It is rather trivial to check if one contains the other exactly.

The following will return true in case s2 is a substring of s1.

In C:

strstr(s1, s2)

In C++:

#include <string>
str.find(str2) != string::npos

With boost:

#include <boost/algorithm/string.hpp>
boost::algorithm::contains(s1, s2)

What I am looking for is an analogous way of efficiently (in terms of speed, not memory) finding whether a string is approximately contained in / is approximately a substring of another up to a given difference threshold. Pretty much like the agrep bash command does in Unix systems for finding files.

For example, suppose the function was called approx_contains. Then approx_contains(s1, s2, 4) could return true/false in case s2 is contained within s1 up to a Levenshtein distance of 4.

Searching online, I only found numerous references and questions on how to calculate Levenshtein distance between two strings, or theoretical algorithms for approximate string pattern matching that go beyond simply checking whether or not one string contains another - which here becomes wasteful.

Trying very hard to avoid reinventing the wheel, how could someone do such a checking in C or C++?

Original Q&A

There are 2 best solutions below

**Kermit the Frog** · Answer 1 · 2021-02-17T10:10:55.970000

I had a test program I made several years ago, but it's still working. The length of the pattern is limited to the number of bits in the CPU's registers. Described here: https://en.wikipedia.org/wiki/Bitap_algorithm. In the "See also" section on this site is a link to an open source lib, https://en.wikipedia.org/wiki/TRE_(computing)

#include <iostream>
#include <string>
#include <vector>
#include <ctype.h>

using namespace std;

typedef unsigned int     UINT;
typedef unsigned __int64 UINT64;

// #define WIDECHAR

#if defined(WIDECHAR)
typedef wstring       String;
typedef wchar_t       Char;
#define CIN           wcin
#define COUT          wcout
#else
typedef string        String;
typedef unsigned char Char;
#define CIN           cin
#define COUT          cout
#endif

template<typename T> T inputValue(const char *prompt) {
  for(;;) {
    COUT << prompt;
    T value;
    CIN >> value;
    if(CIN) return value;
    CIN.clear();
  }
}

// https://en.wikipedia.org/wiki/Bitap_algorithm

class ShiftOr {
private:
#if defined(WIDECHAR)
  static constexpr size_t s_masklength = (0xffff + 1);
#else
  static constexpr size_t s_masklength = (0xff + 1);
#endif // WIDECHAR
  UINT64   m_s;
  UINT     m_patternLen;
  UINT64  *m_mask;

  static constexpr size_t s_maskSizeInBytes = s_masklength * sizeof(m_mask[0]);

  void initMask(const UINT64 *src = nullptr) {
    if(m_mask == nullptr) {
      m_mask = new UINT64[s_masklength];
    }
    if(src) {
      memcpy(m_mask, src, s_maskSizeInBytes); // copy all value from src into m_mask
    } else {
      memset(m_mask, -1, s_maskSizeInBytes); // set all bits in m_mask-array to 1
    }
  }
  void deallocMask() {
    delete[] m_mask;
  }
public:
  ShiftOr()
    : m_s(         0      )
    , m_patternLen(0      )
    , m_mask(      nullptr)
  {
  }
  ShiftOr(const ShiftOr &src)
    : m_s(         src.m_s         )
    , m_patternLen(src.m_patternLen)
    , m_mask(      nullptr         )
  {
    if(src.m_mask) {
      initMask(src.m_mask);
    }
  }
  ShiftOr(const String &pattern, bool ignoreCase=false)
    : m_s(         0      )
    , m_patternLen(0      )
    , m_mask(      nullptr)
  {
    compilePattern(pattern, ignoreCase);
  }
  ShiftOr &operator=(const ShiftOr &src) {
    m_s          = src.m_s;
    m_patternLen = src.m_patternLen;
    if(src.m_mask) {
      initMask(src.m_mask);
    } else {
      deallocMask();
    }
    return *this;
  }
  virtual ~ShiftOr() {
    deallocMask();
  }
  void     compilePattern(const String &pattern, bool ignoreCase=false);
  intptr_t search(        const String &str                           ) const;
  intptr_t searchApprox(  const String &str, UINT maxErrors           ) const;
};

void ShiftOr::compilePattern(const String &pattern, bool ignoreCase) {
  m_patternLen = (UINT)pattern.length();
  if(m_patternLen >= 64) {
    throw string("pattern too long for shiftor-search. max length is 63");
  }
  initMask();
  for(UINT i = 0; i < m_patternLen; i++) {
    const Char ch = pattern[i];
    m_mask[ch] &= ~((UINT64)1 << i);
    if(ignoreCase) {
      if(iswlower(ch)) {
        m_mask[_toupper(ch)] &= ~((UINT64)1 << i);
      } else if(isupper(ch)) {
        m_mask[_tolower(ch)] &= ~((UINT64)1 << i);
      }
    }
  }
  m_s = (UINT64)1 << m_patternLen;
}

intptr_t ShiftOr::search(const String &str) const {
  const UINT64 maskEnd  = m_s;
  UINT64       s        = ~1;
  const Char *start = (Char*)str.c_str(), *end = start + str.length();
  for(const Char *cp = start; cp < end;) {
    s = (s | m_mask[*(cp++)]) << 1;
    if((s & maskEnd) == 0) {
      return cp - start - m_patternLen;
    }
  }
  return -1;
}

intptr_t ShiftOr::searchApprox(const String &str, UINT maxErrors) const {
  if(maxErrors == 0) {
    return search(str);
  }
  const UINT64     maskEnd  = m_s;
  vector<UINT64>   s;
  for(UINT i = 0; i < maxErrors + 1; i++) {
    s.push_back((UINT64)~1);
  }
  UINT64 *sfirst = s.data(), *slast = sfirst + maxErrors;
  const Char *firstChar = (Char*)str.c_str(), *lastChar = firstChar + str.length();
  for(const Char *cp = firstChar; cp < lastChar;) {
    const UINT64 mask = m_mask[*cp++];
    UINT64      *sp   = sfirst, olds = *sfirst;
    *sp = (olds | mask) << 1;

    while(sp++ < slast) {
      const UINT64 tmp = *sp;
      /* Substitution is all we care about */
      *sp = (olds & (tmp | mask)) << 1;
      olds = tmp;
    }
    if((*slast & maskEnd) == 0) {
      return cp - firstChar - m_patternLen;
    }
  }
  return -1;
}

int main(int argc, char **argv) {
  for(;;) {
    const String   pattern    = inputValue<String>("Enter pattern:");
    const bool     ignoreCase = inputValue<Char>("Ignore case[yn]:") == 'y';
    const ShiftOr  A(pattern, ignoreCase);
    const UINT     maxErrors  = inputValue<UINT>("Enter maxErrors:");
    for(;;) {
      const String text = inputValue<String>("Enter text:");
      if((text.length() > 0) && text[0] == '!') {
        break;
      }
      const intptr_t i = A.searchApprox(text,maxErrors);
      cout << "result:" << i << endl;
    }
  }
  return 0;
}

**maxbachmann** · Answer 2 · 2021-02-19T05:31:56.270000

Another option is the library https://github.com/maxbachmann/rapidfuzz-cpp (I am the author) which includes a similarity algorithm called partial_ratio. It performs the following steps:

searches for the longest common subsequences of the two strings
Iterates over those subsequences and calculates a normalized Levenshtein distance between the shorter string and a substring of the longer string, with len(shorter string), that starts at the start of the subsequence
returns score of the optimal alignment

As an example:

#include "rapidfuzz/fuzz.hpp"
#include <string>
std::string a = "txst";
std::string b = "this is a test";
double score = rapidfuzz::fuzz::partial_ratio(a, b);
// score is 75.0

In this example the similarity of the optimal alignment "txst" <-> "test" is calculated. Since only one substitution is required the similarity is 75%.

Getting edit distance instead of relative similarity

As you noted in the comments you are interested in the edit distance instead of relative similarity. Here is a reimplementation of partial_ratio, that does this:

template <typename Sentence1, typename Sentence2>
std::size_t partial_ratio(const Sentence1& s1, const Sentence2& s2, std::size_t max=-1)
{
  auto s1_view = rapidfuzz::common::to_string_view(s1);
  auto s2_view = rapidfuzz::common::to_string_view(s2);

  if (s1_view.empty() && s2_view.empty()) {
    return 0;
  }

  if (s1_view.empty() || s2_view.empty()) {
    return -1;
  }

  // this swap is performed in the original FuzzyWuzzy implementation, so
  // I use it in my implementation as well. Depending on your goals you
  // might want to remove this swap
  if (s1_view.length() > s2_view.length()) {
    return partial_ratio(s2_view, s1_view, max);
  }

  // Note this is a internal API and so it could change at any time
  // currently this is the slowest part, since my bitparallel
  // implementation of the LCS has a bug and so it is disabled for now
  auto blocks = rapidfuzz::detail::get_matching_blocks(s1_view, s2_view);

  // when there is a full match exit early
  for (const auto& block : blocks) {
    if (block.length == s1_view.length()) {
      return 0;
    }
  }

  // find the edit distance at the places with the longest common subsequence
  std::size_t min_dist = (std::size_t)-1;
  for (const auto& block : blocks) {
    std::size_t long_start = (block.dpos > block.spos) ? block.dpos - block.spos : 0;
    auto long_substr = s2_view.substr(long_start, s1_view.length());

    // Note you could use a different weight here like
     // e.g. {1, 1, 1} for the normal Levenshtein distance
    // only {1, 1, 1} and {1, 1, 2} have very fast implementations
    std::size_t dist = rapidfuzz::string_metric::levenshtein(
        s1_view, long_substr, {1, 1, 2}, max);

    if (dist < min_dist) {
      max = min_dist = dist;
    }
  }

  return min_dist;
}

s1and s2can be any type that can be converted to a rapidfuzz::basic_string_view. Some examples are:

std::basic_string (std::string, ...)
std::basic_string_view (std::string_string, ...) since C++17
c strings like char* (this has to find the length once when creating the string_view)
many other type, that have data() and .size() and use continous memory. Examples are boost::string_view or std::vector

For results with a edit distance above max (size_t)-1) i is returned instead.

Efficiently check if a string is an approximate substring of (approximately contrained in) another string, up to a given error threshold?

There are 2 best solutions below

Getting edit distance instead of relative similarity

Related Questions in C++

Related Questions in C

Related Questions in STRING

Related Questions in MATCHING

Related Questions in AGREP

Trending Questions

Popular # Hahtags

Popular Questions