Why is the given code for signature generation on a character string using hash modulo returning character(0)?

70 Views Asked by At

First post on overflow, so I apologize in advance if I step out of formatiquette.

Because of a recent work assignment, I came to require the implementation of string matching across two very large character vectors. Each of these vectors is about 1M elements in length.

I have been working with Levenshtein and JW distances as metrics for similarity, however as might be expected, their O(n) complexity is directly proportional to the length of the strings being matched.

Vectorizing these distance functions over the two character vectors that I have is going to be memory intensive and time intensive, and I expect to run into buffer overflows. I tried running the distance algorithm on a small subset of these vectors and the memory space got occupied pretty heavily.

Therefore, I decided to generate the hash signatures of each string, so that they occupy less space and can be processed faster. The algorithm that I have followed for generating the same can be found here: Super-Fast-Estimate-Levenshtein-Distance

Summarily, the algorithm follows these steps:

  1. Intialize an array (I used a df imaginatively named ascii) of printable ASCII characters for signature elements.
  2. Generate hash of text sub-strings using a fov value that determines the length of the substrings. (I used "md5", should I use "xxhash32"; does it change things significantly?)
  3. Compute the modulo of the integer hash over the compression value decided before implementation.
  4. If it's not equal to zero, break or switch to next in the loop. If it is, return a character from the ascii dataset as: ascii[(int.hash %% length(ascii)),]
  5. Loop until you reach the end of the string.
  6. Return the signature.

I wrote a function trying to implement the same, but it only throws a character(0) at me. Some help in identifying the offending clause will be greatly appreciated.

The parameters for the function are:

  • x : String input for signature generation
  • compress : Compression factor for the signature
  • fov : Substring length that will be considered in each iteration

I am providing the function code and output below:

Code:

library(Rmpfr)
library(digest)

  Sigtex <- function(x,compress,fov){  
  x <- as.character(x)  
  compress <- as.integer(compress)  
  fov <- as.integer(fov)  
  s = character()  
  for(i in 1:nchar(x)){  
  modu <- mpfr(x =digest(substr(x,i,(i+fov)), algo = "md5",serialize = F),base = 16)
    if(modu%%compress != 0){
    if(i==nchar(x)) {break} 
    else {next} } 
    else{s = paste(s,ascii[(modu%%62),],sep = "")}}  

    if(i==nchar(x)) {return(s)} 
    else {next} }

Output:

  >Sigtex("Simbasrteha",123,3)
  character(0)

  >Sigtex("LionKing",123,5)
  character(0)

Thanks again.

0

There are 0 best solutions below