Calculating the Hamming distance between the binary values of characters in two strings in R


I need help guys!

Imagine I have a vector fp with n elements.

Each element is a string of exactly 64 characters, and I want to build a distance matrix over all the elements. Each character is either a hexadecimal digit (0-9, a-f), - or X, where - means absence and X means any value.

The distance between two strings is accumulated character by character. For each position it is the Hamming distance between the 4-bit binary representations of the two characters, with exceptions for - and X. The rules are checked in this order:

  • if the characters are equal, the distance is unchanged
  • if either character is X, the distance is also unchanged
  • if either character is - and the two are different, 5 is added to the distance
  • otherwise (two different hex digits), the Hamming distance between the binary representations of the characters is added to the distance
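Since there are only 18 possible symbols, the rules above can be written down once as an 18 x 18 cost table. The following is a minimal base R sketch, not part of my original script; the names `symbols`, `popcount4` and `char_cost` are made up for illustration.

```r
# Symbols allowed in a fingerprint: 16 hex digits plus "-" and "X".
symbols <- c(0:9, letters[1:6], "-", "X")   # c() coerces everything to character
hex_val <- 0:15

# Number of set bits in a 4-bit integer (works on vectors and matrices).
popcount4 <- function(v) {
  (v %% 2) + (v %/% 2) %% 2 + (v %/% 4) %% 2 + (v %/% 8) %% 2
}

# 18 x 18 cost table, initialised to 0 (covers equal characters).
char_cost <- matrix(0, 18, 18, dimnames = list(symbols, symbols))

# hex vs hex: Hamming distance of the 4-bit values = popcount of the XOR.
char_cost[1:16, 1:16] <- popcount4(outer(hex_val, hex_val, bitwXor))

# "-" against any different symbol costs 5 ...
char_cost["-", ] <- 5
char_cost[, "-"] <- 5
char_cost["-", "-"] <- 0

# ... except X, which matches anything (the X rule is checked before the - rule).
char_cost["X", ] <- 0
char_cost[, "X"] <- 0
```

For example, `char_cost["a", "5"]` is 4 (1010 vs 0101) and `char_cost["-", "3"]` is 5.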

I was able to write a script that computes this correctly:

 library(stringdist)   # stringdist()
 library(binaryLogic)  # as.binary(); assumed to be the package this call comes from
 dist = data.frame()
 for(m in seq_along(fp)){
    for(l in seq_along(fp)){
      d = 0
      for(k in seq_len(nchar(fp[l]))){
        a = substr(fp[m],k,k)   # k-th character of each string
        b = substr(fp[l],k,k)
        if(a == b){d = d}                          # equal: nothing added
        else if(a == "X" || b == "X"){d = d}       # X matches anything
        else if(a == "-" || b == "-"){d = d+5}     # absence vs. a value
        else{
          # Hamming distance between the two 4-bit representations
          d = d+sum(stringdist(as.character(as.binary(as.hexmode(a),n=4)),as.character(as.binary(as.hexmode(b),n=4))))
        }
      }
      dist[l,m] = d
    }
  }

But when fp has 200 or more elements, I get this error message:

Error: memory exhausted (limit reached?)
Error during wrapup: memory exhausted (limit reached?)
Error: no more error handlers available (recursive errors?); invoking 'abort' restart

I already tried Sys.setenv('R_MAX_VSIZE'=32000000000), and I still get the error.

Any idea of what to do?
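One direction that might avoid the per-character calls to as.hexmode, as.binary and stringdist is to precompute the cost of every symbol pair once and then accumulate the matrix column by column. This is only a sketch under the rules described above, assuming all strings have the same length and use lowercase a-f; `fp_distance` is a hypothetical helper name, and I have not confirmed that it resolves the memory error.

```r
# Hypothetical helper: pairwise distances for a character vector `fp`
# of equal-length strings over 0-9, a-f, "-" and "X".
fp_distance <- function(fp) {
  symbols <- c(0:9, letters[1:6], "-", "X")
  popcount4 <- function(v) {
    (v %% 2) + (v %/% 2) %% 2 + (v %/% 4) %% 2 + (v %/% 8) %% 2
  }

  # 18 x 18 cost table; the diagonal and the X row/column (18) stay 0.
  cost <- matrix(0, 18, 18)
  cost[1:16, 1:16] <- popcount4(outer(0:15, 0:15, bitwXor))  # hex vs hex
  cost[17, 1:16] <- 5                                        # "-" vs hex digit
  cost[1:16, 17] <- 5

  # One row per string, one column per position, holding symbol codes 1..18.
  chars <- do.call(rbind, strsplit(fp, "", fixed = TRUE))
  codes <- matrix(match(chars, symbols), nrow = nrow(chars))
  stopifnot(!anyNA(codes))  # fail early on unexpected characters

  n <- length(fp)
  d <- matrix(0, n, n)
  for (k in seq_len(ncol(codes))) {
    # n x n block of costs for position k, added to the running total
    d <- d + cost[codes[, k], codes[, k], drop = FALSE]
  }
  d
}

# Short strings used only to illustrate the call
fp_demo <- c("0f-X", "0f-X", "1e3a")
d_demo <- fp_distance(fp_demo)
d_demo[1, 3]  # 1 (0 vs 1) + 1 (f vs e) + 5 (- vs 3) + 0 (X vs a) = 7
```

The loop runs once per character position (64 times for my data) instead of once per pair of characters, and the only large object is the n x n result itself.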
