Replacing characters in R string based on raw hex values


Suppose I have a string in R,

mystring = 'help me'

but with a twist: the space between 'help' and 'me' is actually a non-breaking space. In UTF-8, a non-breaking space is stored as the two bytes <c2 a0>, so this string can be created by

mystring = rawToChar(as.raw(as.hexmode(c('68','65','6c','70','c2','a0','6d','65'))))

Then, for example, grepl('help me', mystring) will be FALSE, because the pattern contains a regular space.

How can I replace the non-breaking space with a regular space? And in general, how can I replace any particular raw value(s) with a particular character? Ideally I would be able to write a function like

gsubRaw(mystring, as.raw(as.hexmode(c('c2','a0'))), ' ')

An answer to a related question almost solves this, except that I don't want to replace ALL non-ASCII characters with a space, only the non-breaking space.

grepRaw() also came close, because it can find where the raw bytes occur in the string so that they can then be replaced. However, it didn't work cleanly: sometimes the position grepRaw() returned wasn't the same as the position of the non-breaking space in the string as plain text, and I don't know how to replace the raw values themselves.

There are 3 best solutions below

Best answer

Following comments on my answer to the other question, we can do this by using the fact that the non-breaking space is the byte pair \xc2\xa0 (at least in R 4.3.1 on Windows):

mystring = rawToChar(as.raw(as.hexmode(c('68','65','6c','70','c2','a0','6d','65'))))
grepl('help me', mystring)
#> [1] FALSE
tools::showNonASCII(mystring)
#> 1: help<c2><a0>me

identical('help\xc2\xa0me', mystring)
#> [1] TRUE

mynewstring = gsub('\xc2\xa0+', ' ', mystring)
grepl('help me', mynewstring)
#> [1] TRUE
tools::showNonASCII(mynewstring)

Created on 2023-07-05 with reprex v2.0.2
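
For context, the byte pair used above is the UTF-8 encoding of the non-breaking space, U+00A0. The following is a minimal sketch to check this; the names `nbsp` and `x` are illustrative, and enc2utf8() is used only so that the result should not depend on the session's native encoding. It also suggests why positions from grepRaw() can differ from character positions: grepRaw() counts bytes, while regexpr() counts characters when given UTF-8 strings.

```r
nbsp <- enc2utf8("\u00a0")   # non-breaking space, U+00A0
charToRaw(nbsp)              # bytes c2 a0
utf8ToInt(nbsp)              # code point 160 (0xA0)

# "é" (U+00E9) also takes two bytes in UTF-8, so byte and character
# positions drift apart after it.
x <- enc2utf8("caf\u00e9\u00a0ok")
grepRaw(charToRaw(nbsp), charToRaw(x), fixed = TRUE)   # byte offset: 6
as.integer(regexpr(nbsp, x, fixed = TRUE))             # character position: 5
```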

You could use the replacement operator:

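gsubRaw <- function(string, pattern, replacement){
  # Flag every byte that equals ANY pattern byte (not the sequence as a whole)
  d <- (b <- charToRaw(string)) %in% as.raw(as.hexmode(pattern))
  # Overwrite flagged bytes; assumes a single-byte replacement such as " "
  b[d] <- charToRaw(replacement)
  # Null out all but the first byte of each run of flagged bytes
  b[(e <- which(d))[c(0,diff(e)) == 1]] <- as.raw(0)
  # Drop the nulled bytes and convert back to a string
  rawToChar(b[b != as.raw(0)])
}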

tst <- gsubRaw(mystring, c("c2", "a0"), " ")
tst
#> [1] "help me"
grepl(" ", mystring)
#> [1] FALSE
grepl(" ", tst)
#> [1] TRUE
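
Note that the function above flags any byte that appears in the pattern, so a lone a0 byte (for example the second byte of "à", c3 a0) would also be replaced. As a hedged alternative, here is a minimal sketch that matches the byte sequence as a whole, using grepRaw() to find the byte offsets. The name `gsub_bytes` is hypothetical, and the sketch assumes the pattern is given as hex strings, as in the question.

```r
# Hypothetical helper: replace every occurrence of a whole byte sequence.
gsub_bytes <- function(string, pattern, replacement) {
  bytes <- charToRaw(string)
  pat   <- as.raw(as.hexmode(pattern))   # e.g. c("c2", "a0") -> raw c2 a0
  rpl   <- charToRaw(replacement)        # may be zero, one or several bytes

  # Byte offsets at which the full sequence starts
  starts <- grepRaw(pat, bytes, fixed = TRUE, all = TRUE)
  if (length(starts) == 0) return(string)

  out  <- raw(0)
  prev <- 1                              # first byte not yet copied
  for (s in starts) {
    if (s < prev) next                   # defensively skip overlapping matches
    # Copy the untouched bytes before the match, then the replacement
    out  <- c(out, bytes[seq_len(s - prev) + prev - 1], rpl)
    prev <- s + length(pat)
  }
  # Copy whatever follows the last match
  tail_bytes <- bytes[seq_len(length(bytes) - prev + 1) + prev - 1]
  rawToChar(c(out, tail_bytes))
}

mystring <- rawToChar(as.raw(as.hexmode(c('68','65','6c','70','c2','a0','6d','65'))))
gsub_bytes(mystring, c("c2", "a0"), " ")    # "help me"
gsub_bytes(mystring, c("c2", "a0"), "--")   # "help--me"
```

Because everything happens on the raw vector, the offsets returned by grepRaw() can be used directly, without converting them to character positions.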

Here's an option. You specify the replacement in plain text (e.g., " "), and the function converts it to raw bytes. The function also converts your string to raw bytes and pastes their hex codes together with a colon, making a single string; it does the same with the pattern and with the replacement. It then replaces every instance of the pattern string with the replacement string, splits the result on the character used to join the codes (a colon in the example below), and converts the hex codes from raw back to plain text.

library(stringr)
mystring = rawToChar(as.raw(as.hexmode(c('68','65','6c','70','c2','a0','6d','65'))))

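gsubRaw <- function(mystring, pattern, replacement){
  # Hex codes of the replacement, colon-joined so multi-byte replacements work
  rpl <- paste(charToRaw(replacement), collapse=":")
  # Hex codes of the input string, colon-joined into one string
  r <- charToRaw(mystring)
  r2 <- paste(r, collapse=":")
  pat <- paste(pattern, collapse=":")
  # Literal (non-regex) replacement of the hex pattern
  r2 <- gsub(pat, rpl, r2, fixed=TRUE)
  # Split back into hex codes and rebuild the string
  s <- c(str_split(r2, ":", simplify=TRUE))
  rawToChar(as.raw(as.hexmode(s)))
}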

tst <- gsubRaw(mystring, c("c2", "a0"), " ")
tst
#> [1] "help me"
grepl(" ", mystring)
#> [1] FALSE
grepl(" ", tst)
#> [1] TRUE

Created on 2023-07-02 with reprex v2.0.2