Using `grepl` on Hex characters and escaped unicode non-ASCII characters with `stringi::stri_escape_unicode()`


Old question

I have an R package in which I have a list of university names that I want to match to the user input. The list of names contains special characters and this is generating a warning in R CMD check:

checking data for non-ASCII characters (855ms)
     Warning: found non-ASCII strings

Right now, I am converting these non-ASCII Unicode characters to their ASCII-compliant escaped versions with stringi::stri_escape_unicode() to get rid of this warning. However, the grepl calls in my other functions no longer work as expected:

The user provides text with an accent (e.g., "é"), and I match it against my database. But the string is not detected correctly:

a <- stringi::stri_escape_unicode("é")

a == "é"
#> [1] FALSE

a == "\\é"
#> [1] FALSE

grepl("é", a)
#> [1] FALSE

Created on 2023-09-03 with reprex v2.0.2
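To make the comparisons above easier to read, here is a minimal sketch of what the escaped value actually contains. It only assumes that stringi is installed: the escaped string is six plain ASCII characters (a backslash, "u", and four hex digits), so it cannot equal or contain "é" until it is unescaped again.

```r
library(stringi)

# Escape a single accented character (written as \u00e9 to keep this file ASCII)
escaped <- stri_escape_unicode("\u00e9")

# The result is a 6-character ASCII string, not one accented character
nchar(escaped)
stri_enc_isascii(escaped)

# Unescaping restores the original character, and matching works again
restored <- stri_unescape_unicode(escaped)
restored == "\u00e9"
grepl("\u00e9", restored, fixed = TRUE)
```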

Note: Follow-up on this.


Edit:

List of universities

The list of universities is created here: https://github.com/rempsyc/pubmedDashboard/blob/master/data-raw/universities.R

But now I realize (thanks for pointing this aspect out in comments) that when I make the comparison between the university name and affiliation address, I use the output from easyPubMed, which actually seems to use a different encoding (although the default is supposed to use UTF8).

In the RStudio viewer, when hovering over the affiliation address, it looks like this, e.g., "Département de Psychologie, Université du Québec à Montréal, Montréal".

However, it seems that it is actually encoded like this (edit: even here on stackoverflow, the symbols are rendered correctly so I have to put them as code):

"D&#xe9;partement de Psychologie, Universit&#xe9; du Qu&#xe9;bec &#xe0; Montr&#xe9;al, Montr&#xe9;al"

I realize that this is probably why the comparison is not working:

"é" == "&#xe9;"
#> [1] FALSE

Created on 2023-09-03 with reprex v2.0.2

My mistake was assuming that what I was seeing using the mouse hovering in RStudio was the real characters, whereas under the hood it was something different.

Hex Characters

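From what I can gather, these are not a different character encoding but hexadecimal numeric character references (the HTML/XML form "&#x" + hex digits + ";", where the digits are the Unicode code point in hex). What I need is probably a function, perhaps from stringi, that converts these references back to regular characters.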

Here is a reprex of the address affiliations as provided by the easyPubMed package:

library(easyPubMed)
dami_query_string <- "Dualistic Model of Passion [Text Word] AND ('2023/01/01' [Date - Publication] : '2023/12/31' [Date - Publication])"
dami_on_pubmed <- get_pubmed_ids(dami_query_string)
pubmed_data <- fetch_pubmed_data(dami_on_pubmed)
dami_abstracts_list <- articles_to_list(pubmed_data)
article_to_df(dami_abstracts_list[[5]])$address[1]
#> [1] "Laboratoire de Recherche sur le Comportement Social, D&#xe9"

However, due to a bug inside article_to_df, the full address is not shown. For this, we have to use pubmedDashboard's version:

# remotes::install_github("rempsyc/pubmedDashboard")
library(pubmedDashboard)
article_to_df2(dami_abstracts_list[[5]])$address[1]
#> [1] "Laboratoire de Recherche sur le Comportement Social, D&#xe9;partement de Psychologie, Universit&#xe9; du Qu&#xe9;bec &#xe0; Montr&#xe9;al, Montr&#xe9;al, QC H3C 3P8, Canada."

Created on 2023-09-03 with reprex v2.0.2

What is the detected encoding?

stringi::stri_enc_detect("&#xe9;")
#> [[1]]
#>   Encoding Language Confidence
#> 1    UTF-8                0.15
#> 2 UTF-16BE                0.10
#> 3 UTF-16LE                0.10

Created on 2023-09-03 with reprex v2.0.2
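The low confidence scores are probably unsurprising: as far as I can tell, the string consists entirely of ASCII characters, so there is nothing encoding-specific for a detector to find. A small sketch (assuming only stringi and base R) that checks this:

```r
library(stringi)

ref <- "&#xe9;"

# Every character of the reference is plain ASCII
stri_enc_isascii(ref)

# Six characters: "&", "#", "x", "e", "9", ";"
nchar(ref)

# By contrast, the real accented character is not ASCII
stri_enc_isascii("\u00e9")
```

This would also explain why re-encoding calls leave the string untouched: the accent is stored as markup, not as bytes in some other encoding.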

Converting Hex Characters

This website conversion seems to do the trick, but I am not able to do the same with stringi...


Reprex:

stringi::stri_encode("&#xe9;", from = "UTF8", to = "ASCII")
#> [1] "&#xe9;"

stringi::stri_encode("&#xe9;", from = "UTF16", to = "ASCII")
#> Warning in stringi::stri_encode("&#xe9;", from = "UTF16", to = "ASCII"): the
#> Unicode code point \U00002623 cannot be converted to destination encoding
#> Warning in stringi::stri_encode("&#xe9;", from = "UTF16", to = "ASCII"): the
#> Unicode code point \U00007865 cannot be converted to destination encoding
#> Warning in stringi::stri_encode("&#xe9;", from = "UTF16", to = "ASCII"): the
#> Unicode code point \U0000393b cannot be converted to destination encoding
#> [1] "\032\032\032"

stringi::stri_encode("&#xe9;", from = "Hex", to = "ASCII")
#> Error in stringi::stri_encode("&#xe9;", from = "Hex", to = "ASCII"): The requested ICU resource file cannot be found. (U_FILE_ACCESS_ERROR)

Created on 2023-09-03 with reprex v2.0.2

stringi::stri_unescape_unicode("&#xe9;")
#> [1] "&#xe9;"

Created on 2023-09-03 with reprex v2.0.2

Changing encoding from easyPubMed

I'm thinking that perhaps if I can change the encoding directly from easyPubMed, that will make my life easier. This is possible according to the package documentation:

Note that we included an argument (namely, encoding) to force the encoding of the retrieved records. Here, we recommend “UTF8”. However, you can select different encodings (depending on the local platform). As an example, here we are specifying encoding="ASCII".

But it seems I am not able to do this correctly for hex:

library(easyPubMed)
dami_query_string <- "Dualistic Model of Passion [Text Word] AND ('2023/01/01' [Date - Publication] : '2023/12/31' [Date - Publication])"
dami_on_pubmed <- get_pubmed_ids(dami_query_string)
pubmed_data <- fetch_pubmed_data(dami_on_pubmed)

dami_abstracts_list <- articles_to_list(pubmed_data, encoding = "ASCII")
article_to_df(dami_abstracts_list[[5]])$address[1]
#> [1] "Laboratoire de Recherche sur le Comportement Social, D&#xe9"

dami_abstracts_list <- articles_to_list(pubmed_data, encoding = "UTF8")
article_to_df(dami_abstracts_list[[5]])$address[1]
#> [1] "Laboratoire de Recherche sur le Comportement Social, D&#xe9"

dami_abstracts_list <- articles_to_list(pubmed_data, encoding = "UTF16")
article_to_df(dami_abstracts_list[[5]])$address[1]
#> [1] "Laboratoire de Recherche sur le Comportement Social, D&#xe9"

dami_abstracts_list <- articles_to_list(pubmed_data, encoding = "latin")
article_to_df(dami_abstracts_list[[5]])$address[1]
#> [1] "Laboratoire de Recherche sur le Comportement Social, D&#xe9"

dami_abstracts_list <- articles_to_list(pubmed_data, encoding = "unicode")
article_to_df(dami_abstracts_list[[5]])$address[1]
#> [1] "Laboratoire de Recherche sur le Comportement Social, D&#xe9"

dami_abstracts_list <- articles_to_list(pubmed_data, encoding = "hex")
article_to_df(dami_abstracts_list[[5]])$address[1]
#> [1] "Laboratoire de Recherche sur le Comportement Social, D&#xe9"

Created on 2023-09-03 with reprex v2.0.2

Answer by rempsyc

As a first step, we define a function that converts hex character references to regular text:

convert_hex_to_char <- function(hex_string) {
  # Extract all hex character references (e.g. "&#xe9;") from the string
  hex_codes <- stringi::stri_extract_all_regex(hex_string, "&#x[0-9a-fA-F]+;")[[1]]

  # stri_extract_all_regex() yields NA when nothing matches; without this
  # guard, the replacement loop below would turn the whole string into NA
  if (all(is.na(hex_codes))) {
    return(hex_string)
  }

  # Convert each hex code to the corresponding character
  chars <- vapply(hex_codes, function(hex_code) {
    int_val <- strtoi(gsub("&#x(.*);", "\\1", hex_code), base = 16)
    intToUtf8(int_val)
  }, character(1), USE.NAMES = FALSE)

  # Replace hex codes one occurrence at a time, treating each code as a
  # literal string rather than a regular expression
  for (i in seq_along(hex_codes)) {
    hex_string <- sub(hex_codes[i], chars[i], hex_string, fixed = TRUE)
  }

  hex_string
}

# Test the function
sentence <- "D&#xe9;partement de Psychologie, Universit&#xe9; du Qu&#xe9;bec &#xe0; Montr&#xe9;al, Montr&#xe9;al"
converted_sentence <- convert_hex_to_char(sentence)
converted_sentence
#> [1] "Département de Psychologie, Université du Québec à Montréal, Montréal"
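The function above handles one string at a time. For a whole column of addresses, a vectorized variant may be more convenient. The following is only a sketch in base R; decode_hex_refs is a hypothetical name introduced here, and it relies on gregexpr() and the regmatches<- replacement function to swap every reference in every element.

```r
# Hypothetical helper: decode hexadecimal character references such as
# "&#xe9;" in each element of a character vector.
decode_hex_refs <- function(x) {
  pattern <- "&#[xX][0-9a-fA-F]+;"
  m <- gregexpr(pattern, x)

  # For each element, turn its matched references into characters;
  # elements without a match yield an empty vector and stay unchanged
  regmatches(x, m) <- lapply(regmatches(x, m), function(refs) {
    hex_digits <- sub("^&#[xX]([0-9a-fA-F]+);$", "\\1", refs)
    codes <- strtoi(hex_digits, base = 16L)
    vapply(codes, intToUtf8, character(1))
  })
  x
}

addresses <- c("Universit&#xe9; du Qu&#xe9;bec &#xe0; Montr&#xe9;al", "McGill University")
decode_hex_refs(addresses)
```

Decimal references (of the form "&#233;") and named entities are deliberately not handled in this sketch.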

Converting the hex characters to regular text is not enough, however, because the result still will not match the escaped Unicode names stored in the package data. Therefore, we also need to unescape the university names (simplified in the example below):

pattern <- stringi::stri_unescape_unicode(stringi::stri_escape_unicode("Université"))
grepl(pattern, converted_sentence)
#> [1] TRUE
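With several university names, the same idea can be applied to the whole vector at once. This is a sketch under assumptions: escaped_names stands in for the escaped names stored in the package data, and fixed (literal) matching is used so that characters such as "." or "(" in a name are not interpreted as regular-expression syntax.

```r
library(stringi)

# Stand-in for the escaped names shipped with the package (assumption)
escaped_names <- stri_escape_unicode(c(
  "Universit\u00e9 du Qu\u00e9bec \u00e0 Montr\u00e9al",
  "McGill University"
))

# An address after its hex references have been converted to characters
address <- "D\u00e9partement de Psychologie, Universit\u00e9 du Qu\u00e9bec \u00e0 Montr\u00e9al, Montr\u00e9al"

# Unescape the names once, then test each one against the address
names_utf8 <- stri_unescape_unicode(escaped_names)
found <- stri_detect_fixed(address, names_utf8)
names_utf8[found]
```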

Note: thanks to ChatGPT for this function. There is almost certainly an existing function or package that does this, and I will accept another answer with such a solution.

Created on 2023-09-03 with reprex v2.0.2
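Regarding an existing package: one commonly used idiom is to let an HTML parser do the decoding. The sketch below assumes the xml2 package is installed; decode_entities is a hypothetical wrapper, not an xml2 function. It wraps each string in a dummy element, parses it as HTML, and reads the text back.

```r
library(xml2)

# Hypothetical wrapper: decode character references by round-tripping
# each string through xml2's HTML parser.
decode_entities <- function(x) {
  vapply(x, function(s) {
    xml_text(read_html(paste0("<x>", s, "</x>")))
  }, character(1), USE.NAMES = FALSE)
}

decode_entities("D&#xe9;partement de Psychologie, Universit&#xe9; du Qu&#xe9;bec &#xe0; Montr&#xe9;al")
```

One caveat with this approach: anything in the input that looks like markup (for example a literal "<") is parsed as markup rather than kept as text, so it is best suited to plain address strings.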