extract corrupted strings


I received a file with an unusual encoding and wondered whether there is a way to check for 'corrupted' strings. For example:

dat <- c("天脊煤化工集团股份有é\231\220å…¬å\217¸", "AB \"\"Achema\"\"", 
         "Abu Qir Fertilizers & Chemical", "Abu Zaabal Fertilizer &", 
         "ADP - Adubos De Portugal SA")

The first and second elements of the vector above are corrupted: the first contains garbled, mis-encoded characters and the second contains stray escaped quotes. How can I filter these out, or generate an index of the corrupted strings in the vector dat?


There are 2 best solutions below

Best answer
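# Flag elements that cannot be converted to ASCII (iconv() returns NA for them)
# or that contain a literal backslash or double quote.
error_string_idx <- which(
  is.na(iconv(dat, to = "ASCII")) | grepl('\\\\|\\"', dat)
)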
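As a cross-check that does not depend on how `iconv()` behaves on a given platform or locale, the same two tests can be sketched with byte inspection. This is only an illustrative alternative, not the answerer's method; the sample vector `x` and the helper names `has_non_ascii`, `has_quote_or_backslash`, and `corrupted` are assumptions made for the example.

```r
# Hypothetical sample: element 1 holds non-ASCII characters,
# element 2 holds literal double quotes, element 3 is clean.
x <- c("\u5929\u810a Co", "AB \"\"Achema\"\"", "Abu Qir Fertilizers & Chemical")

# TRUE when any byte of the string is above 127, i.e. outside ASCII
has_non_ascii <- vapply(
  x,
  function(s) any(as.integer(charToRaw(s)) > 127L),
  logical(1),
  USE.NAMES = FALSE
)

# TRUE when the string contains a literal backslash or a double quote
has_quote_or_backslash <- grepl("\\", x, fixed = TRUE) | grepl("\"", x, fixed = TRUE)

corrupted <- has_non_ascii | has_quote_or_backslash

which(corrupted)  # index of suspect strings
x[!corrupted]     # strings that passed both checks
```

Keeping the logical vector `corrupted` around makes it possible to produce either the index (`which(corrupted)`) or the filtered vector (`x[!corrupted]`) from one computation.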

Try this, which removes every character that is not an ASCII letter:

gsub("[^a-zA-Z]", "", dat)

If you don't want empty strings in the result, use:

Filter(function(x) nchar(x) > 0, gsub("[^a-zA-Z]", "", dat))
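The substitution above rewrites the strings rather than reporting which ones are affected. If an index is what is needed, the same character-class idea can be turned into a test with `grepl()`. The following is a minimal sketch under stated assumptions: the sample vector `x` is hypothetical, and the whitelist of allowed characters (ASCII letters, digits, space, ampersand, period, hyphen) is a guess that would need adjusting to the real data.

```r
# Hypothetical sample data
x <- c("\u5929\u810a Co", "AB \"\"Achema\"\"",
       "Abu Qir Fertilizers & Chemical", "ADP - Adubos De Portugal SA")

# Assumed whitelist: ASCII letters, digits, space, ampersand, period, hyphen.
# A string is flagged when it contains any character outside that set.
bad <- grepl("[^A-Za-z0-9 &.-]", x, perl = TRUE)

which(bad)  # index of flagged strings
x[!bad]     # strings made up only of whitelisted characters
```

A whitelist is stricter than the ASCII check: it also flags ordinary punctuation that was not anticipated, so the allowed set should be reviewed against a sample of known-good names.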