How do I open a downloaded .csv file in R that contains both correct and faulty accented characters?

I have a .csv file that contains both correctly encoded and garbled accented characters. For example, the first line has "Veríssimo", and the second has "VirgÃ-nia" (which should be "Virgínia"). If I do nothing, the file opens with "Virgínia" misspelled. If I try one of the correction methods I know, such as saving the file with UTF-8 encoding, then "Veríssimo" is misspelled instead.

In R, I tried dados_MG2 <- read_csv("dados_MG.csv"), which detects UTF-8 encoding and opens the file with "Veríssimo" misspelled.

I also tried forcing a different encoding with dados_MG <- read_csv("Dados/extra/dados_MG.csv", locale = locale(encoding = "ISO-8859-1")). With it, "Veríssimo" is spelled correctly, but "Virgínia" is not.

Here is the link to my dataset: https://github.com/elisa-fink/THM

Robert Hacken On BEST ANSWER

You can use nchar to find strings with invalid UTF-8 characters and replace these with strings read using the ISO-8859-1 encoding:

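library(readr)

# Read the file twice: once as UTF-8 (the default) and once as ISO-8859-1
dados_MG <- read_csv('dados_MG.csv')
dados_MG.iso <- read_csv('dados_MG.csv', locale = locale(encoding = 'ISO-8859-1'))

# With allowNA = TRUE, nchar() returns NA for invalid multibyte strings
not.utf <- is.na(nchar(dados_MG$DS_NOME, allowNA = TRUE))
# Take those entries from the ISO-8859-1 reading instead
dados_MG$DS_NOME[not.utf] <- dados_MG.iso$DS_NOME[not.utf]

grep('^Ver.ss|^Virg.ni', dados_MG$DS_NOME, value = TRUE)
# [1] "Veríssimo" "Virgínia"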

A simpler variant that reads the CSV file only once (inspired by @rps1227's answer):

dados_MG <- read_csv('dados_MG.csv')
not.utf <- is.na(nchar(dados_MG$DS_NOME, allowNA = TRUE))
# Convert only the invalid entries from ISO-8859-1 to UTF-8
dados_MG$DS_NOME[not.utf] <- 
  iconv(dados_MG$DS_NOME[not.utf], from = 'ISO-8859-1', to = 'UTF-8')
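
For anyone who wants to try the idea without the dataset, here is a minimal base R sketch on a small made-up vector. The vector nomes is hypothetical and only mimics the question's symptom (one name stored as ISO-8859-1 bytes, one as UTF-8). It also swaps the nchar() check for base R's validUTF8(), which looks at the bytes directly and ignores any declared encoding; that substitution is my assumption, not part of the code above. It further assumes that every invalid entry really is ISO-8859-1.

```r
# Hypothetical sample mirroring the question: the first name holds a raw
# ISO-8859-1 byte (0xED for "í"), the second is proper UTF-8.
nomes <- c("Ver\xedssimo", "Virg\u00ednia")

# validUTF8() checks the bytes themselves, ignoring any declared encoding.
bad <- !validUTF8(nomes)

# Re-encode only the invalid entries, leaving valid UTF-8 untouched.
fixed <- nomes
fixed[bad] <- iconv(nomes[bad], from = "ISO-8859-1", to = "UTF-8")
```

After this, bad flags only the first element, and both entries of fixed should be valid UTF-8. Converting the whole vector instead of just the flagged entries would re-encode the name that was already correct, which is the trade-off described in the question.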

rps1227 On

Another option that only requires reading the file once and uses base::iconv() to convert the names that are not encoded in UTF-8 in the original file:

library(readr)
library(dplyr)

dados_MG <- read_csv("./dados_MG.csv") %>%
  # Flag names that are not valid UTF-8 (nchar() returns NA for them)
  mutate(encode_issues = is.na(nchar(DS_NOME, allowNA = TRUE))) %>%
  # Convert only the flagged names from ISO-8859-1 to UTF-8
  mutate(DS_NOME = if_else(encode_issues,
                           iconv(DS_NOME, from = "ISO-8859-1",
                                 to = "UTF-8"),
                           DS_NOME)) %>%
  select(-encode_issues)

grep('^Ver.ss|^Virg.ni', dados_MG$DS_NOME, value = TRUE)
#> [1] "Veríssimo" "Virgínia"

Created on 2023-09-04 with reprex v2.0.2
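
If more than one text column is affected, the same repair can be wrapped in a small helper and applied to every character column. The sketch below is only an illustration: fix_mixed_encoding is a hypothetical name, the data frame is made up (only DS_NOME comes from the question, CD is invented), and it assumes that any string that is not valid UTF-8 is ISO-8859-1.

```r
# Hypothetical helper: re-encode only the strings that are not valid UTF-8,
# assuming they are ISO-8859-1. NA values are left alone.
fix_mixed_encoding <- function(x) {
  bad <- !is.na(x) & !validUTF8(x)
  x[bad] <- iconv(x[bad], from = "ISO-8859-1", to = "UTF-8")
  x
}

# Made-up data frame standing in for the real file.
df <- data.frame(
  DS_NOME = c("Ver\xedssimo", "Virg\u00ednia"),
  CD = c(1L, 2L),
  stringsAsFactors = FALSE
)

# Apply the helper to character columns only; other columns are untouched.
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], fix_mixed_encoding)
```

Keeping the check per string, rather than per column or per file, is what lets correctly encoded names pass through unchanged.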