"\x9D" to UTF-8 in conversion from Windows-1252 to UTF-8


I have created a CSV uploader in my Rails app, but sometimes I get this error:

"\x9D" to UTF-8 in conversion from Windows-1252 to UTF-8

This is the source of my uploader:

def self.import(file)
  CSV.foreach(file.path, headers: true, encoding: "windows-1252:utf-8") do |row|
    title = row[1]
    row[1] = title.to_ascii
    description = row[2]
    row[2] = description.to_ascii
    Event.create! row.to_hash
  end
end

I am using the unidecoder gem (https://github.com/norman/unidecoder) to normalize any goofy characters that a user may input. I've run into this error a few times, but can't determine how to fix it. I thought the encoding: "windows-1252:utf-8" option would fix the problem, but it made no difference.

Thanks stack!


There are 2 best solutions below

3
On

Byte 0x9D is not assigned to any character in Windows-1252 (neither are 0x81, 0x8D, 0x8F, and 0x90). That means your text is not actually Windows-1252 encoded, or else the source text is corrupt.
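One plausible cause (an assumption, since the original file isn't shown) is that some uploads are already UTF-8: a curly closing quote in UTF-8 is the bytes E2 80 9D, and that trailing 0x9D is exactly what Windows-1252 cannot map. A minimal sketch of a fallback decoder follows; to_utf8 is a hypothetical helper name, not part of the asker's code or of any library.

```ruby
# Hypothetical helper: treat the bytes as UTF-8 when they are valid UTF-8,
# otherwise decode them as Windows-1252 and replace unassigned bytes with "?".
def to_utf8(raw)
  utf8 = raw.dup.force_encoding("UTF-8")
  return utf8 if utf8.valid_encoding?

  raw.dup.force_encoding("Windows-1252")
     .encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
end

# Usage sketch: read the upload as raw bytes, then normalize before parsing.
# text = to_utf8(File.binread(file.path))
# CSV.parse(text, headers: true) { |row| ... }
```

The replacement character loses information, so this only makes sense when an occasional "?" is acceptable; if it isn't, rejecting the upload with a clear message is probably the safer choice.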

0
On

I was running into this error when reading the contents of a URL:

table = CSV.parse(URI.open(document.url).read)

It turns out the API I am fetching from conditionally returns gzipped data when the file is too large.

Another annoying thing is that the Rails decompression helper was then failing with a UTF-8 validity error.

This did NOT work:

ActiveSupport::Gzip.decompress(URI.open(document.url).read)

This did work:

Zlib::GzipReader.wrap(URI.open(document.url), &:read)
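Because the API only gzips some responses, it may help to check for the gzip magic bytes (0x1F 0x8B) before decompressing. This is a rough sketch; maybe_gunzip and GZIP_MAGIC are names made up for illustration.

```ruby
require "stringio"
require "zlib"

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = "\x1F\x8B".b

# Hypothetical helper: gunzip the body only if it looks gzipped,
# otherwise return it unchanged.
def maybe_gunzip(body)
  bytes = body.b
  return body unless bytes.start_with?(GZIP_MAGIC)

  Zlib::GzipReader.new(StringIO.new(bytes)).read
end

# Usage sketch:
# table = CSV.parse(maybe_gunzip(URI.open(document.url).read))
```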

My next problem was that CSV.parse() reads the entire blob at once, and I had a single line with encoding errors, so I switched to parsing line by line:

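require "csv"
require "open-uri"
require "tempfile"
require "zlib"

# Decompress the gzipped response and spool it to a temp file.
downloaded_file = StringIO.new(Zlib::GzipReader.wrap(URI.open(document.url), &:read))
tempfile = Tempfile.new("open-uri", binmode: true)
IO.copy_stream(downloaded_file, tempfile.path)
headers = nil
# Parse one line at a time so a single bad line cannot abort the whole import.
File.foreach(tempfile.path) do |line|
  # Replace invalid UTF-8 byte sequences before handing the line to CSV.
  line = line.force_encoding("UTF-8").scrub
  row = []
  if headers.blank?
    # Options are passed as keywords; a positional hash raises on Ruby 3.
    headers = CSV.parse_line(line, col_sep: "\t", liberal_parsing: true)
  else
    line_data = CSV.parse_line(line, col_sep: "\t", liberal_parsing: true)
    row = headers.zip(line_data)
  end
  puts row.inspect
  ... # do a lot more stuff
end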
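The line-by-line approach can be boiled down to a small, self-contained sketch. each_tsv_row is a hypothetical helper name, and it assumes no field contains an embedded newline, since splitting on lines would break quoted multi-line fields.

```ruby
require "csv"
require "stringio"

# Hypothetical helper: parse tab-separated text one line at a time,
# replacing invalid UTF-8 bytes with "?" and yielding header => value hashes.
def each_tsv_row(io)
  headers = nil
  io.each_line do |line|
    clean = line.dup.force_encoding("UTF-8").scrub("?")
    fields = CSV.parse_line(clean, col_sep: "\t", liberal_parsing: true)
    next if fields.nil? || fields.empty? # skip blank lines

    if headers.nil?
      headers = fields # first non-blank line is the header row
    else
      yield headers.zip(fields).to_h
    end
  end
end

# Usage sketch:
# File.open(tempfile.path, "rb") { |f| each_tsv_row(f) { |row| puts row.inspect } }
```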

wow.