I have a CSV file with lines like this in it:
...,"Städtische Galerie im Lenbachhaus",...
I am using Ruby 1.8, with the FasterCSV gem, like so:
FasterCSV.foreach(file, :encoding => 'u', :headers => :first_row) do |r|
  as = ImportObject.create!(r.to_hash)
end
For most rows it's working fine, but for rows like this one the field with the special character is getting truncated, so "St" is what gets saved in the db.
I have set $KCODE = "u" and tried with and without the :encoding option, to no avail.
The DB is MySQL.
EDIT:
I tried pushing the code up to Heroku (Postgres) and now getting a new error:
2011-02-19T17:19:01-08:00 app[web.1]: ActiveRecord::StatementInvalid (PGError: ERROR: invalid byte sequence for encoding "UTF8": 0xe46474
2011-02-19T17:19:01-08:00 app[web.1]: HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
2011-02-19T17:19:01-08:00 app[web.1]: : INSERT INTO "import_objects" (... "title", ...) VALUES (..., 'St?dtische Galerie im Lenbachhaus', ...) RETURNING "id"):
:(
The problem is likely a file encoding issue, as you have surmised. The most likely scenario is that your file is not actually encoded in UTF-8, so the rest of your application cannot make sense of the foreign encoding. It's also possible -- but I believe quite unlikely -- that one of the bytes used in the encoding doubles as a quote or comma in ASCII, which would confuse FasterCSV's parsing.
First, make a test file containing just the "problem row" from your CSV file. Next, read the raw data in; a minimal sketch, where problem_row.csv is a stand-in name for that test file:
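data = File.read("problem_row.csv")
puts data.inspect   # non-UTF-8 bytes show up as escapes like \344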
Now you have to convert it. The problem is, you don't know for sure what the source encoding is, so you'll have to try a few candidates. My best guess is the text is Latin-1 (ISO-8859-1) encoded: the 0xE4 byte in your Postgres error is 'ä' in Latin-1, which fits "Städtische". Ruby 1.8's standard library ships an Iconv class for exactly this; a sketch:
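require 'iconv'

# Convert from the guessed source encoding to UTF-8. If the result is
# still mangled, retry with another candidate such as WINDOWS-1252.
utf8_data = Iconv.conv('UTF-8', 'ISO-8859-1', data)
puts utf8_data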
Now try to import this converted data. Alternatively, you can write it back to disk and open it in an editor to see whether it's encoded properly. For example, reusing your original import loop (problem_row_utf8.csv is another stand-in name):
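File.open("problem_row_utf8.csv", "w") { |f| f.write(utf8_data) }

FasterCSV.foreach("problem_row_utf8.csv", :encoding => 'u', :headers => :first_row) do |r|
  ImportObject.create!(r.to_hash)
end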
But honestly, you can do this conversion outside of Ruby much more easily, e.g. with the iconv command-line tool (again assuming the source really is Latin-1):
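iconv -f ISO-8859-1 -t UTF-8 input.csv > output.csv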