Problems trying to parse CSV file with umlaut's etc in it using Ruby 1.8 / FasterCSV

2.4k Views Asked by At

I have a CSV file with lines like this in it:

...,"Städtische Galerie im Lenbachhaus",...

I am using Ruby 1.8, with the FasterCSV gem, like so:

FasterCSV.foreach(file, :encoding => 'u', :headers => :first_row) do |r|
    as = ImportObject.create!(r.to_hash)
end

For most rows its working fine, but for these rows the field with the special character is getting truncated, so we get "St" saved in the db.

I have put $KCODE="u" and with/without the encoding option, to no avail.

The DB is MySQL.

EDIT:

I tried pushing the code up to Heroku (Postgres) and now getting a new error:

2011-02-19T17:19:01-08:00 app[web.1]: ActiveRecord::StatementInvalid (PGError: ERROR: invalid byte sequence for encoding "UTF8": 0xe46474

2011-02-19T17:19:01-08:00 app[web.1]: HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".

2011-02-19T17:19:01-08:00 app[web.1]: : INSERT INTO "import_objects" (... "title", ...) VALUES (..., 'St?dtische Galerie im Lenbachhaus', ...) RETURNING "id"):

:(

3

There are 3 best solutions below

1
wuputah On BEST ANSWER

The problem is likely a file encoding issue, as you have surmised. The most likely scenario is your file is not actually encoded with UTF-8, so the rest of your application cannot recognize the foreign encoding. It's also possible -- but I believe quite unlikely -- that one of the bytes used in the encoding is a quote or comma in ASCII, which will mess up FasterCSV parsing the data.

First, make a test file with just the "problem row" in your CSV file. Next, read the data in the file:

text_in = File.read('data.csv')

Now you have to convert it. The problem is, you don't really know what it is. You'll have to try a few different things. My best guess is the text is Latin-1 encoded.

require 'iconv'
text_out = Iconv.conv("UTF8", "LATIN1", text_in)

Now try to import this data. Alternatively, you can write to disk and open it, and see if it's encoded properly.

But honestly, you can do this outside of Ruby much more easily.

$ iconv -t UTF8 -f LATIN1 < data.csv > data_conv.csv

Further reading:

2
wuputah On

The problem is not FasterCSV, as in my testing, FasterCSV does not have a problem reading this data. For instance:

>> FasterCSV.parse("a,Städtische Galerie im Lenbachhaus,b,ä", :headers => [:a,:b,:c,:d]) do |r|
|    r = r.to_hash
|    p r
|    puts r[:d]
|  end  
{:c=>"b", :a=>"a", :d=>"\303\244", :b=>"Städtische Galerie im Lenbachhaus"}
ä

Note that Ruby 1.8 doesn't handle unicode characters properly, but principally this affects things like String#length. For instance, Ruby will return the length of this string as 34 instead of 33. However this doesn't have an affect until you do something with the string, like run a validation on it.

>> "Städtische Galerie im Lenbachhaus".length
=> 34
>> "Stadtische Galerie im Lenbachhaus".length
=> 33

So my guess is it's something about ImportObject or how your database connection is configured.


Ruby version used in these tests:

>> RUBY_DESCRIPTION 
=> "ruby 1.8.7 (2010-04-19 patchlevel 253) [i686-darwin10.4.0], MBARI 0x6770, Ruby Enterprise Edition 2010.02"
1
the Tin Man On

You don't say what database type you're using, but it's very possible the DB is not configured for UTF-8, and instead is expecting ASCII. Throwing non-ASCII characters at it could result in a truncated string, a missing character, or a character replaced with a placeholder, depending on the database and what gem or ORM you're using to talk to it. When I build a database I make sure it's configured for UTF-8, or, I make sure the text I push into it is encoded so it can make a round-trip without corruption or loss. I learned that lesson the same way you are, the hard way.

Check the database's log, and/or, check your code to see whether you can enable logging and error and warning messages for the database inserts.

It's easy to disable warnings and errors with a lot of databases, but during development you don't want to do that. Those messages are important and can signal big problems to come. Ignoring them and pushing code to production can be a real recipe for sleepless nights.