Read a GB2312-encoded page using Ruby


I am trying to parse a GB2312-encoded page (http://news.qq.com/a/20140824/015032.htm).

I have not reached the parsing step yet; the error occurs as soon as I open and read the page.

This is my code:

require 'open-uri'
open("http://news.qq.com/a/20140824/015032.htm").read

And this is the error:

Encoding::InvalidByteSequenceError: "\x8B" on GB2312

I am using Ruby 2.0.0p247.

Any solution?

There are 3 answers below.

Answer 1 (best answer)

I don't know exactly why this happens when calling .read, but you can work around it if you are using Nokogiri. Just pass the file object directly to Nokogiri without calling .read:

require 'open-uri'
require 'nokogiri'
file = open("http://news.qq.com/a/20140824/015032.htm")
document = Nokogiri(file)

Answer 2

You can try this, which tells Nokogiri to decode the page as GB18030 (a superset of GB2312):

document = Nokogiri::HTML(open("http://news.qq.com/a/20140824/015032.htm"), nil, "GB18030")
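Why switching to GB18030 can help: pages labelled GB2312 sometimes contain characters that only exist in the larger GBK/GB18030 repertoire, and those bytes are not valid GB2312. Whether that is what happens on this particular page is an assumption. The sketch below uses only the Ruby standard library and a sample character (U+9F8D, chosen here purely for illustration) to show the difference between the two labels.

```ruby
# U+9F8D is a traditional character that GB18030 can encode,
# but that is not part of the GB2312 character set.
bytes = "\u9F8D".encode('GB18030')

# Label the same raw bytes with each encoding (no conversion happens here).
as_gb2312  = bytes.dup.force_encoding('GB2312')
as_gb18030 = bytes.dup.force_encoding('GB18030')

# Check whether the bytes form a valid sequence under each label.
puts "valid as GB2312:  #{as_gb2312.valid_encoding?}"
puts "valid as GB18030: #{as_gb18030.valid_encoding?}"

# Converting from GB18030 to UTF-8 recovers the original character.
round_trip = as_gb18030.encode('UTF-8')
puts "round trip ok:    #{round_trip == "\u9F8D"}"
```

Because GB18030 is a superset of GB2312, decoding a genuinely GB2312 page as GB18030 should give the same text, which is why it is a reasonable fallback label.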

Answer 3

I cannot reproduce the error using Ruby 2.0.0p247.

require 'open-uri'
open("http://news.qq.com/a/20140824/015032.htm").read

Works fine.

However, this:

require 'open-uri'
open("http://news.qq.com/a/20140824/015032.htm").read.encode('utf-8')

raises the error:

Encoding::InvalidByteSequenceError: "\x8B" on GB2312

Are you trying to do some encoding conversion?
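If a conversion to UTF-8 is the goal, a more forgiving approach is to relabel the raw bytes as GB18030 and replace anything that still cannot be converted. The following is a minimal sketch under that assumption; `to_utf8` is a hypothetical helper name, not part of open-uri or Nokogiri, and the sample bytes stand in for a downloaded page body.

```ruby
# Hypothetical helper: convert raw page bytes to UTF-8.
# source_encoding defaults to GB18030, a superset of GB2312.
def to_utf8(raw, source_encoding = 'GB18030')
  raw.dup
     .force_encoding(source_encoding)   # relabel the bytes, no conversion yet
     .encode('UTF-8',
             invalid: :replace,         # replace malformed byte sequences
             undef:   :replace,         # replace characters with no UTF-8 mapping
             replace: '?')
end

# Sample input: the GB2312 bytes for U+4E2D U+6587, followed by a stray 0xFF byte.
sample = "\xD6\xD0\xCE\xC4\xFF".b
converted = to_utf8(sample)
puts converted
```

With real data the same helper could be applied to the string returned by `.read`. Replacing bad bytes hides the problem rather than explaining it, so it is best suited to cases where a few lost characters are acceptable.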