We have several kinds of hyphens and dashes in text stored in our database. Before comparing that text with user input, I have to normalize every kind of dash or hyphen to a simple hyphen-minus (ASCII 45).
The possible dashes we have to convert are:
Minus (−) U+2212: &minus;, &#8722; or &#x2212;
Hyphen-minus (-) U+002D
Hyphen (‐) U+2010
Soft hyphen U+00AD
Non-breaking hyphen (‑) U+2011
Figure dash (‒) U+2012 (8210): &#8210; or &#x2012;
En dash (–) U+2013 (8211): &ndash;, &#8211; or &#x2013;
Em dash (—) U+2014 (8212): &mdash;, &#8212; or &#x2014;
Horizontal bar (―) U+2015 (8213): &#8213; or &#x2015;
All of these have to be converted to hyphen-minus (-) using gsub. I used the CharDet gem to detect the character encoding of the fetched string; it reports windows-1252. I then tried Iconv to convert the string to ASCII, but it raises Iconv::IllegalSequence.
ruby -v => ruby 1.8.7 (2009-06-12 patchlevel 174) [i686-darwin9.8.0]
rails -v => Rails 2.3.5
mysql encoding => 'latin1'
Any idea how to accomplish this?
Caveat: I know nothing about Ruby, but you have problems that have nothing to do with the programming language you are using.
You don't need to convert hyphen-minus (-) U+002D to "simple hyphen/minus (ascii 45)"; they're the same thing.
You believe that the database encoding is latin1. The statement "My data is encoded in ISO-8859-1 aka latin1" is up there with "The check is in the mail" and "Of course I'll still love you in the morning". All it tells you is that it is a single-byte-per-character encoding.
Presuming that "fetched string" means "byte string extracted from the database", chardet is very likely quite right in reporting windows-1252 (aka cp1252), although this may be by accident, as chardet sometimes seems to report that as a default when it has exhausted other possibilities.
(a) These Unicode characters cannot be encoded in latin1, cp1252 or ascii:
Minus (−) U+2212
Hyphen (‐) U+2010
Non-breaking hyphen (‑) U+2011
Figure dash (‒) U+2012
Horizontal bar (―) U+2015
What gives you the impression that they may possibly appear in the input or in the database?
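A claim like (a) can be checked mechanically. The following is only a sketch, and it assumes Ruby 1.9 or later, because it relies on String#encode, which Ruby 1.8.7 does not have. The helper name encodable_in and the constants are made up for illustration; the encoding names are the ones Ruby uses for ascii, latin1 and cp1252.

```ruby
# Assumes Ruby 1.9 or later; String#encode does not exist in Ruby 1.8.7.

# Names as Ruby spells them: latin1 = ISO-8859-1, cp1252 = Windows-1252.
ENCODINGS = %w[US-ASCII ISO-8859-1 Windows-1252]

# Hypothetical helper: returns the subset of ENCODINGS that can represent char.
def encodable_in(char)
  ENCODINGS.select do |name|
    begin
      char.encode(name)
      true
    rescue Encoding::UndefinedConversionError
      false
    end
  end
end

# The five characters from group (a), written as escapes.
GROUP_A = ["\u2212", "\u2010", "\u2011", "\u2012", "\u2015"]

GROUP_A.each do |char|
  # Prints the code point and an empty list for each character in this group.
  printf("U+%04X -> %s\n", char.ord, encodable_in(char).inspect)
end

# Hyphen-minus, for contrast: representable in all three encodings.
p encodable_in("-")
```

If none of the three encodings can represent a character, a single-byte database column in any of them cannot be holding that character as such.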
(b) These Unicode characters can be encoded in cp1252 but not in latin1 or ascii:
En dash (–) U+2013 (byte 0x96 in cp1252)
Em dash (—) U+2014 (byte 0x97 in cp1252)
These (most likely the EN DASH) are what you really need to convert to an ascii hyphen/dash. What was in the string that chardet reported as windows-1252?
(c) This can be encoded in cp1252 and latin1 but not in ascii:
Soft hyphen U+00AD (byte 0xAD in both)
If a string contains non-ASCII characters, any attempt (using iconv or any other method) to convert it to ascii will fail, unless you use some kind of "ignore" or "replace with ?" option. Why are you trying to do that?