How to determine whether a string is encoded as UTF-8 or cp1252?


Is there a way in Perl to determine whether a string is encoded as UTF-8 or cp1252?


There are 2 best solutions below

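# True if $string is well-formed UTF-8 (a copy is decoded, so $string itself is left untouched).
my $could_be_utf8 = utf8::decode( my $tmp = $string );

# True if $string contains none of the five byte values that cp1252 leaves undefined.
my $could_be_cp1252 = $string !~ /[\x81\x8D\x8F\x90\x9D]/;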

If you need to handle a string that contains a mix of both, see Fixing a file consisting of both UTF-8 and Windows-1252.
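
The two checks above can be combined into a small classifier. This is only a minimal sketch: the name classify_encoding is hypothetical, and it assumes the input holds raw, undecoded bytes. Note that many inputs (pure ASCII, and most valid UTF-8) pass both checks, so "ambiguous" is a common result.

```perl
use strict;
use warnings;

# Hypothetical helper: report which of the two encodings a byte string
# could be. Assumes $bytes holds raw, undecoded octets.
sub classify_encoding {
    my ($bytes) = @_;

    # utf8::decode works in place and returns true only for well-formed
    # UTF-8, so run it on a copy to leave the caller's data untouched.
    my $copy = $bytes;
    my $could_be_utf8 = utf8::decode($copy) ? 1 : 0;

    # These five byte values are unassigned in Windows-1252.
    my $could_be_cp1252 = $bytes !~ /[\x81\x8D\x8F\x90\x9D]/ ? 1 : 0;

    return 'ambiguous' if $could_be_utf8 && $could_be_cp1252;
    return 'utf-8'     if $could_be_utf8;
    return 'cp1252'    if $could_be_cp1252;
    return 'neither';
}

print classify_encoding("caf\xC3\xA9"), "\n";  # valid UTF-8, but also legal cp1252 bytes
print classify_encoding("caf\xE9"), "\n";      # a lone 0xE9 is not valid UTF-8
```

When the result is "ambiguous", valid UTF-8 is usually the safer bet, since cp1252 text with non-ASCII characters rarely happens to form well-formed UTF-8 sequences.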


The core module Encode::Guess should be up to the task for this:

use Encode::Guess;

my $enc = guess_encoding($data, qw(cp1252));  # utf8 among defaults

and then

ref($enc) or die "Can't guess: $enc"; # on failure $enc is an error string, not an object
my $utf8 = $enc->decode($data);

(from docs).

To avoid also using the default suspects ("ascii, utf8 and UTF-16/32 with BOM"), set the suspect list first

Encode::Guess->set_suspects(qw(utf8 cp1252));

and then get the encoding

my $enc = guess_encoding($data);

Or, copied from the docs

my $decoder = Encode::Guess->guess($data);
die $decoder unless ref($decoder);
my $utf8 = $decoder->decode($data);

See documentation for details.
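
Putting the pieces above together, a guess can fail with an error string instead of an encoding object, for instance when more than one suspect decodes the data cleanly. The sketch below shows one way to handle that. The name decode_guessed and the fallback policy (strict UTF-8 first, then cp1252) are assumptions of this example, not something Encode::Guess does itself.

```perl
use strict;
use warnings;
use Encode;
use Encode::Guess;

# Hypothetical helper: decode raw octets, guessing between UTF-8 and cp1252.
sub decode_guessed {
    my ($octets) = @_;

    # guess_encoding returns an encoding object on success,
    # or a plain error string when it cannot settle on one suspect.
    my $enc = guess_encoding($octets, qw(cp1252));
    return $enc->decode($octets) if ref $enc;

    # Assumed fallback policy (not part of Encode::Guess):
    # accept the data as UTF-8 if it decodes strictly, else treat it as cp1252.
    my $text = eval {
        decode('UTF-8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return defined $text ? $text : decode('cp1252', $octets);
}

binmode STDOUT, ':encoding(UTF-8)';
print decode_guessed("caf\xE9 au lait"), "\n";  # 0xE9 is not valid UTF-8 here
print decode_guessed("caf\xC3\xA9"), "\n";      # well-formed UTF-8
```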


There are plenty of differences between the two encodings; see the comment by tripleee and, for example, this post.