I currently have the need to write a SQL query to determine the number of rows in my UTF8 Oracle database that are not compatible with another system that uses LATIN-1.

For example, Über should not return a result, but 翻译 should

I have tried queries such as:

select decode(convert(convert('Über test', 'WE8ISO8859P1'), 'UTF8'), convert('Über test', 'UTF8'), 1, 0) from dual;

However, this does not give me the result that I need. Can anyone provide a SQL-only solution to this problem? Thanks

2

There are 2 best solutions below

0
On BEST ANSWER

I think I have figured it out:

select * from (select asciistr(convert('test string goes here', 'UTF8')) as str from dual) where regexp_like(str, '.*\\([1-9A-F]|0[1-9A-F]).*');

Using http://en.wikipedia.org/wiki/Latin-1_Supplement_%28Unicode_block%29 as a reference, the LATIN-1 block of unicode ends at \00FF.

For example,

SQL> select * from (select asciistr(convert('翻译', 'UTF8')) as str from dual) where regexp_like(str, '.*\\([1-9A-F]|0[1-9A-F]).*');

STR
------------------------------
\7FFB\8BD1

If someone could double-check this from a logical standpoint, I would appreciate it.

1
On

You didn't bother to mention an Oracle version. Up to 11.2, you should use the Oracle provided character set scanner (CSSCAN) for this purpose. Starting with 12.1, there's new utility called Oracle Database Migration Assistant for Unicode.