How to identify non utf8 symbols in a range of text?

162 Views Asked by At

In our database we have Latin and Cyrillic alphabets and all is fine with those characters. But, sometimes people enter characters that appear as ? or squares. Is there a way to find those specific symbols, because looking manually through a couple of tables with >400 000 records for 1 mistake is insane.

I have looked for online solutions that I can load my data as text, I have tried notepad++, but either I get all Cyrillic marked or I don't get anything.

EDIT: L..D. is an example.

1

There are 1 best solutions below

14
psaraj12 On BEST ANSWER

Based on the ASCII value of the character you can filter the text.Here i am assuming all characters greater than 65534 as outside the normal range.You can modify the range according to your requirement.The db fiddle here

 with test (col) as (
  select 
    'L.A.D' 
  from 
    dual 
  union all 
  select 
    'L..D.' 
  from 
    dual
) 
select 
  col, 
  case when max(ascii_of_one_character) >= 65535 then 'NOT OK' else 'OK' end result 
from 
  (
    select 
      col, 
      substr(col, column_value, 1) one_character, 
      ascii(
        substr(col, column_value, 1)
      ) ascii_of_one_character 
    from 
      test cross 
      join table(
        cast(
          multiset(
            select 
              level 
            from 
              dual connect by level <= length(col)
          ) as sys.odcinumberlist
        )
      )
  ) 
group by 
  col
  having max(ascii_of_one_character) >= 65535;