Ignore Arabic diacritics in Oracle Text indexing


In Oracle Database 12c R2, I am storing mainly Arabic text and searching it with Oracle Text CONTAINS queries.

In Arabic, some characters are used interchangeably and should be treated as equal when searching.

Ex1: the following characters (أ - إ - آ - ا) should be treated the same.

Ex2: the characters within each of these groups should also be treated the same (ي - ى) , (ة - ه).

Also, diacritics (referred to as Tashkeel) should be ignored.

Ex3: ( َ - ً - ُ - ِ - ٍ - ّ - ْ - ـ) all should be ignored.

When I use AUTO_LEXER with the language attribute set to ARABIC (or use BASIC_LEXER) and enable BASE_LETTER, the characters in Ex1 are treated as equal, but these settings do not change the behavior for the character groups in Ex2 and Ex3.

Is there a way to tune or extend the base_letter transformation so that Ex2 and Ex3 behave like Ex1, or any other solution that does not involve modifying the text on insert?

Here is a code sample:

Create Table DOCUMENT(SUBJECT VARCHAR2(4000 CHAR));

begin
 ctx_ddl.create_preference('my_lexer','AUTO_LEXER');
 ctx_ddl.set_attribute('my_lexer','language','ARABIC');
 ctx_ddl.set_attribute('my_lexer','base_letter','YES');
end;
/


insert into DOCUMENT(SUBJECT) VALUES ('السيد أحمد')  ;
insert into DOCUMENT(SUBJECT) VALUES ('سيادة القاضي')  ;
commit;


create index IX_FULLTEXT_SUBJECT on DOCUMENT (SUBJECT)
  indextype is CTXSYS.CONTEXT
  parameters('SYNC(ON COMMIT) lexer my_lexer');


select * from DOCUMENT  where contains(SUBJECT,'احمد') > 0 ; -- this will return a result
select * from DOCUMENT  where contains(SUBJECT,'القاضى') > 0; -- this won't return a result
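To check which normalizations actually reached the index, the stored tokens can be inspected directly. This is a minimal sketch, assuming the index name from the sample above; Oracle Text keeps the tokens of a CONTEXT index in a table named DR$<index name>$I, owned by the index owner.

```sql
-- List the tokens exactly as the lexer stored them.
-- If base_letter folded a character, the folded form is what appears here.
select token_text
from   DR$IX_FULLTEXT_SUBJECT$I
order  by token_text;
```

Comparing these tokens with the query terms shows whether a mismatch comes from the index side (a character was not folded when indexing) or from the query side.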

Note: I have NLS_LANG set to "ARABIC_UNITED ARAB EMIRATES.AR8MSWIN1256".

Thanks in advance.

Edit: I have already tried the base_letter_type attribute as well, to no avail:

ctx_ddl.set_attribute('my_lexer','base_letter_type','SPECIFIC');
Answer by Cee McSharpface:

Set the BASE_LETTER_TYPE attribute to SPECIFIC. Its default value, GENERIC, does not apply language-specific rules.

From the documentation:

The SPECIFIC value means that a base-letter transformation that has been specifically defined for your language will be used. This enables you to use accent-sensitive searches for words in your own language, while ignoring accents that are from other languages.

It is important to understand that this affects the actual index content (index tokens are stored with "diacritics" removed), and not just the query as it runs. The full text index must be rebuilt for this to take effect.

ctx_ddl.set_attribute('my_lexer','base_letter_type','SPECIFIC');
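Since the change only takes effect once the index is rebuilt, the full sequence might look like the sketch below. It assumes the preference and index names from the question (my_lexer, IX_FULLTEXT_SUBJECT) and has not been run against Arabic data here, so whether it changes the Ex2 and Ex3 behavior still needs to be verified on the target database.

```sql
-- Update the existing lexer preference.
begin
  ctx_ddl.set_attribute('my_lexer','base_letter_type','SPECIFIC');
end;
/

-- Rebuild the index so existing rows are re-tokenized with the updated lexer.
-- Changing the preference alone does not touch an index that already exists.
alter index IX_FULLTEXT_SUBJECT rebuild parameters('replace lexer my_lexer');
```

Dropping the index and running the original create index statement again is an alternative that also picks up the updated preference.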