In Oracle Database 12c R2, I am storing mainly Arabic text and searching it with Oracle Text CONTAINS.
In Arabic, some different characters are used interchangeably and should be treated equally when searched for.
Ex1: the following characters (أ - إ - آ - ا) should be treated the same.
Ex2: the characters within each of these groups should also be treated the same: (ي - ى), (ة - ه).
Also, diacritics (referred to as Tashkeel) should be ignored.
Ex3: ( َ - ً - ُ - ِ - ٍ - ّ - ْ - ـ) all should be ignored.
When I use AUTO_LEXER with the language attribute set to Arabic (or use BASIC_LEXER) and enable BASE_LETTER, the characters in Ex1 are treated as equal, but these settings do not change the behavior for the character groups in Ex2 and Ex3.
Is there a way to tune or extend the base_letter transformation so that Ex2 and Ex3 behave like Ex1, or is there any other solution that does not involve modifying the text on insert?
Here is a code sample:
Create Table DOCUMENT(SUBJECT VARCHAR2(4000 CHAR));
begin
ctx_ddl.create_preference('my_lexer','AUTO_LEXER');
ctx_ddl.set_attribute('my_lexer','language','ARABIC');
ctx_ddl.set_attribute('my_lexer','base_letter','YES');
end;
/
insert into DOCUMENT(SUBJECT) VALUES ('السيد أحمد') ;
insert into DOCUMENT(SUBJECT) VALUES ('سيادة القاضي') ;
commit;
create index IX_FULLTEXT_SUBJECT on DOCUMENT (SUBJECT)
indextype is CTXSYS.CONTEXT
parameters('SYNC(ON COMMIT) lexer my_lexer');
select * from DOCUMENT where contains(SUBJECT,'احمد') > 0 ; -- this will return a result
select * from DOCUMENT where contains(SUBJECT,'القاضى') > 0; -- this won't return a result
Note: I have NLS_LANG set to "ARABIC_UNITED ARAB EMIRATES.AR8MSWIN1256"
Thanks in advance.
Edit: I have also tried the base_letter_type attribute, to no avail:
ctx_ddl.set_attribute('my_lexer','base_letter_type','SPECIFIC');
Set the BASE_LETTER_TYPE setting to SPECIFIC. Its default setting, GENERIC, will not apply language-specific rules.
From documentation:
It is important to understand that this affects the actual index content (index tokens are stored with "diacritics" removed), and not just the query as it runs. The full text index must be rebuilt for this to take effect.
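As a minimal sketch of that rebuild step, reusing the `my_lexer` preference and `IX_FULLTEXT_SUBJECT` index from the question: change the attribute on the existing preference, rebuild the index so it is re-tokenized, then look at the stored tokens. The last query assumes Oracle Text's usual `DR$<index_name>$I` token table naming. Whether SPECIFIC actually folds (ي - ى) and (ة - ه) or strips Tashkeel is not confirmed here (the question's edit reports that it did not help), so treat the token listing as a way to check rather than as a guaranteed fix.

```sql
-- Change the attribute on the existing lexer preference.
begin
  ctx_ddl.set_attribute('my_lexer', 'base_letter_type', 'SPECIFIC');
end;
/

-- Rebuild so existing rows are re-tokenized with the updated lexer.
-- Dropping and re-creating the index with the original CREATE INDEX
-- statement is an alternative if a rebuild is not suitable.
alter index IX_FULLTEXT_SUBJECT
  rebuild parameters('replace lexer my_lexer');

-- Inspect the tokens actually stored in the index
-- (assumes the standard DR$<index_name>$I token table).
select token_text
from   DR$IX_FULLTEXT_SUBJECT$I
order  by token_text;

-- Re-run the query that previously returned no rows.
select * from DOCUMENT where contains(SUBJECT, 'القاضى') > 0;
```

If the token listing still shows the original forms, the lexer is not applying the desired mapping and the rebuild alone will not change the query results.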