Creating a filtering thesaurus in PostgreSQL


I am using PostgreSQL for full text search and am having trouble creating a filtering thesaurus, modeled on the filtering dictionaries described in the PostgreSQL documentation on full text search dictionaries (section 12.6).

I understand that the documentation only discusses a filtering dictionary: a program that accepts a token as input and returns a single lexeme with the TSL_FILTER flag set, so that the original token is replaced by a new token that is passed on to subsequent dictionaries. My question is: is it possible to create a thesaurus that accepts a phrase (1-3 tokens) and returns a single lexeme with the TSL_FILTER flag set, which is then passed to a subsequent dictionary or thesaurus? If so, what am I doing wrong?
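
For comparison, the stock unaccent extension is a single-token filtering dictionary of the kind the documentation describes. The following is only a minimal sketch, assuming the unaccent extension is available on the server; fr_demo is a hypothetical configuration name introduced for illustration.

```sql
-- unaccent is a filtering dictionary: its output is handed to the next
-- dictionary in the list instead of ending the lookup.
CREATE EXTENSION IF NOT EXISTS unaccent;

-- fr_demo is a hypothetical configuration name used only for this sketch.
CREATE TEXT SEARCH CONFIGURATION fr_demo ( COPY = pg_catalog.french );

ALTER TEXT SEARCH CONFIGURATION fr_demo
  ALTER MAPPING FOR hword, hword_part, word
  WITH unaccent, french_stem;

-- The accent-stripped token should then be stemmed by french_stem.
SELECT to_tsvector('fr_demo', 'Hôtels de la Mer');
```

The difference in my case is that the filtering step has to match a multi-token phrase, not a single token.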

I wrote a new extension called dict_fths, which is essentially the same as the default thesaurus that PostgreSQL provides, except that every lexeme a phrase is mapped to has the TSL_FILTER flag set. I then create two text search dictionaries, fths and second_ths, as follows:

# CREATE EXTENSION dict_fths;
# CREATE TEXT SEARCH DICTIONARY fths (
    template=fths_template, 
    dictionary=english_stem, 
    dictfile=fths_sample
);
# CREATE TEXT SEARCH DICTIONARY second_ths (
    template=thesaurus,
    dictionary=english_stem,
    dictfile=second_ths
);
# CREATE TEXT SEARCH CONFIGURATION test ( COPY=pg_catalog.english );
# ALTER TEXT SEARCH CONFIGURATION test 
  ALTER MAPPING FOR asciihword, asciiword, hword, hword_asciipart, hword_part, word
  WITH fths, second_ths, english_stem;
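
One way to confirm that the mapping was applied in the intended order is to read it back from the system catalog. This is a sketch, assuming the test configuration still uses the default parser (it was copied from pg_catalog.english):

```sql
-- List the dictionaries consulted for each word-like token type,
-- in the order the configuration tries them (mapseqno).
SELECT t.alias,
       m.mapseqno,
       m.mapdict::regdictionary AS dictionary
FROM pg_ts_config_map AS m
JOIN ts_token_type('default') AS t ON t.tokid = m.maptokentype
WHERE m.mapcfg = 'test'::regconfig
  AND t.alias IN ('asciihword', 'asciiword', 'hword',
                  'hword_asciipart', 'hword_part', 'word')
ORDER BY t.alias, m.mapseqno;
```

Each listed token type should show fths, second_ths, english_stem with increasing mapseqno values.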

dict_fths behaves correctly when the mapping is between a single token and a single lexeme.

fths_sample.ths entries:

ski : sport

second_ths.ths entries:

sport competition : *sporting-event

Output (correct, correct):

# select to_tsvector('test', 'ski');
    to_tsvector
  ---------------
   'sport':1
(1 row)

# select to_tsvector('test', 'ski competition');
    to_tsvector
  ---------------
   'sporting-event':1
(1 row)

However, when I edit the .ths files to include phrases, I no longer get the output I want:

fths_sample.ths entries:

ski : sport
ski jumping : sport

Output (correct, correct, incorrect, incorrect):

# select to_tsvector('test','ski');
    to_tsvector
  ---------------
   'sport':1
(1 row)

# select to_tsvector('test','ski jumping');
    to_tsvector
  ---------------
   'sport':1
(1 row)

# select to_tsvector('test', 'ski competition');
    to_tsvector
  ---------------
   'sport':1 'competit':2
(1 row)

# select to_tsvector('test', 'ski jumping competition');
    to_tsvector
  ---------------
   'sport':1 'competit':2
(1 row)

Even after I reduce fths_sample.ths to just the phrase entry, the output is still incorrect:

fths_sample.ths contains:

ski jumping : sport

Here is the output (correct, incorrect):

# select to_tsvector('test', 'ski jumping');
    to_tsvector
  ---------------
   'sport':1
(1 row)

# select to_tsvector('test', 'ski jumping competition');
    to_tsvector
  ---------------
   'sport':1 'competit':2
(1 row)

So it seems that the filtering thesaurus fails to pass its lexeme on to the next dictionary when the matched entry 1) has more than one token, or 2) is part of a longer phrase.
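
For anyone trying to reproduce this, all four cases can be run in one statement. This sketch assumes the test configuration and the .ths files shown above. It uses to_tsvector rather than ts_lexize, because ts_lexize treats its input as a single token and so cannot exercise phrase entries.

```sql
-- Run every test phrase through the "test" configuration in one pass.
SELECT phrase,
       to_tsvector('test'::regconfig, phrase) AS lexemes
FROM (VALUES
        ('ski'),
        ('ski jumping'),
        ('ski competition'),
        ('ski jumping competition')
     ) AS t(phrase);
```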
