I am using PostgreSQL for full text search and am having trouble creating a filtering thesaurus along the lines of what the PostgreSQL documentation describes for full text search dictionaries (section 12.6).
I understand that the documentation only discusses a filtering dictionary, which is a program that accepts a token as input and returns a single lexeme with the TSL_FILTER flag set, to replace the original token with a new token to be passed to subsequent dictionaries. My question is: is it possible to create a thesaurus, which accepts a phrase (1-3 tokens) and returns a single lexeme with the TSL_FILTER flag set which is passed to a subsequent dictionary or thesaurus? If so, what am I doing wrong?
I attempted to create a new extension called dict_fths, which is basically the same as the default thesaurus PostgreSQL offers, except that each lexeme a phrase is mapped to has the TSL_FILTER flag set. I created two text search dictionaries called fths and second_ths in the following way:
# CREATE EXTENSION dict_fths;
# CREATE TEXT SEARCH DICTIONARY fths (
template=fths_template,
dictionary=english_stem,
dictfile=fths_sample
);
# CREATE TEXT SEARCH DICTIONARY second_ths (
template=thesaurus,
dictionary=english_stem,
dictfile=second_ths
);
# CREATE TEXT SEARCH CONFIGURATION test ( COPY=pg_catalog.english );
# ALTER TEXT SEARCH CONFIGURATION test
ALTER MAPPING FOR asciihword, asciiword, hword, hword_asciipart, hword_part, word
WITH fths, second_ths, english_stem;
dict_fths behaves correctly when the mapping is between a single token and a single lexeme.
fths_sample.ths entries:
ski : sport
second_ths.ths entries:
sport competition : *sporting-event
Output (correct, correct):
# select to_tsvector('test', 'ski');
to_tsvector
---------------
'sport':1
(1 row)
# select to_tsvector('test', 'ski competition');
to_tsvector
---------------
'sporting-event':1
(1 row)
However, when I edited the ths files to include phrases, I no longer got the output I wanted:
fths_sample.ths entries:
ski : sport
ski jumping : sport
Output (correct, correct, incorrect, incorrect):
# select to_tsvector('test','ski');
to_tsvector
---------------
'sport':1
(1 row)
# select to_tsvector('test','ski jumping');
to_tsvector
---------------
'sport':1
(1 row)
# select to_tsvector('test', 'ski competition');
to_tsvector
---------------
'sport':1 'competit':2
(1 row)
# select to_tsvector('test', 'ski jumping competition');
to_tsvector
---------------
'sport':1 'competit':2
(1 row)
Even after I reduced fths_sample.ths to only the phrase entry, the output is still incorrect:
fths_sample.ths contains:
ski jumping : sport
Here is the output (correct, incorrect):
# select to_tsvector('test', 'ski jumping');
to_tsvector
---------------
'sport':1
(1 row)
# select to_tsvector('test', 'ski jumping competition');
to_tsvector
---------------
'sport':1 'competit':2
(1 row)
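So it seems that the filtering thesaurus fails to pass its lexeme on to the next dictionary when the matched entry 1) has more than one token, or 2) is the start of a longer phrase in the same thesaurus (as "ski" is of "ski jumping").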
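To rule out second_ths itself as the culprit, a check along these lines could be run. This is only a sketch: test_second is a hypothetical configuration name that is not part of my setup above, and it assumes the second_ths dictionary has already been created as shown earlier. It maps the same token types but leaves fths out of the chain, so the literal phrase from second_ths.ths goes straight to second_ths.

```sql
-- Hypothetical configuration (name is an assumption) that skips fths entirely
CREATE TEXT SEARCH CONFIGURATION test_second ( COPY = pg_catalog.english );

ALTER TEXT SEARCH CONFIGURATION test_second
    ALTER MAPPING FOR asciihword, asciiword, hword, hword_asciipart, hword_part, word
    WITH second_ths, english_stem;

-- Feed second_ths the exact phrase from second_ths.ths, with no filtering step in front
SELECT to_tsvector('test_second', 'sport competition');
```

If this returns the same 'sporting-event' lexeme as the first example, second_ths is matching its phrase on its own, and the problem would be in how the lexeme from fths is handed over.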