Lexemes produced by simple dictionary in PostgreSQL are not lower cased for non-Latin languages after upgrade from v.9 to v.11

Question

Lexemes produced by simple dictionary in PostgreSQL are not lower cased for non-Latin languages after upgrade from v.9 to v.11

393 Views Asked by Dmytro Savochkin At 25 November 2020 at 11:30

In our project we use full-text search feature of PostgreSQL for different languages (English, German, Greek, Ukrainian, etc) using simple dictionary. On my local machine (macOS Mojave) I have PostgreSQL installed using Homebrew. After trying to upgrade PostgreSQL from version 9.5.23 to 11.10 (basically I just removed old version and installed it anew) I noticed that when I try to use to_tsvector function to convert my text to tsvector data type it stopped lower-casing the text when using non-Latin languages. It still lower-cases the text producing lower-cased lexemes if it's Latin-based (ASCII). The problem only appeared though after I had upgraded to version 11 (it still worked as expected after I had transitionally upgraded from 9.5.23 to 10.15).

Examples (on versions 11.10, 12.5, 13.1):

SELECT to_tsvector('simple', 'Офісний Менеджер');
'Менеджер':2 'Офісний':1

SELECT to_tsvector('simple', 'Ελληνικά');
'Ελληνικά':1

SELECT ts_lexize('simple', 'Менеджер');
{Менеджер}

SELECT ts_lexize('simple', 'Manager');
{manager}

This is how it was previously (on versions 9.5.23 and 10.15):

SELECT to_tsvector('simple', 'Офісний Менеджер');
'менеджер':2 'офісний':1

SELECT to_tsvector('simple', 'Ελληνικά')
'ελληνικά':1

SELECT ts_lexize('simple', 'Менеджер');
{менеджер}

SELECT ts_lexize('simple', 'Manager');
{manager}

I tried reading all the release notes but I couldn't see anything related to my problem. In the docs it is still stated that

The simple dictionary template operates by converting the input token to lower case...

so I assume there were no changes to the API itself.

I have a suspicion that in version 11 they changed some default configuration parameters which affect this but I couldn't find anything helpful.

Update:

Running \l in psql shell gives me the following.

# 11.10

    List of databases
        Name        |      Owner      | Encoding | Collate | Ctype |          Access privileges          
--------------------+-----------------+----------+---------+-------+-------------------------------------
 postgres           | dmytrosavochkin | UTF8     | C       | C     | 
 template0          | dmytrosavochkin | UTF8     | C       | C     | =c/dmytrosavochkin                 +
                    |                 |          |         |       | dmytrosavochkin=CTc/dmytrosavochkin
 template1          | dmytrosavochkin | UTF8     | C       | C     | =c/dmytrosavochkin                 +
                    |                 |          |         |       | 

# 9.5.23

     List of databases
        Name        |      Owner      | Encoding |   Collate   |    Ctype    |          Access privileges          
--------------------+-----------------+----------+-------------+-------------+-------------------------------------
 postgres           | dmytrosavochkin | UTF8     | en_US.UTF-8 | en_US.UTF-8 | 
 template0          | dmytrosavochkin | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/dmytrosavochkin                 +
                    |                 |          |             |             | dmytrosavochkin=CTc/dmytrosavochkin
 template1          | dmytrosavochkin | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/dmytrosavochkin                 +
                    |                 |          |             |             | dmytrosavochkin=CTc/dmytrosavochkin

Original Q&A

There are 1 best solutions below

**Dmytro Savochkin** · Accepted Answer · 2020-11-25T14:24:40.430000

Okay, thanks to the comment of @jjanes I got it (thank you!).

The problem happened to be in the set up of Homebrew postgresql formulae. Basically, when I was installing [email protected] it was running initdb /usr/local/var/[email protected] under the hood to initialize the cluster but when I switched to 11.10 it was running initdb --locale=C -E UTF-8 /usr/local/var/postgresql@10 setting C as a default locale and I didn't notice it.

To fix this I removed the initial data by rm -rf /usr/local/var/postgresql@12 and then re-initialized it manually with initdb --locale=en_US.UTF-8 -E UTF-8 /usr/local/var/postgresql@12. Now the default collate for all databases is en_US.UTF-8 and everything works exactly as before.

Lexemes produced by simple dictionary in PostgreSQL are not lower cased for non-Latin languages after upgrade from v.9 to v.11

Update:

There are 1 best solutions below

Related Questions in POSTGRESQL

Related Questions in FULL-TEXT-SEARCH

Related Questions in HOMEBREW

Related Questions in COLLATION

Related Questions in POSTGRESQL-11

Trending Questions

Popular # Hahtags

Popular Questions