Bug with PostgreSQL full text search + multi-byte UTF-8 characters + plperl extension

110 Views Asked by At

I'm facing this problem with PostgreSQL 15.4 but not 14. It happens when all 3 conditions are satisfied:

  1. TEXT @@ to_tsquery() is executed.
  2. The TEXT column contains a multi-byte UTF-8 character.
  3. The statement is run after a simple plperl function call in the same session.

The error is:

ERROR:  invalid multibyte character for locale
HINT:  The server's LC_CTYPE locale is probably incompatible with the database encoding.

Reproduction:

$ createdb -U postgres test
$ psql -U postgres test
psql (15.4)
Type "help" for help.

test=# \l test
                                             List of databases
 Name |  Owner   | Encoding |   Collate   |    Ctype    | ICU Locale | Locale Provider | Access privileges 
------+----------+----------+-------------+-------------+------------+-----------------+-------------------
 test | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 |            | libc            | 
(1 row)

test=# CREATE EXTENSION IF NOT EXISTS plperl;
CREATE EXTENSION
test=# CREATE FUNCTION public.simple_quoter(text) RETURNS TEXT
    LANGUAGE plperl
    AS $_X$
return "'$_[0]'"
$_X$;
CREATE FUNCTION
test=# CREATE TABLE t (c TEXT);
CREATE TABLE
test=# INSERT INTO t VALUES ('Queensrÿche');
INSERT 0 1
test=# SELECT * FROM t WHERE to_tsquery('english','Queen:*') @@ to_tsvector('english',c);
ERROR:  invalid multibyte character for locale
HINT:  The server's LC_CTYPE locale is probably incompatible with the database encoding.

Now that the plperl extension and simple_quote() function are created, the simple repro.sql file can be created:

SELECT simple_quoter('hi');

SELECT c @@ to_tsquery('english','Queen:*') FROM t;

which also reproduces the error:

$ psql -U postgres test < repro.sql
 simple_quoter 
---------------
 'hi'
(1 row)

ERROR:  invalid multibyte character for locale
HINT:  The server's LC_CTYPE locale is probably incompatible with the database encoding.

When run in its own session, it is fine:

test=# SELECT c @@ to_tsquery('english','Queen:*') FROM t;
 ?column? 
----------
 t
(1 row)

When initdb was run, no special parameters were given for locale:

$ initdb -D /var/lib/postgres/data

The locale settings were the same as they are now:

$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
test=# SHOW CLIENT_ENCODING;
 client_encoding 
-----------------
 UTF8
(1 row)

test=# SHOW SERVER_ENCODING;
 server_encoding 
-----------------
 UTF8
(1 row)

Questions

Is this a bug with PostgreSQL 15 or am I overlooking something?

Is there a workaround (without having to not use plperl in the same session)?


Additional Details

OS: Current version of arch linux (2023-Sept-3)

PostgreSQL build: https://gitlab.archlinux.org/archlinux/packaging/packages/postgresql/-/blob/main/PKGBUILD pkgver=15.4 pkgrel=2

0

There are 0 best solutions below