Sphinx Search with non-ASCII characters


I am using Sphinx 3.1.1 (via ThinkingSphinx 4.4.1) with real-time indexes (i.e. they are not directly backed by MySQL tables).

On my development machine I can successfully search for strings with both ASCII and non-ASCII (accented UTF-8) characters.

The same Sphinx version is deployed on the staging machine, and I'm using the exact same config on both.

However, on the staging machine, my searches only return values if all the characters in the search string are ASCII ones. (Sphinx seems to return the right records in this case.) When the search string includes accented characters, Sphinx returns an empty set. (I can confirm that the record I'm looking for does exist.)

I have rebuilt the indices on both computers, but nothing changed. My understanding is that Sphinx 3.x is UTF-8-based (but that shouldn't matter anyway, as the two computers are running identical versions).

Where can it go wrong?

Note: I can provide the configs if they are of help but they are identical on both computers and thus I think they are irrelevant to the issue.


Update:

I have been able to make it consistently not work :). On my machine, if Sphinx runs in a Docker image, it works with English letters but fails with non-ASCII characters (anything outside the lower, 7-bit range).

More specifically it seems that it treats those characters as word separators.

A record in Sphinx's table:

mysql> select * from project_core WHERE id=22;
+------+----------------+-----------+--------------------+-----------------------+-------------------------------------------+
| id   | sphinx_deleted | tenant_id | sphinx_internal_id | sphinx_internal_class | name_sort                                 |
+------+----------------+-----------+--------------------+-----------------------+-------------------------------------------+
|   22 |              0 |         1 |                 11 | Project               | Example with gibberishßcharsöąůaround     |
+------+----------------+-----------+--------------------+-----------------------+-------------------------------------------+
1 row in set (0.00 sec)

So far so good. Note the 'ß' between 'gibberish' and 'chars'. Now:

mysql> select * from project_core WHERE id=22 AND MATCH('*ibber*');
+------+----------------+-----------+--------------------+-----------------------+-------------------------------------------+
| id   | sphinx_deleted | tenant_id | sphinx_internal_id | sphinx_internal_class | name_sort                                 |
+------+----------------+-----------+--------------------+-----------------------+-------------------------------------------+
|   22 |              0 |         1 |                 11 | Project               | Example with gibberishßcharsöąůaround     |
+------+----------------+-----------+--------------------+-----------------------+-------------------------------------------+
1 row in set (0.00 sec)

Still happy. But then:

mysql> select * from project_core WHERE id=22 AND MATCH('*shßch*');
Empty set, 1 warning (0.00 sec)

Uh-oh...

mysql> SHOW META;
+---------------+---------------------------------------------------------------+
| Variable_name | Value                                                         |
+---------------+---------------------------------------------------------------+
| warning       | Query word length is less than min infix length. word: 'ch*'  |
| total         | 0                                                             |
| total_found   | 0                                                             |
| time          | 0.000                                                         |
| keyword[0]    | *sh                                                           |
| docs[0]       | 0                                                             |
| hits[0]       | 0                                                             |
| keyword[1]    | ch*                                                           |
| docs[1]       | 0                                                             |
| hits[1]       | 0                                                             |
+---------------+---------------------------------------------------------------+
10 rows in set (0.00 sec)

So it seems to split the string on the 'ß'. If the string on both sides of the special character is long enough, Sphinx returns the record just fine:

mysql> select * from project_core WHERE id=22 AND MATCH('*gibberishßchar*');
+------+----------------+-----------+--------------------+-----------------------+-------------------------------------------+
| id   | sphinx_deleted | tenant_id | sphinx_internal_id | sphinx_internal_class | name_sort                                 |
+------+----------------+-----------+--------------------+-----------------------+-------------------------------------------+
|   22 |              0 |         1 |                 11 | Project               | Example with gibberishßcharsöąůaround     |
+------+----------------+-----------+--------------------+-----------------------+-------------------------------------------+
1 row in set (0.00 sec)

mysql> SHOW META;
+---------------+------------+
| Variable_name | Value      |
+---------------+------------+
| total         | 1          |
| total_found   | 1          |
| time          | 0.000      |
| keyword[0]    | *gibberish |
| docs[0]       | 2          |
| hits[0]       | 2          |
| keyword[1]    | char*      |
| docs[1]       | 2          |
| hits[1]       | 2          |
+---------------+------------+
9 rows in set (0.00 sec)
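
The splitting seen in the two SHOW META outputs above can be reproduced with a toy model. This is only a sketch, not Sphinx's actual tokenizer: it assumes that only ASCII letters, digits, and underscore count as word characters, and it assumes a min_infix_len of 3 (the real value is not shown in this question; the warning only tells us that 'ch*' is shorter than it). The names split_keywords, too_short, and MIN_INFIX_LEN are made up for illustration.

```python
import re

# Assumption: the real min_infix_len is not shown in the question; 3 is a guess
# that is consistent with the warning about 'ch*'.
MIN_INFIX_LEN = 3


def split_keywords(query):
    """Split a wildcard query on every character that is not an ASCII
    letter, digit, underscore, or '*' (toy model of a limited charset)."""
    return re.findall(r"[0-9A-Za-z_*]+", query)


def too_short(keyword, min_infix_len=MIN_INFIX_LEN):
    """True if the keyword, without its wildcards, is shorter than the limit."""
    return len(keyword.strip("*")) < min_infix_len


if __name__ == "__main__":
    for query in ("*shßch*", "*gibberishßchar*"):
        keywords = split_keywords(query)
        flagged = [k for k in keywords if too_short(k)]
        print(query, "->", keywords, "too short:", flagged)
```

Under these assumptions the first query breaks into two fragments that are both below the limit, while the second breaks into two fragments that are long enough, which matches the keyword[0] and keyword[1] entries reported by SHOW META.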

So it looks like I need to convince Sphinx somehow to treat all UTF-8 characters as part of a word rather than as separators.

Based on these, can you perhaps provide a solution or hints?


Update 2: The exact same thing happens with Manticore (latest Docker image). So it seems to be something in my config rather than in Sphinx/Manticore itself.


Update 3: I got it working with all kinds of characters on Manticore in Docker, using the hint from @Manticore Search (i.e. adding charset_table = non_cjk to the index). I'll move the whole setup over to the other machine in about a day and see what happens. As it's mostly in a container, I'm optimistic. (Still no clue, though, as to what caused the original hiccup.)
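
In case it helps anyone who would rather list the missing characters explicitly than rely on the non_cjk alias: a small sketch that scans sample text and prints charset_table-style entries (U+XX for lowercase letters, U+XX->U+YY to fold an uppercase letter to its lowercase form). The helper name charset_entries is hypothetical, and the output only covers characters that actually occur in the sample, so treat it as a starting point and check it against the charset_table documentation for your version.

```python
def charset_entries(sample):
    """Return charset_table-style entries for the non-ASCII letters in sample."""
    entries = []
    for ch in sorted(set(sample)):
        # ASCII and non-letters are skipped; this sketch only looks for
        # letters a limited default table might be missing.
        if ord(ch) < 128 or not ch.isalpha():
            continue
        lower = ch.lower()
        if len(lower) == 1 and lower != ch:
            # Uppercase letter: map it to its lowercase code point.
            entries.append("U+%X->U+%X" % (ord(ch), ord(lower)))
        else:
            entries.append("U+%X" % ord(ch))
    return entries


if __name__ == "__main__":
    sample = "Example with gibberishßcharsöąůaround"
    # Entries to append to an existing charset_table value.
    print(", ".join(charset_entries(sample)))
```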
