Searching for an exact word in multiple languages efficiently using MYSQL

210 Views Asked by At

I have a simple database table which stores id, language and text. I want to do a search for any word/character and look for an exact match. The catch is I have over 10 million rows.

e.g. search for the word i would return rows with the text containing "i" like "information was bad" and "I like dogs".

This also needs to work for stopwords and other languages which don't use whitespace.

My first immediate thought is just to do LOWERCASE(text) LIKE %word% with a lowercase index on text but after googling it seems like it would do a full table scan and I am using planetscale so I have to pay for a fulltable scan which simply cannot work as I will run out of usage quick.

My next thought was a BOOLEAN full text search but then I run into the issue of stop words being ignored in english and having to use an ngram parser for languages like Chinese and then having to work out what language is being submitted and what index should be used.

Does anyone have any better ideas?

1

There are 1 best solutions below

3
On
  • Use CHARACTER SET utf8mb4
  • Use the latest available COLLATION for that charset -- utf8mb4_unicode_520_ci or utf8mb4_0900_ai_ci or something else for the latest MariaDB.
  • Do not use LOWERCASE or LOWER (etc), instead, let the collation take care of such (note the "ci" in the collation name).
  • Yes, you may need ngram instead of FULLTEXT for certain Asian languages.
  • The stoplist can be turned off.
  • The min word length can be changed -- at a cost.
  • Your app code can look at the encoding to decide whether to use ngram of fulltext.
  • This provides a list of hex values: http://mysql.rjweb.org/doc.php/charcoll#diagnosing_charset_issues Note that E3-EA is mostly "wordless" languages.
  • I recommend using app code for making decisions and build the SQL query. It may even degenerate to LIKE '%word%' or REGEXP '\\bword\\b' in some cases. Note that REGEXP is generally slower than LIKE, but provides "word boundary" testing if the search strings contain multiple words.
  • When applicable, FULLTEXT is significantly faster than any other technique.
  • When doing WHERE ... AND MATCH ..., the match will be performed first, even if the rest of the WHERE is more selective.
  • LIKE '%...' and all(?) REGEXP tests will. read and test every one of your 10M rows (unless there is a LIMIT).