I have a simple database table that stores `id`, `language`, and `text`. I want to search for any word/character and find exact matches. The catch is that I have over 10 million rows.

For example, searching for the word "i" would return rows whose text contains "i", like "information was bad" and "I like dogs".

This also needs to work for stopwords and for languages that don't use whitespace.
My first immediate thought was just to do `LOWER(text) LIKE '%word%'` with a lowercase index on `text`, but after googling it seems this would do a full table scan. I'm using PlanetScale, where I pay per row read, so a full table scan simply cannot work; I would run out of usage quickly.

My next thought was a BOOLEAN full-text search, but then I run into stopwords being ignored in English, needing an ngram parser for languages like Chinese, and having to work out which language is being submitted and which index should be used.
Does anyone have any better ideas?
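For reference, here is roughly what the table and the first approach look like (a minimal sketch; the table and column names are illustrative, not my real schema):

```sql
-- Hypothetical schema (names illustrative)
CREATE TABLE messages (
  id BIGINT AUTO_INCREMENT PRIMARY KEY,
  language VARCHAR(8) NOT NULL,  -- e.g. 'en', 'zh'
  text TEXT NOT NULL
) CHARACTER SET utf8mb4;

-- First idea: case-insensitive substring match.
-- The leading wildcard defeats any B-tree index, so this reads all ~10M rows.
SELECT id, text
FROM messages
WHERE LOWER(text) LIKE '%word%';
```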
* Use `CHARACTER SET utf8mb4` and, for that charset, a case-insensitive `COLLATION`: `utf8mb4_unicode_520_ci`, `utf8mb4_0900_ai_ci`, or something else for the latest MariaDB.
* Don't use `LOWERCASE` or `LOWER` (etc.); instead, let the collation take care of case folding (note the "ci" in the collation name).
* Use `FULLTEXT` (with the ngram parser) for certain Asian languages.
* `LIKE '%word%'` or `REGEXP '\\bword\\b'` work in some cases. Note that `REGEXP` is generally slower than `LIKE`, but provides "word boundary" testing if the search strings contain multiple words.
* `FULLTEXT` is significantly faster than any other technique. With `WHERE ... AND MATCH ...`, the match will be performed first, even if the rest of the `WHERE` is more selective.
* `LIKE '%...'` and all(?) `REGEXP` tests will read and test every one of your 10M rows (unless there is a `LIMIT`).
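A rough sketch of what the `FULLTEXT` setup could look like, assuming MySQL 8.0 (table/column names are illustrative; the ngram parser and the `innodb_ft_*` settings are MySQL-specific, so adjust for MariaDB):

```sql
-- Let a "ci" collation handle case folding; no LOWER() needed in queries.
ALTER TABLE messages
  CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;

-- Default FULLTEXT index for whitespace-delimited languages.
-- Note: short tokens like "i" are dropped unless innodb_ft_min_token_size
-- (default 3) is lowered, and stopword filtering can be disabled with
-- innodb_ft_enable_stopword = OFF; both require rebuilding the index.
ALTER TABLE messages ADD FULLTEXT INDEX ft_text (text);

-- For languages without whitespace (e.g. Chinese), a separate table with an
-- ngram-parsed index is one common approach:
CREATE TABLE messages_cjk (
  id BIGINT PRIMARY KEY,
  text TEXT NOT NULL,
  FULLTEXT INDEX ft_text_ngram (text) WITH PARSER ngram
) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;

-- FULLTEXT search: the MATCH is evaluated first, even when the rest of
-- the WHERE is more selective.
SELECT id, text
FROM messages
WHERE MATCH(text) AGAINST('+word' IN BOOLEAN MODE)
  AND language = 'en';

-- Word-boundary fallback (MySQL 8.0 ICU regex syntax); still reads every row.
SELECT id, text
FROM messages
WHERE text REGEXP '\\bword\\b';
```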