MySQL regexp with Japanese furigana


I have a large database (~2700 entries) of vocabulary. Each row contains an English word, the Japanese equivalent, and other data not relevant to this problem. I have created a facility to search and display the results in a table, but I'm having a small problem with the furigana.

Japanese sentences are written with a mix of Chinese characters (kanji) and the phonetic scripts (kana). Not everyone can read every kanji, and sometimes the same kanji has multiple readings. In those cases, the phonetic kana is placed above the kanji. This is called furigana:

[Image: furigana printed above kanji]

I present these phonetic readings to the user with the <ruby> tag in the following format:

<ruby>
  <rb>勉強</rb>    <!-- the kanji -->
  <rp>(</rp>      <!-- opening parenthesis, shown only where ruby is not supported -->
  <rt>べんきょう</rt>   <!-- the phonetic kana itself -->
  <rp>)</rp>      <!-- closing parenthesis, shown only where ruby is not supported -->
</ruby>する        <!-- the last part is already phonetic so needs no ruby -->

The strings are stored in my database like this:

勉強(べんきょう)する

where anything between the parentheses is the reading for the kanji immediately preceding it. Storing the strings this way allows fallback for browsers that don't support ruby tags (such as, amazingly, Firefox).
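To make the storage format concrete, here is a minimal Python sketch of how a stored string could be turned into the ruby markup shown earlier. It is not the asker's actual code: the function name `to_ruby` and the regular expression are hypothetical, and it assumes that a reading always applies to the unbroken run of kanji directly before the parenthesis and that both ASCII and full-width parentheses may appear.

```python
import html
import re

# Assumption: a reading belongs to the run of kanji (including the
# repetition mark 々, U+3005) directly before the opening parenthesis.
# Both ASCII and full-width parentheses are accepted.
FURIGANA = re.compile(r"([\u4e00-\u9fff\u3005]+)[(（]([^()（）]*)[)）]")


def to_ruby(text: str) -> str:
    """Turn 'kanji(reading)' spans into <ruby> markup and escape the rest."""
    parts = []
    pos = 0
    for match in FURIGANA.finditer(text):
        # Text before the annotated kanji is already phonetic: copy it as is.
        parts.append(html.escape(text[pos:match.start()]))
        parts.append(
            "<ruby><rb>{}</rb><rp>(</rp><rt>{}</rt><rp>)</rp></ruby>".format(
                html.escape(match.group(1)), html.escape(match.group(2))
            )
        )
        pos = match.end()
    parts.append(html.escape(text[pos:]))
    return "".join(parts)


print(to_ruby("勉強(べんきょう)する"))
# → <ruby><rb>勉強</rb><rp>(</rp><rt>べんきょう</rt><rp>)</rp></ruby>する
```

Strings without any parenthesised reading pass through unchanged apart from HTML escaping.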

All of this is fine, but the problem comes when a user is searching. If they search for

勉強

Then it will show up. But if they try to search for

勉強する

it won't work, because in the database there is a string defining the phonetic pronunciation in the middle.

The full-width parentheses in the above example are used only to denote this phonetic script. Given this, I am looking for a way to essentially tell the MySQL search to ignore anything it finds between rounded parentheses. I have a basic knowledge of how to do most simple queries in MySQL, but I'm certainly not an expert. I have looked at the docs, but (to me, at least) they are not very user-friendly. Perhaps not very beginner-friendly. I thought it might be possible with some sort of construction involving a regular expression, but I can't figure out how.

Is there a way to do what I want?

2 answers

Best answer

As noted in "How to do a regular expression replace in MySQL?", this seems to be impossible without a user-defined function (you can only replace explicit sequences).

A rather dirty solution: tolerate anything between consecutive Japanese characters, as in LIKE '勉%強%す%る'. I never suggested that.

Or, you can keep an optional field in your table that potentially contains a version with furigana.

Another answer

I would advise against using LIKE queries because you would need a % between every single character (since you don't know where furigana will occur), and that could create false positives (for example, if a valid character appeared between 勉 and 強).
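The false-positive risk can be reproduced with a small Python sketch. It uses an in-memory SQLite database purely as a stand-in for MySQL, since the `%` wildcard in LIKE behaves the same way for this purpose. The table name `vocab`, the helper `like_pattern`, and the second row are made up for illustration, and the helper assumes the search term contains no `%` or `_`.

```python
import sqlite3


def like_pattern(term: str) -> str:
    """Put a % between every character, plus one at each end for substring search."""
    return "%" + "%".join(term) + "%"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vocab (japanese TEXT)")
conn.executemany(
    "INSERT INTO vocab (japanese) VALUES (?)",
    [
        ("勉強(べんきょう)する",),                # the entry we want to find
        ("勉学(べんがく)を強化(きょうか)する",),  # hypothetical unrelated entry
    ],
)

rows = [
    row[0]
    for row in conn.execute(
        "SELECT japanese FROM vocab WHERE japanese LIKE ? ORDER BY rowid",
        (like_pattern("勉強する"),),
    )
]

# Both rows come back: the second one contains 勉, 強, す and る in that
# order, even though it does not contain the word 勉強する.
print(rows)
```

The wildcard pattern cannot tell a furigana reading apart from ordinary text sitting between the characters, which is why the second row matches.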

As @Jill-Jênn Vie briefly mentioned, I'd suggest adding a new column to hold the text with furigana.

I'm working on an application which performs searches on Korean text. The problem is that Korean conjugation changes the characters. For example:

하다 + 아요 = 해요

"하다" is the verb "to do" in dictionary form and "아요" is the standard polite-form conjugation. Presumably you are a Japanese speaker, so you know how common such polite forms can be! Note how the 하 changes to 해. Obviously, if users try to search for "하다" in the string "해요", they won't find it. But if users want to see all instances of "하다" in the corpus, we need to be able to return it.

Our solution was two columns: "form" (conjugated form) and "analytic_string" which would represent "해요" as "하다+아요". You could take a similar approach and make a second column containing your sentence without furigana.

The main disadvantages of this approach are that you're effectively doubling your database size and that you need to pay special attention when inputting data so that the two columns hold the same data (I found a few rows in my database where the form and the analytic string have different words in them). The advantage is that you can easily search your data while ignoring furigana.

It's your standard "size vs. performance" trade-off. Which is more important: size of the database or execution time? Any other solution I can think of involves returning too many rows and then individually analyzing them.
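A minimal Python sketch of the two-column idea follows, again with an in-memory SQLite database standing in for MySQL. The column name `japanese_plain` and the helpers `strip_furigana`, `add_entry`, and `search` are hypothetical. It relies on the question's statement that parentheses are used only for furigana. Deriving the plain column in code at insert time, rather than typing it by hand, is one way to avoid the two columns drifting apart. On MySQL 8.0 and later, the built-in REGEXP_REPLACE() function might allow the same stripping inside the database instead.

```python
import re
import sqlite3

# Assumption: parentheses (ASCII or full-width) only ever enclose furigana.
READING = re.compile(r"[(（][^()（）]*[)）]")


def strip_furigana(text: str) -> str:
    """Remove every parenthesised reading, e.g. 勉強(べんきょう)する -> 勉強する."""
    return READING.sub("", text)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE vocab (english TEXT, japanese TEXT, japanese_plain TEXT)"
)


def add_entry(english: str, japanese: str) -> None:
    """Store the annotated string and a derived furigana-free copy together."""
    conn.execute(
        "INSERT INTO vocab (english, japanese, japanese_plain) VALUES (?, ?, ?)",
        (english, japanese, strip_furigana(japanese)),
    )


def search(term: str) -> list:
    """Search the plain column, but return the annotated string for display."""
    cursor = conn.execute(
        "SELECT japanese FROM vocab WHERE japanese_plain LIKE ?",
        ("%" + term + "%",),
    )
    return [row[0] for row in cursor]


add_entry("to study", "勉強(べんきょう)する")
print(search("勉強する"))
# → ['勉強(べんきょう)する']
```

Searching by reading (for example べんきょう) would still need to go against the original column, since the plain copy no longer contains it.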