How to separate Phonetic, Word Break and Word Join keywords from a list of keywords using Python?


I want to separate phonetic, word break and word join keywords from a list of keywords using Python. Example:

Input List:

rice1kg
oil
cooking oil
oliv oil
flour5kg
buther
baking povder
Leg umes

Expected Output:

rice 1kg - word break
olive oil - phonetic
flour 5kg - word break
butter - phonetic
baking powder - phonetic
legumes - word join

There is 1 solution below

Rufus

This is a very complicated problem without a definite answer. In its simplest form, it comes down to sorting strings into three categories:

  • Word Join - A single English word wrongly split into two parts by a space character, e.g. wo rd
  • Word Break - Two English words that should be separated by a space character but are run together, e.g. wordbreak
  • Phonetic - One or more English words spelled incorrectly, but in a way that would be pronounced the same as the correct spelling, e.g. toylet
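
As a rough illustration of the first two definitions, here is a minimal Python sketch. It assumes a tiny hard-coded vocabulary (VOCAB) and a simple regular expression for measurements such as 5kg (MEASUREMENT); both are placeholders made up for this example, and a real attempt would need a proper dictionary. The function returns every label that fits, because more than one can apply to the same string.

```python
import re

# Hypothetical toy vocabulary; a real run would need a full English word list.
VOCAB = {"rice", "oil", "cooking", "olive", "flour", "butter",
         "baking", "powder", "legumes", "leg"}

# Assumption: a measurement is a number followed directly by a short unit.
MEASUREMENT = re.compile(r"\d+(?:\.\d+)?(?:kg|g|mm|cm|ml|l)")


def is_known(token):
    """True if the token is a vocabulary word or looks like a measurement."""
    return token in VOCAB or MEASUREMENT.fullmatch(token) is not None


def candidate_labels(keyword):
    """Return every (label, suggestion) pair that fits the keyword."""
    text = keyword.lower().strip()
    parts = text.split()
    labels = []
    if all(is_known(p) for p in parts):
        labels.append(("correct", text))
    # Word join: removing the spaces yields a single known word.
    if len(parts) > 1 and "".join(parts) in VOCAB:
        labels.append(("word join", "".join(parts)))
    # Word break: some split point yields two known tokens.
    if len(parts) == 1:
        for i in range(1, len(text)):
            left, right = text[:i], text[i:]
            if is_known(left) and is_known(right):
                labels.append(("word break", f"{left} {right}"))
    return labels


for kw in ["rice1kg", "Leg umes", "cooking oil", "oliv oil"]:
    print(kw, "->", candidate_labels(kw))
# rice1kg -> [('word break', 'rice 1kg')]
# Leg umes -> [('word join', 'legumes')]
# cooking oil -> [('correct', 'cooking oil')]
# oliv oil -> []
```

Anything that comes back with an empty list, like oliv oil, is neither known nor fixable by adding or removing a space, so it would have to fall through to some kind of spelling or phonetic check.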

There are a large number of problems that would need to be solved to complete this categorisation. I'll list some of them below:

  1. How would you categorise the string pullover? Is it correctly spelled (pullover: a knitted garment), or is it a word break (pull over)? This is just one example, but there are many cases where a string fits logically into more than one category.
  2. How are you defining a correctly spelled word? Is it any word found in an English dictionary? If so, what about measurements, like 200mm, 5kg, 20", etc.?
  3. How are you defining phonetic similarity? One approach would be to split each word into syllables and then translate each syllable into the International Phonetic Alphabet, which is itself a difficult task and has its own problems. Is the sound oo pronounced as it is in the word food, or as in the word good? The English language is stitched together from many sources, with borrowed words and linguistic roots in multiple places, so its pronunciation doesn't follow a fixed set of rules. This, in my opinion, is what makes this task so difficult.
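
To make the third problem concrete, here is a small sketch of a much cruder alternative to a full IPA transcription: a phonetic key in the style of classic American Soundex. This is a technique swapped in purely for illustration, not something taken from the question, and the function name soundex is just a local helper. Two spellings are treated as sounding alike when their keys match.

```python
# Consonant groups used by classic Soundex; vowels and y have no code.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return a 4-character Soundex-style key for a single word."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate repeated codes
        code = CODES.get(ch, "")
        if code and code != prev:
            key += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (key + "000")[:4]


for wrong, right in [("buther", "butter"), ("oliv", "olive"),
                     ("povder", "powder")]:
    print(wrong, soundex(wrong), right, soundex(right))
# buther B360 butter B360
# oliv O410 olive O410
# povder P136 powder P360
```

The keys match for buther and oliv, but not for povder, because the key encodes v as a consonant while skipping w entirely. Even a well-known phonetic scheme only catches some of the cases in the example list.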

I can't offer solutions to these problems, but knowing what challenges you'll face when you approach this task is a good way to start.