Is there a way to map English letter(s) (or graphemes) in a word to the corresponding phoneme(s) in Python?

e.g. let's assume we have something like:

WOULD | YOU | LIKE | A | CUP | OF | TEA

w ʊ d | j uː | l a ɪ k | ə | k ʌ p | ʊ v | t iː

W UH D | Y UW | L AY K | AH | K AH P | AH V | T IY

Besides solving the P2G (phoneme-to-grapheme) problem, I also want a mapping between each phoneme and its corresponding grapheme (a letter or group of letters). Could you please help me understand whether I can get this P2G correspondence for English using Python tools? Thanks a bunch in advance!

There is 1 best solution below

You can use the CMU Pronouncing Dictionary together with a spell checker such as aspell or enchant. The CMU Pronouncing Dictionary is a list of English words and their pronunciations, where each pronunciation is a list of phonemes written in ARPAbet symbols (vowels also carry a stress digit, e.g. AH1).

The pronunciation dictionary can be downloaded in text format here: http://www.speech.cs.cmu.edu/cgi-bin/cmudict

The raw text is not in a very useful format, so it is more convenient to get it already parsed. I used the copy of the dictionary that ships as an NLTK corpus (nltk.corpus.cmudict).
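If you do want to work from the raw text, a minimal parsing sketch could look like the one below. It assumes the layout commonly seen in the downloadable file: one entry per line, the word followed by its phonemes, comment lines starting with ";;;", and alternate pronunciations marked with a suffix such as "(1)". The sample lines and the parse_cmudict helper are illustrative, not part of any library.

```python
import re
from collections import defaultdict

# Sample lines in the style of the raw CMUdict text file (assumed layout).
SAMPLE = """\
;;; comment lines start with three semicolons
A  AH0
A(1)  EY1
CUP  K AH1 P
TEA  T IY1
"""


def parse_cmudict(text):
    """Return {word: [pronunciation, ...]}; each pronunciation is a list of phonemes."""
    pronunciations = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";;;"):
            continue  # skip blank lines and comments
        head, *phones = line.split()
        # Drop a variant marker such as "(1)" and normalise the word to lower case.
        word = re.sub(r"\(\d+\)$", "", head).lower()
        pronunciations[word].append(phones)
    return dict(pronunciations)


parsed = parse_cmudict(SAMPLE)
print(parsed["cup"])  # [['K', 'AH1', 'P']]
```

Grouping variants under one key keeps every pronunciation of a word together, which is the same shape NLTK gives you.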

You can also use the enchant spell checker to check whether a string of letters is a word. This is useful because the CMU Pronouncing Dictionary does not contain every word and has some errors.

If you have enchant installed, you can use the following code to test it:

import enchant
import nltk
from nltk.corpus import cmudict

# Download only the CMU Pronouncing Dictionary corpus (no interactive downloader)
nltk.download('cmudict')

entries = cmudict.entries()      # list of (word, [phoneme, ...]) pairs
pronunciations = cmudict.dict()  # word -> list of pronunciations

# Get the set of English phonemes, with stress digits (0, 1, 2) stripped
phonemes = sorted({p.rstrip('012') for _, ps in entries for p in ps})

# Get the letters that occur in the dictionary's words
# (single letters only; multi-letter graphemes need an alignment step)
graphemes = sorted({c for w, _ in entries for c in w if c.isalpha()})


def strip_stress(pron):
    """Remove stress digits so 'AH1' compares equal to 'AH'."""
    return [p.rstrip('012') for p in pron]


# Check the words 'cup' and 'tea' with the CMU Pronouncing Dictionary
assert ['K', 'AH', 'P'] in [strip_stress(p) for p in pronunciations['cup']]
assert ['T', 'IY'] in [strip_stress(p) for p in pronunciations['tea']]

# Check the words 'cup' and 'tea', plus a made-up string, with enchant
d = enchant.Dict('en_US')
assert d.check('cup')
assert d.check('tea')
assert not d.check('xqzt')
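
The checks above confirm that a word and its pronunciation exist, but they do not yet pair each phoneme with the letters that spell it. One possible way to get that pairing is sketched below: given a table of graphemes each phoneme is allowed to spell, search for a split of the word that uses exactly one grapheme per phoneme. The ALLOWED table and the align function are hypothetical and hand-written for the example sentence only; a real table would be much larger or learned from data. Silent letters are handled crudely here by folding them into a neighbouring grapheme (for example "ke" for K in "like").

```python
from functools import lru_cache

# Hypothetical, hand-written table: graphemes each ARPAbet phoneme may spell.
# It covers only the example sentence from the question.
ALLOWED = {
    "W": {"w"}, "UH": {"oul", "oo", "u"}, "D": {"d"},
    "Y": {"y"}, "UW": {"ou", "oo"},
    "L": {"l"}, "AY": {"i"}, "K": {"k", "ke", "c"},
    "AH": {"a", "u", "o"}, "P": {"p"},
    "V": {"f", "v"}, "T": {"t"}, "IY": {"ea", "ee"},
}


def align(word, phonemes):
    """Split `word` into one grapheme per phoneme, or return None if impossible."""
    word = word.lower()
    phonemes = tuple(p.rstrip("012") for p in phonemes)  # drop stress digits

    @lru_cache(maxsize=None)
    def solve(i, j):
        # i: letters consumed so far, j: phonemes consumed so far
        if j == len(phonemes):
            return () if i == len(word) else None
        # Try shorter graphemes first; ties are broken alphabetically.
        candidates = sorted(ALLOWED.get(phonemes[j], ()), key=lambda g: (len(g), g))
        for grapheme in candidates:
            if word.startswith(grapheme, i):
                rest = solve(i + len(grapheme), j + 1)
                if rest is not None:
                    return ((grapheme, phonemes[j]),) + rest
        return None

    result = solve(0, 0)
    return list(result) if result is not None else None


print(align("like", ["L", "AY1", "K"]))  # [('l', 'L'), ('i', 'AY'), ('ke', 'K')]
```

In practice you would feed align the pronunciations looked up in the dictionary, and treat a None result as a sign that the table is missing a grapheme for that word.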