In python, how can I distinguish between a human readable word and a random string?

6.8k Views Asked by At

Examples of words:

  1. ball
  2. encyclopedia
  3. tableau

Examples of random strings:

  1. qxbogsac
  2. jgaynj
  3. rnnfdwpm

Of course it may happen that a random string will actually be a word in some language or look like one. But basically a human being is able to say it something looks 'random' or not, basically just by checking if you are able to pronounce it or not.

I was trying to calculate entropy to distinguish those two but it's far from perfect. Do you have any other ideas, algorithms that works?

There is one important requirement though, I can't use heavy-weight libraries like nltk or use dictionaries. Basically what I need is some simple and quick heuristic that works in most cases.

6

There are 6 best solutions below

0
On

Works quite well for me:

VOWELS = "aeiou"
PHONES = ['sh', 'ch', 'ph', 'sz', 'cz', 'sch', 'rz', 'dz']

def isWord(word):
    if word:
        consecutiveVowels = 0
        consecutiveConsonents = 0
        for idx, letter in enumerate(word.lower()):
            vowel = True if letter in VOWELS else False

            if idx:
                prev = word[idx-1]               
                prevVowel = True if prev in VOWELS else False
                if not vowel and letter == 'y' and not prevVowel:
                    vowel = True

                if prevVowel != vowel:
                    consecutiveVowels = 0
                    consecutiveConsonents = 0

            if vowel:
                consecutiveVowels += 1
            else:
                consecutiveConsonents +=1

            if consecutiveVowels >= 3 or consecutiveConsonents > 3:
                return False

            if consecutiveConsonents == 3:
                subStr = word[idx-2:idx+1]
                if any(phone in subStr for phone in PHONES):
                    consecutiveConsonents -= 1
                    continue    
                return False                

    return True
1
On

Caveat I am not a Natural Language Expert

Assuming what ever mentioned in the link If You Can Raed Tihs, You Msut Be Raelly Smrat is authentic, a simple approach would be

  1. Have an English (I believe its language antagonistic) dictionary
  2. Create a python dict of the words, with keys as the first and last character of the words in the dictionary

    words = defaultdict()
    with open("your_dict.txt") as fin:
         for word in fin:
            words[word[0]+word[-1]].append(word)
    
  3. Now for any given word, search the dictionary (remember key is the first and last character of the word)

    for matches in words[needle[0] + needle[-1]]:
    
  4. Compare if the characters in the value of the dictionary and your needle matches

    for match in words[needle[0] + needle[-1]]:
        if sorted(match) == sorted(needle):
             print "Human Readable Word"
    

A comparably slower approach would be to use difflib.get_close_matches(word, possibilities[, n][, cutoff])

0
On

If you really mean that your metric of randomness is pronounceability, you're getting into the realm of phonotactics: the allowed sequences of sounds in a language. As @ChrisPosser points out in his comment to your question, these allowed sequences of sounds are language-specific.

This question only makes sense within a specific language.

Whichever language you choose, you might have some luck with an n-gram model trained over the letters themselves (as opposed to the words, which is the usual approach). Then you can calculate a score for a particular string and set a threshold under which a string is random and over which a string is something like a word.

EDIT: Someone has done this already and actually implemented it: https://stackoverflow.com/a/6298193/583834

1
On

Use PyDictionary. You can install PyDictionary using following command.

easy_install -U PyDictionary

Now in code:

from PyDictionary import PyDictionary
dictionary=PyDictionary()

a = ['ball', 'asdfg']

for item in a:
  x = dictionary.meaning(item)
  if x==None:
    print item + ': Not a valid word'
  else:
    print item + ': Valid'

As far as I know, you can use PyDictionary for some other languages then english.

0
On

I wrote this logic to detect number of consecutive vowels and consonants in a string. You can choose the threshold based on the language.

def get_num_vowel_bunches(txt,num_consq = 3):
    len_txt = len(txt)
    num_viol = 0
    if len_txt >=num_consq:
        pos_iter = re.finditer('[aeiou]',txt)
        pos_mat = np.zeros((num_consq,len_txt),dtype=int)
        for idx in pos_iter:
            pos_mat[0,idx.span()[0]] = 1
        for i in np.arange(1,num_consq):
            pos_mat[i,0:-1] = pos_mat[i-1,1:]
        sum_vec = np.sum(pos_mat,axis=0)
        num_viol = sum(sum_vec == num_consq)
    return num_viol

def get_num_consonent_bunches(txt,num_consq = 3):
    len_txt = len(txt)
    num_viol = 0
    if len_txt >=num_consq:
        pos_iter = re.finditer('[bcdfghjklmnpqrstvwxz]',txt)
        pos_mat = np.zeros((num_consq,len_txt),dtype=int)
        for idx in pos_iter:
            pos_mat[0,idx.span()[0]] = 1
        for i in np.arange(1,num_consq):
            pos_mat[i,0:-1] = pos_mat[i-1,1:]
        sum_vec = np.sum(pos_mat,axis=0)
        num_viol = sum(sum_vec == num_consq)
    return num_viol
2
On

I developed a Python 3 package called Nostril for a problem closely related to what the OP asked: deciding whether text strings extracted during source-code mining are class/function/variable/etc. identifiers or random gibberish. It does not use a dictionary, but it does incorporate a rather large table of n-gram frequencies to support its probabilistic assessment of text strings. (I'm not sure if that qualifies as a "dictionary".) The approach does not check pronunciation, and its specialization may make it unsuitable for general word/nonword detection; nevertheless, perhaps it will be useful for either the OP or someone else looking to solve a similar problem.

Example: the following code,

from nostril import nonsense
real_test = ['bunchofwords', 'getint', 'xywinlist', 'ioFlXFndrInfo',
             'DMEcalPreshowerDigis', 'httpredaksikatakamiwordpresscom']
junk_test = ['faiwtlwexu', 'asfgtqwafazfyiur', 'zxcvbnmlkjhgfdsaqwerty']
for s in real_test + junk_test:
    print('{}: {}'.format(s, 'nonsense' if nonsense(s) else 'real'))

will produce the following output:

bunchofwords: real
getint: real
xywinlist: real
ioFlXFndrInfo: real
DMEcalPreshowerDigis: real
httpredaksikatakamiwordpresscom: real
faiwtlwexu: nonsense
asfgtqwafazfyiur: nonsense
zxcvbnmlkjhgfdsaqwerty: nonsense