How to fix langdetect's unstable results


I'd like to detect languages in texts using langdetect. According to the documentation, I have to set a seed to get stable results.

The language detection algorithm is non-deterministic, which means that if you try to run it on a text which is either too short or too ambiguous, you might get different results every time you run it. To enforce consistent results, call the following code before the first language detection:

As shown below, setting the seed does not seem to work: the results are still unstable. What did I miss?

from langdetect import detect, detector_factory, detect_langs

my_string = "Hi, my friend lives next to me. Can you call her? Thibault François. Envoyé depuis mon mobile"

detector_factory.seed = 42

for i in range(5):
    print(detect_langs(my_string), detect(my_string))

result example:

[fr:0.7142820855500301, en:0.28571744799229243] en
[fr:0.7142837342663328, en:0.2857140098811736] en
[en:0.571427940246422, fr:0.4285710874902514] fr
[en:0.5714284102904427, fr:0.42857076299207464] fr
[en:0.5714277269187811, fr:0.4285715961184375] fr
There are 2 answers below

BEST ANSWER

If you use DetectorFactory (the class, as suggested in the documentation) instead of detector_factory (the module), it works.

from langdetect import detect, DetectorFactory, detect_langs

my_string = "Hi, my friend lives next to me. Can you call her? Thibault François. Envoyé depuis mon mobile"

DetectorFactory.seed = 42

for i in range(5):
    print(detect_langs(my_string), detect(my_string))

result:

[en:0.5714271973455635, fr:0.42857096898887964] en
[en:0.5714271973455635, fr:0.42857096898887964] en
[en:0.5714271973455635, fr:0.42857096898887964] en
[en:0.5714271973455635, fr:0.42857096898887964] en
[en:0.5714271973455635, fr:0.42857096898887964] en
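To see why the original code had no effect, here is a minimal stdlib-only sketch of the pattern: in langdetect, detector_factory is a module and DetectorFactory is a class inside it, and the detector reads the seed from the class attribute. Setting seed on any other object is silently ignored. ToyFactory below is hypothetical and only mimics that pattern, not langdetect's actual internals.

```python
import random

class ToyFactory:
    seed = None  # the attribute the detector actually consults

    def detect(self):
        # seed=None -> fresh entropy on every call, hence unstable results
        rng = random.Random(self.seed)
        return rng.random()

# Stand-in for the detector_factory module object: assigning seed here
# just creates an unrelated attribute that ToyFactory never reads.
stand_in_module = type("detector_factory", (), {})()
stand_in_module.seed = 42

unstable = {ToyFactory().detect() for _ in range(5)}  # almost surely 5 distinct values

ToyFactory.seed = 42  # set on the class itself, as the docs intend
stable = {ToyFactory().detect() for _ in range(5)}
print(len(stable))  # 1: every call now uses the same seed
```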
Another answer:

langdetect is confused by this sentence, as it should be. The problem is figuring that out. Setting the seed gives results that are stable, but stable does not mean trustworthy: different seeds can give different answers. Consider the following very slight twist on that code:

for i in range(5):
    DetectorFactory.seed = 42+i
    print(detect_langs(my_string), detect(my_string))

Every time I run this, I get

[en:0.5714271973455635, fr:0.4285709689888797] en
[fr:0.7142849688010372, en:0.2857145735373333] fr
[fr:0.7142834322119054, en:0.2857163285762464] fr
[fr:0.5714278163020392, en:0.4285693437919268] fr
[fr:0.9999946932803276] fr

So if you had started off with a seed of 46 instead of 42, langdetect would have told you "I'm really sure that is French". This sort of inconsistent behavior seems to happen a lot with text that is equally split between two languages. The best strategy I could come up with to deal with this was the following:

  1. N times (N = 5 or 7 or ...), set DetectorFactory.seed in some stable way, run detect_langs(), and remember the result.
  2. If the N top languages are not all the same, conclude that langdetect is confused, perhaps because of multiple languages (as is the case here) or because the text was too short.
  3. If the top languages are all the same, look at the median score (or minimum, or ...). If that is too low, conclude likewise that langdetect is confused.
  4. Otherwise, accept langdetect's result.
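The steps above can be sketched as follows. stable_detect and its detect_top parameter are hypothetical names: detect_top(text, seed) is assumed to set DetectorFactory.seed and return (language, probability) for the top candidate, e.g. by wrapping langdetect's detect_langs.

```python
from statistics import median

def stable_detect(text, detect_top, n=5, min_score=0.6):
    # Run the detector under n different, stable seeds.
    results = [detect_top(text, 42 + i) for i in range(n)]
    langs = [lang for lang, _ in results]
    if len(set(langs)) > 1:
        return None  # top language varies with the seed: detector is confused
    if median(prob for _, prob in results) < min_score:
        return None  # seeds agree, but confidence is too low to trust
    return langs[0]

# Fake detectors standing in for langdetect, just to exercise the logic:
confident = lambda text, seed: ("fr", 0.95)
split = lambda text, seed: ("en", 0.57) if seed % 2 else ("fr", 0.71)

print(stable_detect("...", confident))  # fr
print(stable_detect("...", split))      # None
```

Returning None signals "confused"; a caller could then fall back to a default language or flag the text for review.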

To emphasize langdetect's confusion: If I use 89 as the seed, detect_langs returns

[en:0.7142830387547032, fr:0.2857155716263734]

Finally, this discussion also applies when langdetect is used through a spaCy pipeline (e.g. via spacy-langdetect). Without setting the seed, something like this:

    # nlp is a spaCy pipeline with a langdetect-based component
    doc = nlp(my_string)
    print(doc._.language["score"])
    print(doc._.language["score"])

may print two different values for the score.