Split string into segments according to the alphabet

I want to split a given string into segments according to the alphabets it contains. For example, suppose the following string is given:

Los eventos automovilísticos comenzaron poco después de la construcción exitosa de los primeros automóviles a gasolina. El veloz zorro marrón saltó sobre el perezoso perro.

Motoring events began soon after the construction of the first successful gasoline-fueled automobiles. The quick brown fox jumped over the lazy dog.

Мотори су почели убрзо након изградње првих успешних аутомобила на бензин.Брза смеђа лисица је прескочила лењог пса.

Автомобилните събития започнаха скоро след конструирането на първите успешни автомобили с бензиново гориво. Бързата кафява лисица прескочи мързеливото куче.

自動車イベントは、最初の成功したガソリン燃料自動車の製造直後に始まりました。 素早い茶色のキツネは怠け者の犬を飛び越えました。

بدأت أحداث السيارات بعد وقت قصير من بناء أول سيارة ناجحة تعمل بالبنزين. قفز الثعلب البني السريع فوق الكلب الكسول.

The above text contains Spanish, English, Serbian, Bulgarian, Japanese and Arabic paragraphs, in that order.

Then, after applying some magic function, I would like to get the following output:

{
    "langs": [
        {
            "alphabet": "latin",
            "text": "Los eventos automovilísticos comenzaron poco después de la construcción exitosa de los primeros automóviles a gasolina. El veloz zorro marrón saltó sobre el perezoso perro. Motoring events began soon after the construction of the first successful gasoline-fueled automobiles. The quick brown fox jumped over the lazy dog."
        },
        {
            "alphabet": "cyrillic",
            "text": "Мотори су почели убрзо након изградње првих успешних аутомобила на бензин.Брза смеђа лисица је прескочила лењог пса. Автомобилните събития започнаха скоро след конструирането на първите успешни автомобили с бензиново гориво. Бързата кафява лисица прескочи мързеливото куче."
        },
        {
            "alphabet": "japanese",
            "text": "自動車イベントは、最初の成功したガソリン燃料自動車の製造直後に始まりました。 素早い茶色のキツネは怠け者の犬を飛び越えました。"
        },
        {
            "alphabet": "arabic",
            "text": "بدأت أحداث السيارات بعد وقت قصير من بناء أول سيارة ناجحة تعمل بالبنزين. قفز الثعلب البني السريع فوق الكلب الكسول."
        }
    ]
}

As you can see, some of the languages are grouped by the alphabet family they share. For example, the Spanish and English paragraphs are grouped as latin, and the Serbian and Bulgarian paragraphs are grouped as cyrillic. This is because it is hard to identify a specific language from its characters alone, since most letters are shared between the languages that use the same alphabet.

Ideally, my final output should be like this:

{
    "langs": [
        {
            "lang": "spanish",
            "text": "Los eventos automovilísticos comenzaron poco después de la construcción exitosa de los primeros automóviles a gasolina. El veloz zorro marrón saltó sobre el perezoso perro."
        },
        {
            "lang": "english",
            "text": "Motoring events began soon after the construction of the first successful gasoline-fueled automobiles. The quick brown fox jumped over the lazy dog."
        },
        {
            "lang": "serbian",
            "text": "Мотори су почели убрзо након изградње првих успешних аутомобила на бензин.Брза смеђа лисица је прескочила лењог пса."
        },
        {
            "lang": "bulgarian",
            "text": "Автомобилните събития започнаха скоро след конструирането на първите успешни автомобили с бензиново гориво. Бързата кафява лисица прескочи мързеливото куче."
        },
        {
            "lang": "japanese",
            "text": "自動車イベントは、最初の成功したガソリン燃料自動車の製造直後に始まりました。 素早い茶色のキツネは怠け者の犬を飛び越えました。"
        },
        {
            "lang": "arabic",
            "text": "بدأت أحداث السيارات بعد وقت قصير من بناء أول سيارة ناجحة تعمل بالبنزين. قفز الثعلب البني السريع فوق الكلب الكسول."
        }
    ]
}

I need to split the text into sub-strings according to language. For that I am planning to use cld2, which can split text into sentences, but according to my experiments it does not do well when the string mixes alphabets (e.g. Cyrillic + Japanese). However, cld2 does well on text with mixed languages that share an alphabet family (e.g. French + English).

That is why I am planning to first split the text into sub-strings by alphabet family, and then apply cld2 to each family's sub-string to predict the specific language.
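A minimal sketch of that first step, using only the standard library so it runs offline. It classifies each letter by the prefix of its Unicode character name and lets spaces, digits and punctuation stay with the run they follow. The names `SCRIPT_PREFIXES`, `script_of`, `split_by_script` and `group_by_script` are made up for illustration, and labelling every CJK ideograph as "japanese" is an assumption that only suits the sample text (Chinese text would get the same label).

```python
import unicodedata

# Hypothetical mapping from Unicode character-name prefixes to the labels
# used in the desired output. CJK ideographs are assumed to be Japanese here.
SCRIPT_PREFIXES = (
    ("LATIN", "latin"),
    ("CYRILLIC", "cyrillic"),
    ("ARABIC", "arabic"),
    ("CJK", "japanese"),
    ("HIRAGANA", "japanese"),
    ("KATAKANA", "japanese"),
)


def script_of(ch):
    """Return a script label for a letter, or None for neutral characters."""
    if not ch.isalpha():
        return None  # spaces, digits, punctuation, combining marks
    name = unicodedata.name(ch, "")
    for prefix, label in SCRIPT_PREFIXES:
        if name.startswith(prefix):
            return label
    return "other"


def split_by_script(text):
    """Split text into (script, segment) runs, in order of appearance."""
    runs = []  # each run is [script, list_of_characters]
    for ch in text:
        script = script_of(ch)
        if runs and (script is None or script == runs[-1][0]):
            runs[-1][1].append(ch)      # continue the current run
        elif script is None:
            runs.append([None, [ch]])   # neutral characters before any letter
        elif runs and runs[-1][0] is None:
            runs[-1][0] = script        # first letter decides a neutral run
            runs[-1][1].append(ch)
        else:
            runs.append([script, [ch]])  # script changed: start a new run
    segments = [(script, "".join(chars).strip()) for script, chars in runs]
    return [(script, segment) for script, segment in segments if segment]


def group_by_script(text):
    """Merge all runs of the same script, keeping first-seen order."""
    grouped = {}
    for script, segment in split_by_script(text):
        grouped.setdefault(script, []).append(segment)
    return {script: " ".join(parts) for script, parts in grouped.items()}


if __name__ == "__main__":
    sample = "Hello world. Привет мир. 自動車イベント。 مرحبا"
    for script, segment in split_by_script(sample):
        print(script, "->", segment)
```

Because the split is done per character rather than per line, it does not depend on the languages being separated by line breaks. Each resulting segment could then be passed to cld2 for the language-level prediction.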

Other important requirements:

  • The mixed languages might not be cleanly separated by lines as in the example above (I did that for simplicity and to make the problem clear).
  • I need to be able to do this offline, without connecting to third-party servers such as Google, since there will be tons of data to handle.

I would appreciate any ideas that you might have on the above problems. Thanks in advance.

There are 1 best solutions below

The following solution makes use of Google Translate through the googletrans package, which sends each detection request to Google's servers. Install the 4.0.0 release candidate with pip install googletrans==4.0.0-rc1 to avoid potential issues. Other language detection packages available at the time of writing, such as langdetect and spacy_langdetect, failed to distinguish Serbian from Macedonian.

Note that, in my experience, language detection modules report ISO 639-1 language codes, so the output uses these codes. If you need the actual language name (e.g. "Spanish" instead of "es"), you'll have to write a simple loop that converts the keys of the produced languageDict. I believe this, along with creating JSON-style output, is beside the main point of your question, so I have opted to omit it.
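For completeness, that conversion could look roughly like the sketch below. `LANGUAGE_NAMES` and `to_report` are hypothetical names, the table only covers the six codes that appear in this example, and unknown codes are passed through unchanged. The function expects a dictionary shaped like languageDict (code mapped to a list of lines).

```python
import json

# Hypothetical lookup table: only the codes from this example are covered.
LANGUAGE_NAMES = {
    "es": "spanish",
    "en": "english",
    "sr": "serbian",
    "bg": "bulgarian",
    "ja": "japanese",
    "ar": "arabic",
}


def to_report(language_dict):
    """Turn {code: [lines]} into the {"langs": [...]} layout from the question."""
    return {
        "langs": [
            {"lang": LANGUAGE_NAMES.get(code, code), "text": " ".join(lines)}
            for code, lines in language_dict.items()
        ]
    }


if __name__ == "__main__":
    example = {"es": ["Hola."], "en": ["Hello.", "Bye."]}
    # ensure_ascii=False keeps non-Latin text readable in the JSON output
    print(json.dumps(to_report(example), ensure_ascii=False, indent=4))
```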

As a side note, should you need to group the various languages by alphabet, this can also be done with a simple loop over the produced languageDict. Group the ISO 639-1 language codes under their alphabets and then programmatically categorise the text(s) accordingly.
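A rough sketch of that grouping, again over a dictionary shaped like languageDict. `ALPHABET_OF` and `group_by_alphabet` are made-up names, and the table is an assumption that fits this example only: Serbian, for instance, is also commonly written in the Latin alphabet, which a code-based lookup cannot see.

```python
# Hypothetical table mapping ISO 639-1 codes to the alphabet labels used in
# the question. It assumes Serbian text is written in Cyrillic, as it is here.
ALPHABET_OF = {
    "es": "latin",
    "en": "latin",
    "sr": "cyrillic",
    "bg": "cyrillic",
    "ja": "japanese",
    "ar": "arabic",
}


def group_by_alphabet(language_dict):
    """Merge {code: [lines]} into {alphabet: [lines]}, keeping line order."""
    grouped = {}
    for code, lines in language_dict.items():
        alphabet = ALPHABET_OF.get(code, "unknown")  # unmapped codes fall back
        grouped.setdefault(alphabet, []).extend(lines)
    return grouped


if __name__ == "__main__":
    example = {"es": ["Hola."], "en": ["Hello."], "sr": ["Здраво."]}
    print(group_by_alphabet(example))
```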

Solution

from googletrans import Translator
from collections import defaultdict

text = """
Los eventos automovilísticos comenzaron poco después de la construcción exitosa de los primeros automóviles a gasolina. El veloz zorro marrón saltó sobre el perezoso perro.

Motoring events began soon after the construction of the first successful gasoline-fueled automobiles. The quick brown fox jumped over the lazy dog.

Мотори су почели убрзо након изградње првих успешних аутомобила на бензин.Брза смеђа лисица је прескочила лењог пса.

An additional English sentence to see how it handles this.

Автомобилните събития започнаха скоро след конструирането на първите успешни автомобили с бензиново гориво. Бързата кафява лисица прескочи мързеливото куче.

自動車イベントは、最初の成功したガソリン燃料自動車の製造直後に始まりました。 素早い茶色のキツネは怠け者の犬を飛び越えました。

بدأت أحداث السيارات بعد وقت قصير من بناء أول سيارة ناجحة تعمل بالبنزين. قفز الثعلب البني السريع فوق الكلب الكسول.
"""

translator = Translator()  # googletrans client; each detect() call sends a request to Google Translate
languageDict = defaultdict(list)  # Maps ISO 639-1 code -> list of lines detected as that language
for line in text.splitlines():  # Iterate over the text line by line
    if line.strip():  # Skip blank and whitespace-only lines
        detectedLang = translator.detect(line).lang  # Detected language code, e.g. 'es'
        languageDict[detectedLang].append(line)  # Store the line under its language key
print(dict(languageDict))

Output

{
'es': ['Los eventos automovilísticos comenzaron poco después de la construcción exitosa de los primeros automóviles a gasolina. El veloz zorro marrón saltó sobre el perezoso perro.'], 
'en': ['Motoring events began soon after the construction of the first successful gasoline-fueled automobiles. The quick brown fox jumped over the lazy dog.', 'An additional English sentence to see how it handles this.'], 
'sr': ['Мотори су почели убрзо након изградње првих успешних аутомобила на бензин.Брза смеђа лисица је прескочила лењог пса.'], 
'bg': ['Автомобилните събития започнаха скоро след конструирането на първите успешни автомобили с бензиново гориво. Бързата кафява лисица прескочи мързеливото куче.'], 
'ja': ['自動車イベントは、最初の成功したガソリン燃料自動車の製造直後に始まりました。 素早い茶色のキツネは怠け者の犬を飛び越えました。'], 
'ar': ['بدأت أحداث السيارات بعد وقت قصير من بناء أول سيارة ناجحة تعمل بالبنزين. قفز الثعلب البني السريع فوق الكلب الكسول.']
}