I want to split a given string into segments according to the alphabets it contains. So, for example, if the following string is given:
Los eventos automovilísticos comenzaron poco después de la construcción exitosa de los primeros automóviles a gasolina. El veloz zorro marrón saltó sobre el perezoso perro.
Motoring events began soon after the construction of the first successful gasoline-fueled automobiles. The quick brown fox jumped over the lazy dog.
Мотори су почели убрзо након изградње првих успешних аутомобила на бензин.Брза смеђа лисица је прескочила лењог пса.
Автомобилните събития започнаха скоро след конструирането на първите успешни автомобили с бензиново гориво. Бързата кафява лисица прескочи мързеливото куче.
自動車イベントは、最初の成功したガソリン燃料自動車の製造直後に始まりました。 素早い茶色のキツネは怠け者の犬を飛び越えました。
بدأت أحداث السيارات بعد وقت قصير من بناء أول سيارة ناجحة تعمل بالبنزين. قفز الثعلب البني السريع فوق الكلب الكسول.
The above text contains Spanish, English, Serbian, Bulgarian, Japanese, and Arabic paragraphs (the order of the languages follows the order of the paragraphs).
Then, after applying some magic function, I would like to get the following output:
{
"langs": [
{
"alphabet": "latin",
"text": "Los eventos automovilísticos comenzaron poco después de la construcción exitosa de los primeros automóviles a gasolina. El veloz zorro marrón saltó sobre el perezoso perro. Motoring events began soon after the construction of the first successful gasoline-fueled automobiles. The quick brown fox jumped over the lazy dog."
},
{
"alphabet": "cyrillic",
"text": "Мотори су почели убрзо након изградње првих успешних аутомобила на бензин.Брза смеђа лисица је прескочила лењог пса. Автомобилните събития започнаха скоро след конструирането на първите успешни автомобили с бензиново гориво. Бързата кафява лисица прескочи мързеливото куче."
},
{
"alphabet": "japanese",
"text": "自動車イベントは、最初の成功したガソリン燃料自動車の製造直後に始まりました。 素早い茶色のキツネは怠け者の犬を飛び越えました。"
},
{
"alphabet": "arabic",
"text": "بدأت أحداث السيارات بعد وقت قصير من بناء أول سيارة ناجحة تعمل بالبنزين. قفز الثعلب البني السريع فوق الكلب الكسول."
}
]
}
As you can see, some of the languages are grouped by their alphabet family. For example, the Spanish and English paragraphs were grouped as latin, and the Serbian and Bulgarian paragraphs as cyrillic. This is because it is hard to identify a specific language from the alphabet alone (most letters are shared between languages of the same family).
Ideally, my final output should be like this:
{
"langs": [
{
"lang": "spanish",
"text": "Los eventos automovilísticos comenzaron poco después de la construcción exitosa de los primeros automóviles a gasolina. El veloz zorro marrón saltó sobre el perezoso perro."
},
{
"lang": "english",
"text": "Motoring events began soon after the construction of the first successful gasoline-fueled automobiles. The quick brown fox jumped over the lazy dog."
},
{
"lang": "serbian",
"text": "Мотори су почели убрзо након изградње првих успешних аутомобила на бензин.Брза смеђа лисица је прескочила лењог пса."
},
{
"lang": "bulgarian",
"text":"Автомобилните събития започнаха скоро след конструирането на първите успешни автомобили с бензиново гориво. Бързата кафява лисица прескочи мързеливото куче."
},
{
"lang": "japanese",
"text": "自動車イベントは、最初の成功したガソリン燃料自動車の製造直後に始まりました。 素早い茶色のキツネは怠け者の犬を飛び越えました。"
},
{
"lang": "arabic",
"text": "بدأت أحداث السيارات بعد وقت قصير من بناء أول سيارة ناجحة تعمل بالبنزين. قفز الثعلب البني السريع فوق الكلب الكسول."
}
]
}
I need to split the text into sub-strings according to the language. For that I am planning to use cld2, which can split text into sentences. According to my experiments, however, it does not do well when the string contains text with mixed alphabets (e.g. Cyrillic + Japanese), although it does well on text with mixed languages that share an alphabet family (e.g. French + English). That is why I am planning to first split the text into sub-strings by alphabet family, and then apply cld2 to each family to predict the specific language.
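The first stage (splitting by alphabet family) can be done offline with nothing but the Python standard library, by classifying each character through its Unicode name. A minimal sketch of what I mean, where the family names and the keyword checks are my own assumptions and the function names are hypothetical:

```python
import unicodedata

def alphabet_of(ch):
    """Map a character to a coarse alphabet family via its Unicode name."""
    if not ch.isalpha():
        return None  # punctuation/digits/whitespace attach to the current run
    name = unicodedata.name(ch, "")
    if "LATIN" in name:
        return "latin"
    if "CYRILLIC" in name:
        return "cyrillic"
    if "CJK" in name or "HIRAGANA" in name or "KATAKANA" in name:
        return "japanese"
    if "ARABIC" in name:
        return "arabic"
    return "other"

def split_by_alphabet(text):
    """Group consecutive characters of the same alphabet family into segments."""
    segments = []
    for ch in text:
        fam = alphabet_of(ch)
        if fam is None:
            if segments:  # keep spaces and punctuation inside the current run
                segments[-1]["text"] += ch
        elif segments and segments[-1]["alphabet"] == fam:
            segments[-1]["text"] += ch
        else:
            segments.append({"alphabet": fam, "text": ch})
    for seg in segments:
        seg["text"] = seg["text"].strip()
    return segments
```

Wrapping the result as `{"langs": split_by_alphabet(text)}` would give the first JSON shape above. Because non-letters attach to the current run, the split does not depend on line breaks, which matters for the requirement below that languages may not be cleanly separated by lines.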
Other important requirements:
- the mixed languages might not be separated cleanly by line breaks as in the example above (I did that for the sake of simplicity and to make the problem clear)
- I need to be able to do this offline, without connecting to third-party servers such as Google's (since there will be a huge amount of data to handle)
I would appreciate any ideas that you might have on the above problems. Thanks in advance.
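To make the second, per-family stage of the plan concrete: cld2 (or any offline detector) would be the real tool there, but a deliberately toy stand-in shows the shape of that step. Everything here (function name, stopword lists) is my own illustration and far too small for real use:

```python
# Toy offline detector for the per-family stage: scores Latin-script text
# against tiny hand-made stopword sets. A placeholder for cld2, not a
# substitute: the word lists below are illustrative only.
STOPWORDS = {
    "english": {"the", "of", "over", "after"},
    "spanish": {"los", "de", "la", "el", "sobre"},
}

def detect_latin_language(text):
    """Return the candidate language whose stopwords best match the text."""
    words = {w.strip(".,").lower() for w in text.split()}
    scores = {lang: len(words & sw) for lang, sw in STOPWORDS.items()}
    return max(scores, key=scores.get)
```

A real detector replaces this function; the surrounding pipeline (split by alphabet family, then detect per family) stays the same.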
The following solution makes use of Google Translate. Ensure that you use
pip install googletrans==4.0.0-rc1
to install the 4.0.0 release candidate and avoid potential issues. Other language detection packages available at the time of writing, such as langdetect and spacy_langdetect, failed to distinguish Serbian from Macedonian.

Note that, in my experience, all language detection modules conform to ISO 639-1 language codes, so the output will use those codes. If you need the actual language name (e.g. "Spanish" instead of "es"), you will have to write a simple loop that performs the conversion using the produced languageDict. I believe this, along with producing JSON-style output, is beside the main point of your question, so I have opted to omit it.

As a side note, should you need to group the various languages by their alphabet, this can also be done with a simple loop over the produced languageDict: group the ISO 639-1 language codes under their alphabets, then programmatically categorise the text(s) accordingly.

Solution
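(Not the omitted solution code, just a minimal sketch of the grouping loop described above. It assumes languageDict maps each detected ISO 639-1 code to its text, which may differ from the actual dictionary's shape, and the ALPHABETS mapping is my own and not exhaustive.)

```python
# Hand-made mapping of ISO 639-1 codes to alphabet families (illustrative).
ALPHABETS = {
    "es": "latin", "en": "latin", "fr": "latin",
    "sr": "cyrillic", "bg": "cyrillic", "ru": "cyrillic",
    "ja": "japanese",
    "ar": "arabic",
}

def group_by_alphabet(languageDict):
    """Merge detected texts whose languages share an alphabet family."""
    grouped = {}
    for code, text in languageDict.items():
        family = ALPHABETS.get(code, "unknown")
        grouped.setdefault(family, []).append(text)
    return {family: " ".join(texts) for family, texts in grouped.items()}
```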
Output