Automatic Word Boundary Detection for German


I have a bunch of German texts, but all whitespace has been lost. Now I need to perform some kind of word boundary detection to get from "NamensänderungimNamenderIntegration" to ["Namensänderung", "im", "Namen", "der", "Integration"].

To rephrase: I need a corpus of German words so that I can check whether a segment is a word. My approach so far is to take the string, check whether it is in the dictionary, and if not, delete the last character and check again, and so on. All I need now is a German word list. Does anyone know of one?

I found the Python package wordsegment, and it works okay, but not ideally. I also found german_compound_splitter, but that would also split "Namensänderung" into "Namens" and "änderung". Does anyone have experience with this, or know how I could build a solution?
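The greedy longest-prefix idea described above can be sketched as follows. The tiny word set here is a stand-in: in practice you would load a real German word list (one word per line) into a set.

```python
# Stand-in dictionary; replace with a real German word list loaded from a file.
GERMAN_WORDS = {"namensänderung", "im", "namen", "der", "integration"}

def greedy_segment(text, words=GERMAN_WORDS):
    """Repeatedly strip the longest dictionary prefix from the text."""
    result = []
    rest = text.lower()
    while rest:
        # Try the longest prefix first, shrinking until a word matches.
        for end in range(len(rest), 0, -1):
            if rest[:end] in words:
                result.append(rest[:end])
                rest = rest[end:]
                break
        else:
            # No prefix matched: emit the single character and move on.
            result.append(rest[0])
            rest = rest[1:]
    return result

print(greedy_segment("NamensänderungimNamenderIntegration"))
# → ['namensänderung', 'im', 'namen', 'der', 'integration']
```

Note that greedy longest-match can mis-segment when a long dictionary word happens to span a real word boundary, which is one reason frequency-based segmenters like wordsegment usually do better than a plain dictionary scan.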


There is 1 answer below.

Answered by mac

If the input text has no spaces and you need to automatically detect word boundaries in German text, you may need a dedicated German word segmentation library or a language model trained specifically for this purpose. One library you can use is wordsegment, which provides word segmentation functionality.

However, it's worth noting that wordsegment is primarily designed for English, and while it may work for some German text, it might not be as accurate as a model specifically trained for German.

Install the wordsegment library: pip install wordsegment

import wordsegment

# Load the unigram/bigram data; segment() will not work without this.
wordsegment.load()

text = "IchbineinStudentausDeutschland."

segmented_text = wordsegment.segment(text)

print(segmented_text)
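wordsegment scores candidate splits with English unigram/bigram counts, but the same dynamic-programming idea can be applied directly with a German frequency list. The sketch below is an assumption-laden illustration: `FREQ` contains made-up counts, and real frequencies would come from a German corpus (e.g. the Leipzig Corpora Collection).

```python
# Hypothetical German unigram counts; replace with real corpus frequencies.
FREQ = {"namensänderung": 50, "namens": 300, "änderung": 400,
        "im": 50000, "namen": 2000, "der": 100000, "integration": 800}
TOTAL = sum(FREQ.values())

def word_prob(word):
    """Relative frequency of a word; 0.0 for unknown words."""
    return FREQ.get(word, 0) / TOTAL

def best_segmentation(text, max_len=30):
    """Dynamic programming: best[i] is the most probable split of text[:i]."""
    text = text.lower()
    n = len(text)
    # best[i] = (probability of the split, list of words), or None if
    # text[:i] cannot be covered by dictionary words at all.
    best = [(1.0, [])] + [None] * n
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            if best[j] is None:
                continue
            p = word_prob(text[j:i])
            if p == 0.0:
                continue
            prob = best[j][0] * p
            if best[i] is None or prob > best[i][0]:
                best[i] = (prob, best[j][1] + [text[j:i]])
    return best[n][1] if best[n] is not None else None

print(best_segmentation("NamensänderungimNamenderIntegration"))
# → ['namensänderung', 'im', 'namen', 'der', 'integration']
```

Because the whole word "Namensänderung" is more probable here than the product of "Namens" and "änderung", the dynamic program keeps the compound intact, which is exactly the behavior the question asks for.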