When processing a text containing multiple German sentences, such as text = 'Frau Dr. Peters ist heute nicht erreichbar. Kommen Sie bitte morgen wieder.', I want to perform further processing steps on each sentence. For this I need (in my case) only the tokens as plain strings.
Here's an example:
import spacy
nlp = spacy.load('de_core_news_sm')
text = 'Frau Dr. Peters ist heute nicht erreichbar. Kommen Sie bitte morgen wieder.'
doc = nlp(text)
# Convert a Doc (or a sentence Span turned into a Doc) into a list of lowercased token strings
def tokens_to_list(doc):
    tokens = []
    for token in doc:
        tokens.append(token.lower_)
    return tokens

sentences = list(doc.sents)
tokens_sent = []
for sent in sentences:
    tokens = tokens_to_list(sent.as_doc())
    tokens_sent.append(tokens)
print(tokens_sent)
I would expect to see this in my console:
[['frau', 'dr.', 'peters', 'ist', 'heute', 'nicht', 'erreichbar', '.'], ['kommen', 'sie', 'bitte', 'morgen', 'wieder', '.']]
Instead, this is the output (I added some formatting for better readability):
['frau', 'dr.', 'peters', 'ist', 'heute', 'nicht', 'erreichbar', '.',
['frau', 'dr.', 'peters', 'ist', 'heute', 'nicht', 'erreichbar', '.',
[...],
'kommen', 'sie', 'bitte', 'morgen', 'wieder', '.',
[...]
],
'kommen', 'sie', 'bitte', 'morgen', 'wieder', '.'
['frau', 'dr.', 'peters', 'ist', 'heute', 'nicht', 'erreichbar', '.',
[...],
'kommen', 'sie', 'bitte', 'morgen', 'wieder', '.',
[...]
]
]
As you can see, there seems to be some kind of recursion in my list. Further inspection shows that each [...] element contains the same elements as the layer above it, and this nesting continues indefinitely.
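For reference (this is not part of my original code), the [...] marker is how Python's repr displays a list that contains itself, which is why I suspect a self-reference somewhere:

```python
# Minimal demo: Python prints a self-referential list as [...]
lst = ['a', 'b']
lst.append(lst)   # the list now contains itself
print(lst)        # ['a', 'b', [...]]
```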
I can't figure out why this happens or how to achieve the expected output.
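For comparison, when I run the same loop over plain nested lists (dummy data standing in for the spaCy sentences), I get exactly the structure I expect:

```python
# Dummy stand-in for the two spaCy sentences: plain lists of token strings
sentences = [
    ['Frau', 'Dr.', 'Peters', 'ist', 'heute', 'nicht', 'erreichbar', '.'],
    ['Kommen', 'Sie', 'bitte', 'morgen', 'wieder', '.'],
]

tokens_sent = []
for sent in sentences:
    tokens = [tok.lower() for tok in sent]  # fresh list on every iteration
    tokens_sent.append(tokens)

print(tokens_sent)
# [['frau', 'dr.', 'peters', 'ist', 'heute', 'nicht', 'erreichbar', '.'],
#  ['kommen', 'sie', 'bitte', 'morgen', 'wieder', '.']]
```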