Process subset of a Document with spacy


I am processing text that contains multiple German sentences, e.g. text = 'Frau Dr. Peters ist heute nicht erreichbar. Kommen Sie bitte morgen wieder.'. I want to do some further processing on each sentence separately, and for this I need (in my case) only the tokens as lowercase text.

Here's an example:

import spacy

nlp = spacy.load('de_core_news_sm')
text = 'Frau Dr. Peters ist heute nicht erreichbar. Kommen Sie bitte morgen wieder.'

doc = nlp(text)

# Convert a Doc (or sentence Span) to a list of lowercase token texts
def token_to_list(doc):
    tokens = []
    for token in doc:
        tokens.append(token.lower_)
    return tokens

sentences = list(doc.sents)
tokens_sent = []
for sent in sentences:
    tokens = token_to_list(sent.as_doc())
    tokens_sent.append(tokens)

print(tokens_sent)

I would expect to see this in my console:

[['frau', 'dr.', 'peters', 'ist', 'heute', 'nicht', 'erreichbar', '.'], ['kommen', 'sie', 'bitte', 'morgen', 'wieder', '.']]

Instead, this is the output (I added some formatting for better readability):

['frau', 'dr.', 'peters', 'ist', 'heute', 'nicht', 'erreichbar', '.', 
   ['frau', 'dr.', 'peters', 'ist', 'heute', 'nicht', 'erreichbar', '.', 
      [...], 
   'kommen', 'sie', 'bitte', 'morgen', 'wieder', '.',
      [...]
   ], 
'kommen', 'sie', 'bitte', 'morgen', 'wieder', '.'
   ['frau', 'dr.', 'peters', 'ist', 'heute', 'nicht', 'erreichbar', '.', 
      [...], 
   'kommen', 'sie', 'bitte', 'morgen', 'wieder', '.',
      [...]
   ]
] 

As you can see, there appears to be some kind of recursion in my list. Further inspection shows that each [...] element contains the same elements as the level above it and recurses into itself. I can't figure out why this happens or how to achieve the expected output.
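
For reference, here is a minimal sketch of the per-sentence structure I am aiming for, written as a plain list comprehension over the sentence Spans (a Span is itself iterable over its Tokens, so this variant skips as_doc() entirely). I include it only to make the intended output concrete:

import spacy

nlp = spacy.load('de_core_news_sm')
doc = nlp('Frau Dr. Peters ist heute nicht erreichbar. Kommen Sie bitte morgen wieder.')

# One list of lowercase token texts per sentence; a Span yields its
# Tokens directly, so no conversion to a Doc is needed for this step.
tokens_per_sentence = [[token.lower_ for token in sent] for sent in doc.sents]
print(tokens_per_sentence)
# Should print something like:
# [['frau', 'dr.', 'peters', 'ist', 'heute', 'nicht', 'erreichbar', '.'],
#  ['kommen', 'sie', 'bitte', 'morgen', 'wieder', '.']]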
