Tough regex question: I want to use regexes to extract information from news sentences about crackdowns. Here are some examples:
doc1 = "5 young students arrested"
doc2 = "10 rebels were reported killed"
I want to match sentences based on lists of entities and outcomes:
entities = ['students','rebels']
outcomes = ['arrested','killed']
How can I use a regex to extract the number of participants from 0-99999, any of the entities, any of the outcomes, all while ignoring random text (such as 'young' or 'were reported')? This is what I have:
re.findall(r'\d{1,5} \D{1,50}'+ '|'.join(entities) + '\D{1,50}' + '|'.join(outcomes),doc1)
i.e., a number, some optional random text, an entity, some more optional random text, and an outcome. Something is going wrong, I think because of the OR statements. Thanks for your help!
This regex should match your two examples:
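What you were missing were parentheses around the ORs. The | operator has the lowest precedence in a regex, so without grouping, each alternation splits the entire pattern into separate alternatives instead of choosing only among the entities or only among the outcomes.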
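As a rough sketch of what the grouped version could look like (not necessarily the exact pattern from this answer), the snippet below wraps each joined list in parentheses. It assumes the filler text contains no digits, makes the filler lazy so it stops at the first entity or outcome, and lets the filler absorb the separating spaces. The names `entity_alt`, `outcome_alt`, `pattern` and `extract` are made up for illustration.

```python
import re

entities = ['students', 'rebels']
outcomes = ['arrested', 'killed']

# Escape each word in case a list entry contains regex metacharacters.
entity_alt = '|'.join(map(re.escape, entities))
outcome_alt = '|'.join(map(re.escape, outcomes))

# Parentheses confine each alternation to its own slot in the pattern.
# \D{1,50}? is lazy filler (assumed digit-free) that also covers the spaces.
pattern = re.compile(
    r'(\d{1,5})\D{1,50}?(' + entity_alt + r')\D{1,50}?(' + outcome_alt + r')'
)

def extract(doc):
    """Return a list of (number, entity, outcome) tuples found in doc."""
    return pattern.findall(doc)

doc1 = "5 young students arrested"
doc2 = "10 rebels were reported killed"
print(extract(doc1))
print(extract(doc2))
```

Because the pattern has three capturing groups, `re.findall` returns one tuple per match rather than the whole matched string.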
However, regex alone likely won't give you good results on real news text. Consider using a Natural Language Processing library such as NLTK, which can parse sentences.