Python regex to match any valid English sentence


I was wondering whether it is possible to write a Python regex that matches any valid English sentence, which may contain alphanumeric and special characters.
Basically, I want to extract some specific elements from an XML file. These elements have the following form:

<p o=<Any Number>> <Any English sentence> </p>  

For example:

<p o ="1"> The quick brown fox jumps over the lazy dog </p>

or

<p o ="2">  And This is a number 12.90! </p>

We can easily write a regex for the

<p o=<Any Number>>

and </p> tags, but I am interested in extracting the sentences that lie between these tags by using a regex group.

Can anyone please suggest a regex for the problem above?

Also, if you can suggest a workaround, that would be really helpful as well.


There are 2 best solutions below

Best answer

Use an XML parser such as lxml; regex is not suitable for this task. Example:

import lxml.etree

# First, parse the XML
doc = lxml.etree.fromstring('<p o ="2">  And This is a number 12.90! </p>')
# Then use XPath to extract the text of the element we need
sentences = doc.xpath('/p/text()')
print(sentences)  # a list containing the text inside <p>

You can read more about XPath in an XPath tutorial.
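If lxml is not available, the same idea can be sketched with the standard library's xml.etree.ElementTree. The snippet below is only an illustration under a few assumptions: the <doc> wrapper element, the SAMPLE string, and the helper name extract_sentences are made up for this example, and "any number" is taken to mean that the o attribute consists of decimal digits.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: the <doc> wrapper is an assumption, since an XML
# document needs a single root element around the <p> elements.
SAMPLE = """<doc>
<p o ="1"> The quick brown fox jumps over the lazy dog </p>
<p o ="2">  And This is a number 12.90! </p>
<p o ="x"> skipped because o is not a number </p>
<q o ="3"> skipped because this is not a p element </q>
</doc>"""


def extract_sentences(xml_text):
    """Return (number, sentence) pairs for every <p> whose o attribute is numeric."""
    root = ET.fromstring(xml_text)
    result = []
    for p in root.iter("p"):
        number = p.get("o", "")
        if number.isdecimal():
            # p.text is the text directly inside the element; strip the padding
            result.append((int(number), (p.text or "").strip()))
    return result


print(extract_sentences(SAMPLE))
```

The parser takes care of the spacing around the attribute and of quoting, so the code only has to decide which elements to keep.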

Second answer

You really should use an XML parser. There is an example at http://www.travisglines.com/web-coding/python-xml-parser-tutorial.
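That said, if the input is known to be as flat and regular as the examples in the question (no nested tags, no attributes other than o), a regex group can pull the sentence out as a fallback. This is only a rough sketch: the pattern, the name P_TAG, and the helper sentences_by_regex are assumptions for illustration, and it will break on real-world XML features such as comments, CDATA, entities, or nested markup.

```python
import re

# Group 1 captures the number in o="...", group 2 captures everything up to
# the nearest </p>. DOTALL lets a sentence span several lines.
P_TAG = re.compile(r'<p\s+o\s*=\s*"(\d+)"\s*>(.*?)</p>', re.DOTALL)


def sentences_by_regex(text):
    """Return (number, sentence) pairs found by the pattern above."""
    return [(int(num), body.strip()) for num, body in P_TAG.findall(text)]


sample = (
    '<p o ="1"> The quick brown fox jumps over the lazy dog </p>\n'
    '<p o ="2">  And This is a number 12.90! </p>'
)
print(sentences_by_regex(sample))
```

Note that the pattern does not try to decide what a "valid English sentence" is; it simply captures whatever lies between the opening and closing tags, which is also all the parser-based approach does.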