I am trying to run the code below to extract Yelp reviews:
from lxml import html
import requests
import csv
page = requests.get('http://www.yelp.com/biz/guisados-los-angeles')
tree = html.fromstring(page.content)  # parse the response body; tree was otherwise undefined
review = tree.xpath('//p[@itemprop="description"]/text()')
Now, I have a review like the one below:
These tacos are the business. We ventured into an unpretentious, relatively small restaurant who offered a photographic menu (VERY helpful) of the different tacos they have.
The single review above is being split into the list below:
[
'These tacos are the business.',
'We ventured into an unpretentious, relatively small restaurant who offered a photographic menu (VERY helpful) of the different tacos they have.'
]
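The split can be reproduced offline. This is a minimal sketch, assuming the review paragraph contains a <br> between the two sentences; the SAMPLE markup is made up for illustration and is not copied from Yelp.

```python
from lxml import html

# Hypothetical stand-in for one review paragraph with a <br> inside it.
SAMPLE = (
    '<div><p itemprop="description">'
    'These tacos are the business.<br>We ventured into a small restaurant.'
    '</p></div>'
)

tree = html.fromstring(SAMPLE)

# text() selects each text node that is a direct child of <p>,
# so the text before and after the <br> comes back as separate items.
parts = tree.xpath('//p[@itemprop="description"]/text()')
print(parts)
# ['These tacos are the business.', 'We ventured into a small restaurant.']
```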
How do I get lxml's text() to ignore the <br> in the comment? Any pointers, please?
As far as I understand, you want each review text as a single string.
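Iterate over the p elements with itemprop="description" and call .text_content() on each one. It returns the text of the element and all of its descendants as one string, so each review comes back as a single item.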
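A minimal sketch of that approach, run against made-up markup rather than the live Yelp page (SAMPLE and its contents are assumptions for illustration):

```python
from lxml import html

# Hypothetical markup: two review paragraphs, the first one containing a <br>.
SAMPLE = (
    '<div>'
    '<p itemprop="description">These tacos are the business.<br>We ventured in.</p>'
    '<p itemprop="description">Second review.</p>'
    '</div>'
)

tree = html.fromstring(SAMPLE)

# Select the <p> elements themselves (no trailing /text()),
# then collapse each element's text into one string.
reviews = [p.text_content() for p in tree.xpath('//p[@itemprop="description"]')]

for review in reviews:
    print(review)
# These tacos are the business.We ventured in.
# Second review.
```

For the page in the question, the same list comprehension should work on a tree built from the downloaded response, for example html.fromstring(page.content).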
Note that spaces and newlines at the <br> positions are not preserved in the review text. This is something you can fix (if needed), see:
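One possible way to keep the breaks, sketched here as an assumption and not as a reference to any particular answer, is to prepend a separator to the tail of every <br> before calling .text_content(). The helper name text_with_breaks and the SAMPLE markup are hypothetical.

```python
from lxml import html

# Hypothetical markup standing in for one review paragraph.
SAMPLE = (
    '<div><p itemprop="description">'
    'These tacos are the business.<br>We ventured in.'
    '</p></div>'
)


def text_with_breaks(element, sep="\n"):
    """Hypothetical helper: like text_content(), but each <br> becomes `sep`.

    Note: this modifies the parsed tree in place.
    """
    for br in element.xpath('.//br'):
        # The text that follows a <br> is stored in its .tail attribute.
        br.tail = sep + (br.tail or "")
    return element.text_content()


tree = html.fromstring(SAMPLE)
reviews = [text_with_breaks(p) for p in tree.xpath('//p[@itemprop="description"]')]
print(reviews)
# ['These tacos are the business.\nWe ventured in.']
```

Passing sep=" " instead would join the sentences with a single space, which may be closer to how the review reads on the page.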