Parsing Yelp using lxml - ignore html tag

Question

Asked by Arun on 22 December 2014 at 02:33 (383 views)

I am trying to run the code below to extract Yelp reviews:

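from lxml import html
import requests
import csv

# Download the page and parse it into an element tree
page = requests.get('http://www.yelp.com/biz/guisados-los-angeles')
tree = html.fromstring(page.text)

# text() returns every text node of each review paragraph as a separate item
review = tree.xpath('//p[@itemprop="description"]/text()')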

One of the reviews on the page reads as follows:

These tacos are the business.

We ventured into an unpretentious, relatively small restaurant who offered a photographic menu (VERY helpful) of the different tacos they have.

This single review is being split into the list below:

[
    'These tacos are the business.', 
    'We ventured into an unpretentious, relatively small restaurant who offered a photographic menu (VERY helpful) of the different tacos they have.'
]

How do I get lxml text() to ignore the <br> in the comment? Any pointers, please?

There is 1 best solution below

**alecxe** · Accepted Answer · 2014-12-22T02:44:35.637000

As far as I understand, you want each review text as a single string.

Iterate over p elements with itemprop="description" and get the .text_content():

for review in tree.xpath('//p[@itemprop="description"]'):
    # text_content() returns all text inside the element as one string
    print(review.text_content())  # alternatively: ' '.join(review.xpath('text()'))
    print("----")

Prints:

These tacos are the business.We ventured into an unpretentious, relatively small restaurant who offered a photographic menu (VERY helpful) of the different tacos they have.  The friendliest of cashiers and servers greeted us.  My group and I each got the sampler with additional pescado and camarones tacos, and a quesadilla.  We pretty much ordered the whole menu and they were patient as we picked out the individual tacos for our samplers.  To echo what others have said, the corn tortillas are BOMB.com, but the braised meats also hold their own.  The whole experience is one amazing party in your mouth.  Their horchata is also a must-order.The habanero salsa that they have (I forget what they call it) really is a thing of beauty.  The spice kicks you on the tongue well after the salsa has slid down your throat.  If you're a chili eater like me, you need to get an extra side of this!  It won't let you down.  Also, they serve Stumptown coffee!  A+
----
We've been meaning to pay this place a visit for over a year. Now I wonder why we waited so long? This isn't your "traditional" taqueria, so if you're craving street taco style food, go to Tacos Gavilan. If you want a trip down memory lane with the bursting flavors and spices of old-school guisos in the form of tacos, then this is the spot. My partner and I each had a sampler, each trying a different variation of the plate.All of the options are delicious, the only one that left me feeling just a bit disappointed was the tacos de calabasitas, it was ok, but it just felt a bit bland. If you like spicy, try the tinga. The cochinita pibil also comes in a very spicy sauce, but for the sampler they use a very mild sauce. My personal favorite was the Hongos con Cilantro. Yes, it's meatless and I'm addicted. Pure perfection, bursting with flavor. To drink, we ordered an Horchata and  an Armando Palmero. The latter is their version of an Arnold Palmer, mixing Jamaica with Lemonade, it was good and refreshing, best as a summer drink. However, I'm hooked on their horchata, this is the real deal, none of that nasty powdered fake horchata. This horchata is sweetened just right and you can taste and feel the graininess of the toasted rice, pure oldschool deliciousness. Best Horchata in LA! Overall, we plan on returning here and recommend you try it at least once.
----
...

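Note that the line breaks from the <br> tags are not preserved in the review text: text_content() joins the text on either side of a <br> with no separator (for example, "business.We" above). This is something you can fix (if needed), see: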

How can I preserve <br> as newlines with lxml.html text_content() or equivalent?
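
One way to keep the line breaks can be sketched as follows. This is a minimal sketch under stated assumptions, not necessarily the approach from the linked question: the SAMPLE markup is a hypothetical stand-in for a Yelp review (the real page structure may differ), and review_texts is an illustrative helper name. It relies on the fact that lxml stores the text following an element in that element's .tail, so prepending a separator to each <br> tail makes text_content() include it.

```python
from lxml import html

# Hypothetical markup standing in for a single Yelp review; the real page
# structure is an assumption and may differ.
SAMPLE = (
    '<div><p itemprop="description">These tacos are the business.<br><br>'
    'We ventured into an unpretentious, relatively small restaurant.</p></div>'
)


def review_texts(tree, separator="\n"):
    """Return one string per review, with each <br> rendered as `separator`."""
    reviews = []
    for p in tree.xpath('//p[@itemprop="description"]'):
        for br in p.xpath('.//br'):
            # A <br> has no text of its own, so put the separator in front of
            # whatever text follows it (its "tail"). This mutates the tree.
            br.tail = separator + (br.tail or "")
        reviews.append(p.text_content())
    return reviews


# Without the fix, the text on either side of the <br> tags is glued together.
plain = html.fromstring(SAMPLE).xpath('//p[@itemprop="description"]')[0].text_content()
print(repr(plain))

# With the fix, each <br> becomes a newline inside a single string.
fixed = review_texts(html.fromstring(SAMPLE))
print(repr(fixed[0]))
```

Because the helper modifies the parsed tree in place, parse the page again if the unmodified text is also needed. Passing separator=" " instead yields each review on a single line with spaces where the <br> tags were.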
