Scrapy xpath construction for tables of data - yielding empty brackets


I am trying to build XPath expressions for data items I would like to extract from several hundred pages of a site that all share the same format. An example page is https://weedmaps.com/dispensaries/cannabicare

As can be seen, the page has headings, and under each heading are rows of item names and prices. I am trying to extract the sections, the item names, and the item prices (per gram, eighth, ounce, or, for edibles, per unit) and keep them all categorized. Some example Scrapy item fields are the following:

Sativa_Item_Name = scrapy.Field()
Sativa_item_price_gram = scrapy.Field()
Sativa_item_price_eighth = scrapy.Field()
Sativa_item_price_quarter = scrapy.Field()
Edible_Item_Name = scrapy.Field()
Edible_item_Price_Each = scrapy.Field()

And so on and so forth. I am able to extract all item names and all price/gram with xpaths such as the following:

response.xpath('.//div/span[@class="item_name"]/text()').extract()
response.xpath('//div[@data-price-name="price_gram"]/span/text()').extract()

What I can't figure out is how to restrict the extraction to a single heading container: for example, only the price per gram for items in the Hybrid category, or the item name and the price for each item in the Edible category.

The categories are separated by container ids such as id="menu_item_category_4", but when I do something like:

response.xpath('//div[@id="menu_item_category_4"]/span[@class="item_name"]/text()').extract()

it yields empty brackets and no results. Any guidance on this would be beyond appreciated. Thank you so much for taking the time to look at this!


There are 2 best solutions below

Best answer

What you see in your browser's inspector is the DOM after JavaScript (presumably Angular) has rearranged the page. Scrapy does not run JavaScript, so your XPath has to match the raw HTML source instead.

If you run the HTML source through an HTML beautifier and search for <span class="item_name">, you'll see repeating blocks like this:

<div class="menu_item" data-category-id="1" data-category-name="Indica" data-json="{}" id="menu_item_5390083" style="position: relative; overflow: visible;">
    <div class="js-edit"><a class="btn" href="/new_admin/dispensaries/cannabicare/menu_items/banana-og-member-pricing/edit"><i class="icon-edit">Edit</i></a></div>
    <div class="menu-item-form-container js-form" style="display: none;"></div>
    <div class="menu-item-content js-content">
        <div class="row">
            <div class="col-md-4 name"><span class="item_name">Banana OG - Member Pricing</span></div>
            <div class="col-md-8 js-prices prices menu-item-prices">
                <div class="col-sm-2 col-md-2 price-container" data-price-name="price_gram"><span class="price">9 </span><span class="price-label">g</span></div>
                <div class="col-sm-2 col-md-2 price-container" data-price-name="price_eighth"><span class="price">30 </span><span class="price-label">1/8</span></div>
                <div class="col-sm-2 col-md-2 price-container" data-price-name="price_quarter"><span class="price">60 </span><span class="price-label">1/4</span></div>
                <div class="col-sm-2 col-md-2 price-container" data-price-name="price_half_ounce"><span class="price">90 </span><span class="price-label">1/2</span></div>
                <div class="col-sm-2 col-md-2 price-container" data-price-name="price_ounce"><span class="price">165 </span><span class="price-label">oz</span></div>
            </div>
        </div>
        <div class="row item-options" style="display: none;">
            <div class="col-md-3 text"></div>
            <div class="col-md-2 category-id">
                <div class="category-id-select" style="display: none;"></div>
            </div>
            <div class="current-category-id" id="current-category-menu-item-5390083" style="display: none;">1</div>
        </div>
        <div class="row">
            <div class="col-md-12 dispensary_name"><a href="/dispensaries/cannabicare">Cannabicare</a></div>
        </div>
        <div style="height:1px"></div>
        <div class="row item_details">
            <div class="col-md-10">75% Indica / 25% Sativa</div>
        </div>
    </div>
</div>

This is the HTML you'll need to work on.

And you could extract the data using something like:

# Each div.menu_item is one menu entry; its category is stored in data-category-name.
for item in response.css('div.menu_item'):
    print "--- Category:", item.xpath('@data-category-name').extract()
    # The first row of the content block holds the item name and its prices.
    for row in item.css('div.menu-item-content > div.row:first-child'):
        print row.xpath('string(.//span[@class="item_name"])').extract()
        for price in row.css('div.prices > div.price-container'):
            print "Price:", price.xpath('@data-price-name').extract(), price.css('span.price::text').extract()

which outputs:

--- Category: [u'Indica']
[u'Banana OG - Member Pricing']
Price: [u'price_gram'] [u'9 ']
Price: [u'price_eighth'] [u'30 ']
Price: [u'price_quarter'] [u'60 ']
Price: [u'price_half_ounce'] [u'90 ']
Price: [u'price_ounce'] [u'165 ']
--- Category: [u'Indica']
[u'Purple Kush - Member Pricing']
Price: [u'price_gram'] [u'9 ']
Price: [u'price_eighth'] [u'30 ']
Price: [u'price_quarter'] [u'60 ']
Price: [u'price_half_ounce'] [u'90 ']
Price: [u'price_ounce'] [u'165 ']
...
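
Since the goal in the question is to keep items categorized, here is a minimal sketch of grouping by the data-category-name attribute. To keep it runnable without a network request it uses the standard library's ElementTree (which supports only a subset of XPath) rather than Scrapy selectors, and it parses a trimmed, made-up sample: the "Edible" entry, the "Brownie" name, and the "price_unit" key are assumptions for illustration, not values taken from the site. In a spider, the same loop structure would apply to response.css('div.menu_item').

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Trimmed, hypothetical sample modelled on the menu_item block above.
# Only the Indica entry mirrors the real page; the Edible entry is invented.
SAMPLE = """
<div id="menu">
  <div class="menu_item" data-category-name="Indica">
    <div class="name"><span class="item_name">Banana OG - Member Pricing</span></div>
    <div class="prices">
      <div class="price-container" data-price-name="price_gram"><span class="price">9 </span></div>
      <div class="price-container" data-price-name="price_eighth"><span class="price">30 </span></div>
    </div>
  </div>
  <div class="menu_item" data-category-name="Edible">
    <div class="name"><span class="item_name">Brownie</span></div>
    <div class="prices">
      <div class="price-container" data-price-name="price_unit"><span class="price">10 </span></div>
    </div>
  </div>
</div>
"""


def group_by_category(markup):
    """Return {category: [{"name": ..., "prices": {...}}, ...]}."""
    root = ET.fromstring(markup)
    menu = defaultdict(list)
    for item in root.iterfind(".//div[@class='menu_item']"):
        # The item name sits a few levels below the menu_item div.
        name = item.findtext(".//span[@class='item_name']", default="").strip()
        prices = {}
        # Every price container carries its own data-price-name key.
        for box in item.iterfind(".//div[@data-price-name]"):
            price = box.findtext("span[@class='price']", default="").strip()
            prices[box.get("data-price-name")] = price
        menu[item.get("data-category-name")].append({"name": name, "prices": prices})
    return dict(menu)


if __name__ == "__main__":
    for category, items in group_by_category(SAMPLE).items():
        print(category, items)
```

Prices are kept as stripped strings here; whether to convert them to numbers depends on how consistent the real data is. From a structure like this, filling fields such as Sativa_item_price_gram is a dictionary lookup per category.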
Second answer

You're not getting any results because between div[@id="menu_item_category_4"] and span[@class="item_name"] you have only /, which means the span has to be a direct child of the div. Use // between them instead, so that the span can be any descendant of the div.

Looking at the DOM tree in Chrome, I see about six levels of div descendants between div[@id="menu_item_category_1"] and span[@class="item_name"].
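
To illustrate the difference between the two steps, here is a small sketch using the standard library's ElementTree, whose limited XPath support treats a single-slash step and a double-slash step the same way Scrapy selectors do. The nested markup and the "Example Hybrid" name are made up for the demonstration; only the id and class names come from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical container with the span nested several divs deep.
SAMPLE = """
<div id="menu_item_category_4">
  <div class="menu_item">
    <div class="row">
      <div class="name"><span class="item_name">Example Hybrid</span></div>
    </div>
  </div>
</div>
"""

container = ET.fromstring(SAMPLE)

# Single-slash step: only spans that are direct children of the container.
direct = container.findall("./span[@class='item_name']")

# Double-slash step: spans at any depth below the container.
nested = container.findall(".//span[@class='item_name']")

print(len(direct), [span.text for span in nested])
# prints: 0 ['Example Hybrid']
```

The Scrapy equivalent would be response.xpath('//div[@id="menu_item_category_4"]//span[@class="item_name"]/text()').extract(). This only helps if the category container is actually present in the HTML that Scrapy downloads, which is worth checking with view-source since Chrome's DOM tree reflects the page after scripts have run.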