I am attempting to build XPath expressions for data items I would like to extract from several hundred pages of a site that are all formatted the same way. An example page is https://weedmaps.com/dispensaries/cannabicare
As can be seen, the page has headings, and within those headings are rows of item names and prices. I am trying to extract the sections, the item names, and the item prices (whether per gram, eighth, or ounce, or, for edibles, the price per unit) and keep them all categorized. Some example Scrapy item fields are the following:
Sativa_Item_Name=scrapy.Field()
Sativa_item_price_gram=scrapy.Field()
Sativa_item_price_eighth=scrapy.Field()
Sativa_item_price_quarter=scrapy.Field()
Edible_Item_Name=scrapy.Field()
Edible_item_Price_Each=scrapy.Field()
And so on and so forth. I am able to extract all item names and all price/gram with xpaths such as the following:
response.xpath('.//div/span[@class="item_name"]/text()').extract()
response.xpath('//div[@data-price-name="price_gram"]/span/text()').extract()
What I can't figure out is how to extract only the items within a given heading container, for example just the price per gram for items in the Hybrid category, or the price each and the item name for items in the Edible category.
The categories are separated by containers with ids such as id="menu_item_category_4", but when I do something like:
response.xpath('//div[@id="menu_item_category_4"]/span[@class="item_name"]/text()').extract()
it yields empty brackets and no results. Any guidance on this would be beyond appreciated. Thank you so much for taking the time to look at this!
The thing is that what you see in your browser is the page after JavaScript (presumably Angular) has reformatted it, so it does not match the HTML source that Scrapy actually receives.
If you run the HTML source through an HTML source beautifier and search for
<span class="item_name">
you'll see a pattern of repeating blocks.
This is the HTML you'll need to work on.
And you could extract the data using something like:
which outputs:
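As a minimal sketch of the per-category grouping asked about in the question, the idea is to select each category container first and then run relative queries inside it. The markup below is an assumption made up for illustration: only the id pattern menu_item_category_N, the item_name class and the data-price-name="price_gram" attribute come from the question, while the menu_item row wrapper, the other price names and all values are hypothetical. It uses the standard library's ElementTree so it runs without Scrapy installed.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup. The ids, the "item_name" class and "price_gram" come
# from the question; the "menu_item" row wrapper, the other price names and
# all values are invented for illustration only.
SAMPLE = """
<div id="menu">
  <div id="menu_item_category_4">
    <div class="menu_item">
      <span class="item_name">Blue Dream</span>
      <div data-price-name="price_gram"><span>10</span></div>
      <div data-price-name="price_eighth"><span>30</span></div>
    </div>
    <div class="menu_item">
      <span class="item_name">OG Kush</span>
      <div data-price-name="price_gram"><span>12</span></div>
    </div>
  </div>
  <div id="menu_item_category_7">
    <div class="menu_item">
      <span class="item_name">Brownie</span>
      <div data-price-name="price_unit"><span>8</span></div>
    </div>
  </div>
</div>
"""


def items_by_category(markup):
    """Group item names and prices under the id of their category container."""
    root = ET.fromstring(markup)
    result = {}
    for div in root.iter("div"):
        category_id = div.get("id", "")
        if not category_id.startswith("menu_item_category_"):
            continue
        rows = []
        # Relative queries: only rows inside this category container match.
        for row in div.findall(".//div[@class='menu_item']"):
            name = row.find(".//span[@class='item_name']").text
            prices = {
                price.get("data-price-name"): price.find("span").text
                for price in row.findall(".//div[@data-price-name]")
            }
            rows.append({"name": name, "prices": prices})
        result[category_id] = rows
    return result


if __name__ == "__main__":
    for category, rows in items_by_category(SAMPLE).items():
        print(category, rows)
```

In Scrapy the same shape would be a loop over response.xpath('//div[starts-with(@id, "menu_item_category_")]') with relative calls such as category.xpath('.//span[@class="item_name"]/text()').extract() inside it. Note that a single slash, as in the question's div[@id="menu_item_category_4"]/span, only matches direct children, so it returns nothing when the span is nested deeper; .//span matches at any depth within the container.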