Can Scrapy check whether only the next sibling has an expected tag?


Let me first post part of the HTML I want to scrape:

<div id="hello">
  <p>abc</p>
  <center><img src="image_url"></center>
  <p align="center" style="text-align: center;"><b>def</b></p>
  <center><img src="image_url"></center>
  <p align="center" style="text-align: center;"><b>def</b></p>
  <p>abc</p>
  <p align="center" style="text-align: center;"><b>def</b></p>
  <center><img src="image_url"></center>
  <p align="center" style="text-align: center;"><b>def</b></p>
  <p>abc</p>
  <center><img src="image_url"></center>
</div>

I am trying to scrape, in order, the text in each p and the src of each image (the image_url). The catch is that the HTML shown above is not static: every page has a different structure, so sometimes there are several p tags before a center tag that contains an img.

Since the p and center tags are arranged differently on each page, I was thinking of getting all the p tags, for example with response.css('#hello p'), and looping through them to get the text. While getting the text from the current p tag, I would also check whether its next sibling is a center tag and, if so, get the src and append it.

I found something like that: p.xpath('following-sibling::center[1]/img/@src').get(), where p is each paragraph during the iteration.

But I figured out that this does not work: if I have, say, 4 p tags before a center, I actually get the img src 4 times, because p.xpath('following-sibling::center[1]/img/@src').get() does not look only at the next sibling. It goes through all the following siblings and returns the first center tag it finds.

I tried googling, but I did not find anything about checking only whether the next sibling is a particular tag. Does anyone have an idea how I can get this to work so I can save the data in sequence?

Hopefully my explanation makes sense.

Thanks in advance for any help and suggestions


There is 1 best solution below


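Try the XPath below to get the required output. following-sibling::*[1] selects only the immediately following sibling element, whatever its tag, and [name()="center"] keeps it only if it is a center tag: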

p.xpath('following-sibling::*[1][name()="center"]/img/@src')