Trouble downloading images with Scrapy - works sometimes

My spider code has worked well so far, but now that I am running a batch of these spiders, Scrapy downloads the images for some spiders and nothing for the rest. Everything else works. The spiders are identical except for their start_urls. Any help is appreciated!

Here's my pipelines.py:

from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request

class DmozPipeline(object):
    def process_item(self, item, spider):
        return item

class MyImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Request every product image, then every nlabel image.
        for image_url in item['image_urls']:
            yield Request(image_url)

        for nlabel in item['nlabel']:
            yield Request(nlabel)

        print(item['image_urls'])

    def item_completed(self, results, item, info):
        # Keep only the paths of the downloads that succeeded.
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item

settings.py:

BOT_NAME = 'dmoz2'
BOT_VERSION = '1.0'

SPIDER_MODULES = ['dmoz2.spiders']
NEWSPIDER_MODULE = 'dmoz2.spiders'
DEFAULT_ITEM_CLASS = 'dmoz2.items.DmozItem'
ITEM_PIPELINES = ['dmoz2.pipelines.MyImagesPipeline']
IMAGES_STORE = '/ps/dmoz2/images'
IMAGES_THUMBS = {
    # letting height be variable
    # 'small': ('', 120),
    'small': (120, ''),
    # 'big': ('', 240),
    'big': (300, ''),
}


USER_AGENT = '%s/%s' % (BOT_NAME, BOT_VERSION)

items.py:

from scrapy.item import Item, Field
from scrapy.utils.python import unicode_to_str

def u_to_str(text):
    # Return the converted string; the original discarded the result.
    return unicode_to_str(text, 'latin-1', 'ignore')


class DmozItem(Item):
   category_ids = Field()
   ....
   image_urls = Field()
   image_paths = Field()

   pass

myspider.py:

from scrapy.spider import BaseSpider
from scrapy.spider import Spider
from scrapy.selector import HtmlXPathSelector
from scrapy import Selector
from scrapy.utils.url import urljoin_rfc
from scrapy.utils.response import get_base_url
from dmoz2.items import DmozItem

class DmozSpider(Spider):
    name = "fritos_jun2015"
    allowed_domains = ["walmart.com"]
    start_urls = [
        "http://www.walmart.com/ip/Fritos-Bar-B-Q-Flavored-Corn-Chips-9.75- oz/36915853",
        "http://www.walmart.com/ip/Fritos-Corn-Chips-1-oz-6-count/10900088",
    ]

    def parse(self, response):
        hxs = Selector(response)
        sites = hxs.xpath('/html/body/div[1]/section/section[4]/div[2]')
        items = []
        for site in sites:
            item = DmozItem()
            item['category_ids'] = ''
            .....
            item['image_urls'] = site.xpath('div[1]/div[3]/div[1]/div/div/div[2]/div/div/div[1]/div/div/img[2]/@src').extract()
            items.append(item)
        return items

I would really like to know why the same spider fetches images sometimes and not at other times. All the spiders are identical except for the start_urls, which all come from the same allowed domain. The image URLs are all absolute, and they are correct.
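
One check that might help narrow this down is to validate the URLs before they reach the pipeline. The sketch below is only an illustration: find_bad_urls is a hypothetical helper, not part of Scrapy, and it assumes the URLs are plain strings like those in start_urls and item['image_urls']. Note that the first start URL, as pasted above, contains a space ("9.75- oz"), which may simply be a paste artifact.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2, which the code above appears to target


def find_bad_urls(urls):
    """Return URLs that are empty, contain whitespace, or are not absolute http(s)."""
    bad = []
    for url in urls:
        parsed = urlparse(url)
        has_space = any(ch.isspace() for ch in url)
        is_absolute = parsed.scheme in ('http', 'https') and bool(parsed.netloc)
        if not url or has_space or not is_absolute:
            bad.append(url)
    return bad
```

Calling it on item['image_urls'] inside parse() and logging the result would show whether a spider that downloads nothing is extracting an empty or malformed list in the first place.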

Thanks in advance. -TM

There is 1 best solution below

When screen scraping, a common problem is that the server cuts the connection because you are requesting pages too often. Sites do this to keep scrapers from inadvertently DDoSing them and to keep their costs from going too high when someone hits the site every millisecond.
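
To check whether refused or dropped connections are really the cause, it may help to look at what the image pipeline received. This is a minimal sketch, assuming results is the list of (ok, value) pairs that item_completed() unpacks in the question's pipeline, with a 'path' key in value on success. The names split_results and diagnose are hypothetical helpers, not Scrapy API.

```python
def split_results(results):
    """Separate successful downloads from failed ones.

    Assumes `results` is a list of (ok, value) pairs, as unpacked by
    item_completed() in the question's pipeline.
    """
    paths = [value['path'] for ok, value in results if ok]
    failures = [value for ok, value in results if not ok]
    return paths, failures


def diagnose(image_urls, results):
    """Return a short label describing what happened to one item's images."""
    paths, failures = split_results(results)
    if not image_urls:
        # Nothing was requested, so the server cannot be the cause.
        return 'no URLs extracted (check the XPath for this page)'
    if failures:
        return '%d of %d downloads failed (possible blocking)' % (
            len(failures), len(results))
    return 'ok: %d images stored' % len(paths)
```

Logging diagnose(item['image_urls'], results) at the top of item_completed() would distinguish an item whose XPath matched nothing from one whose downloads were rejected, and only the second case points to the server.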

Try adding a

sleep()

call between requests to the Walmart pages. That way the server is less likely to block you.
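
In Scrapy, a literal time.sleep() inside a callback would block the whole event loop, so the delay is usually configured through settings instead. The fragment below is a sketch of additions to settings.py. The setting names are standard Scrapy settings, but the values are illustrative assumptions and are not tuned for walmart.com. AutoThrottle is only available in Scrapy versions that ship that extension.

```python
# settings.py additions (values are illustrative assumptions)

DOWNLOAD_DELAY = 2                   # seconds to wait between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True      # wait between 0.5x and 1.5x DOWNLOAD_DELAY
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # at most one request in flight per domain

AUTOTHROTTLE_ENABLED = True          # adapt the delay to the server's response times
AUTOTHROTTLE_START_DELAY = 2         # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 30          # upper bound on the delay in seconds
```

If images start downloading consistently with these values, the delay can be lowered step by step until failures reappear.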