2 RabbitMQ workers and 2 Scrapyd daemons running on 2 local Ubuntu instances, in which one of the RabbitMQ workers is not working

I am currently working on building a "Scrapy spiders control panel", for which I am testing the existing solution available at [Distributed Multi-User Scrapy System with a Web UI](https://github.com/aaldaber/Distributed-Multi-User-Scrapy-System-with-a-Web-UI).

I am trying to run this on my local Ubuntu dev machines but am having issues with the scrapyd daemon. One of the workers, linkgenerator, is working, but the scraper, worker1, is not. I cannot figure out why scrapyd won't run on the other local instance.

Background Information about the configuration.

The application comes bundled with Django, Scrapy, a pipeline for MongoDB (for saving the scraped items), and a Scrapy scheduler for RabbitMQ (for distributing the links among workers). I have 2 local Ubuntu instances: Django, MongoDB, a Scrapyd daemon, and the RabbitMQ server run on Instance1, and another Scrapyd daemon runs on Instance2. RabbitMQ workers:

  • linkgenerator
  • worker1

IP Configurations for Instances:

  • IP For local Ubuntu Instance1: 192.168.0.101
  • IP for local Ubuntu Instance2: 192.168.0.106

List of tools used:

  • MongoDB server
  • RabbitMQ server
  • Scrapy Scrapyd API
  • One RabbitMQ link generator worker (worker name: linkgenerator) with Scrapy installed and the scrapyd daemon running on local Ubuntu Instance1: 192.168.0.101
  • One RabbitMQ scraper worker (worker name: worker1) with Scrapy installed and the scrapyd daemon running on local Ubuntu Instance2: 192.168.0.106

Instance1: 192.168.0.101

"Instance1" on which Django, RabbitMQ, scrapyd daemon servers running -- IP : 192.168.0.101

Instance2: 192.168.0.106

Scrapy is installed on Instance2 and the scrapyd daemon is running there.

Scrapy Control Panel UI Snapshot:

From the snapshot, the control panel overview can be seen: there are two workers; linkgenerator worked successfully, but worker1 did not. The logs are given at the end of the post.

RabbitMQ status info

The linkgenerator worker can successfully push messages to the RabbitMQ queue. The linkgenerator spider generates start_urls for the scraper spider, and these are consumed by the scraper (worker1), which is not working; please see the logs for worker1 at the end of the post.
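
For reference, a rough sketch of that publish/consume flow with pika is shown below. The queue name used here is hypothetical (the bundled RabbitMQ scheduler derives the real queue name internally), and the host/credentials are the ones from the settings further below.

import pika

# Connection details per the RabbitMQ server on Instance1 (192.168.0.101:5672)
# and the guest/guest credentials from the settings shown below.
params = pika.ConnectionParameters(
    host='192.168.0.101', port=5672,
    credentials=pika.PlainCredentials('guest', 'guest'))

connection = pika.BlockingConnection(params)
channel = connection.channel()

# linkgenerator side: push a start URL onto the queue (queue name hypothetical).
channel.queue_declare(queue='tester2_fda_trial20:requests')
channel.basic_publish(exchange='',
                      routing_key='tester2_fda_trial20:requests',
                      body='https://example.com/page/1')

# worker1 (scraper) side: pull one message from the same queue and ack it.
method, properties, body = channel.basic_get(queue='tester2_fda_trial20:requests')
if method is not None:
    print('consumed:', body)
    channel.basic_ack(method.delivery_tag)
connection.close()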

RabbitMQ settings

The file below contains the settings for MongoDB and RabbitMQ:

SCHEDULER = ".rabbitmq.scheduler.Scheduler"
SCHEDULER_PERSIST = True
RABBITMQ_HOST = 'ScrapyDevU79'
RABBITMQ_PORT = 5672
RABBITMQ_USERNAME = 'guest'
RABBITMQ_PASSWORD = 'guest'

MONGODB_PUBLIC_ADDRESS = 'OneScience:27017'  # This will be shown on the web interface, but won't be used for connecting to DB
MONGODB_URI = 'localhost:27017'  # Actual uri to connect to DB
MONGODB_USER = 'tariq'
MONGODB_PASSWORD = 'toor'
MONGODB_SHARDED = True
MONGODB_BUFFER_DATA = 100

# Set your link generator worker address here
LINK_GENERATOR = 'http://192.168.0.101:6800'
SCRAPERS = ['http://192.168.0.106:6800']
LINUX_USER_CREATION_ENABLED = False  # Set this to True if you want a linux user account
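
As a quick sanity check (not part of the bundled code), the scrapyd endpoints configured in LINK_GENERATOR and SCRAPERS above can be probed with python-requests; daemonstatus.json is one of the services both scrapyd.conf files below register:

import requests

# Endpoints copied from the settings above.
LINK_GENERATOR = 'http://192.168.0.101:6800'
SCRAPERS = ['http://192.168.0.106:6800']

for endpoint in [LINK_GENERATOR] + SCRAPERS:
    try:
        resp = requests.get(endpoint + '/daemonstatus.json', timeout=5)
        print(endpoint, resp.status_code, resp.json())
    except requests.exceptions.RequestException as exc:
        print(endpoint, 'unreachable:', exc)
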
linkgenerator scrapy.cfg settings:
[settings]
default = tester2_fda_trial20.settings
[deploy:linkgenerator]
url = http://192.168.0.101:6800
project = tester2_fda_trial20

scraper scrapy.cfg settings:
[settings]
default = tester2_fda_trial20.settings

[deploy:worker1]
url = http://192.168.0.101:6800
project = tester2_fda_trial20

scrapyd.conf file settings for Instance1 (192.168.0.101):

cat /etc/scrapyd/scrapyd.conf

[scrapyd]
eggs_dir   = /var/lib/scrapyd/eggs
dbs_dir    = /var/lib/scrapyd/dbs
items_dir  = /var/lib/scrapyd/items
logs_dir   = /var/log/scrapyd

max_proc    = 0
max_proc_per_cpu = 4
finished_to_keep = 100
poll_interval = 5.0
bind_address = 0.0.0.0
#bind_address = 127.0.0.1
http_port   = 6800
debug       = on
runner      = scrapyd.runner
application = scrapyd.app.application
launcher    = scrapyd.launcher.Launcher
webroot     = scrapyd.website.Root

[services]
schedule.json     = scrapyd.webservice.Schedule
cancel.json       = scrapyd.webservice.Cancel
addversion.json   = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json  = scrapyd.webservice.ListSpiders
delproject.json   = scrapyd.webservice.DeleteProject
delversion.json   = scrapyd.webservice.DeleteVersion
listjobs.json     = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus

scrapyd.conf file settings for Instance2 (192.168.0.106):

cat /etc/scrapyd/scrapyd.conf

[scrapyd]
eggs_dir   = /var/lib/scrapyd/eggs
dbs_dir    = /var/lib/scrapyd/dbs
items_dir  = /var/lib/scrapyd/items
logs_dir   = /var/log/scrapyd

max_proc    = 0
max_proc_per_cpu = 4
finished_to_keep = 100
poll_interval = 5.0
bind_address = 0.0.0.0
#bind_address = 127.0.0.1
http_port   = 6800
debug       = on
runner      = scrapyd.runner
application = scrapyd.app.application
launcher    = scrapyd.launcher.Launcher
webroot     = scrapyd.website.Root

[services]
schedule.json     = scrapyd.webservice.Schedule
cancel.json       = scrapyd.webservice.Cancel
addversion.json   = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json  = scrapyd.webservice.ListSpiders
delproject.json   = scrapyd.webservice.DeleteProject
delversion.json   = scrapyd.webservice.DeleteVersion
listjobs.json     = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus

RabbitMQ Status

sudo service rabbitmq-server status

[sudo] password for mtaziz:
Status of node rabbit@ScrapyDevU79
[{pid,53715},
{running_applications,
   [{rabbitmq_shovel_management,
        "Management extension for the Shovel plugin","3.6.11"},
    {rabbitmq_shovel,"Data Shovel for RabbitMQ","3.6.11"},
    {rabbitmq_management,"RabbitMQ Management Console","3.6.11"},
    {rabbitmq_web_dispatch,"RabbitMQ Web Dispatcher","3.6.11"},
    {rabbitmq_management_agent,"RabbitMQ Management Agent","3.6.11"},
    {rabbit,"RabbitMQ","3.6.11"},
    {os_mon,"CPO  CXC 138 46","2.2.14"},
    {cowboy,"Small, fast, modular HTTP server.","1.0.4"},
    {ranch,"Socket acceptor pool for TCP protocols.","1.3.0"},
    {ssl,"Erlang/OTP SSL application","5.3.2"},
    {public_key,"Public key infrastructure","0.21"},
    {cowlib,"Support library for manipulating Web protocols.","1.0.2"},
    {crypto,"CRYPTO version 2","3.2"},
    {amqp_client,"RabbitMQ AMQP Client","3.6.11"},
    {rabbit_common,
        "Modules shared by rabbitmq-server and rabbitmq-erlang-client",
        "3.6.11"},
    {inets,"INETS  CXC 138 49","5.9.7"},
    {mnesia,"MNESIA  CXC 138 12","4.11"},
    {compiler,"ERTS  CXC 138 10","4.9.4"},
    {xmerl,"XML parser","1.3.5"},
    {syntax_tools,"Syntax tools","1.6.12"},
    {asn1,"The Erlang ASN1 compiler version 2.0.4","2.0.4"},
    {sasl,"SASL  CXC 138 11","2.3.4"},
    {stdlib,"ERTS  CXC 138 10","1.19.4"},
    {kernel,"ERTS  CXC 138 10","2.16.4"}]},
{os,{unix,linux}},
{erlang_version,
   "Erlang R16B03 (erts-5.10.4) [source] [64-bit] [smp:4:4] [async-threads:64] [kernel-poll:true]\n"},
{memory,
   [{connection_readers,0},
    {connection_writers,0},
    {connection_channels,0},
    {connection_other,6856},
    {queue_procs,145160},
    {queue_slave_procs,0},
    {plugins,1959248},
    {other_proc,22328920},
    {metrics,160112},
    {mgmt_db,655320},
    {mnesia,83952},
    {other_ets,2355800},
    {binary,96920},
    {msg_index,47352},
    {code,27101161},
    {atom,992409},
    {other_system,31074022},
    {total,87007232}]},
{alarms,[]},
{listeners,[{clustering,25672,"::"},{amqp,5672,"::"},{http,15672,"::"}]},
{vm_memory_calculation_strategy,rss},
{vm_memory_high_watermark,0.4},
{vm_memory_limit,3343646720},
{disk_free_limit,50000000},
{disk_free,56257699840},
{file_descriptors,
   [{total_limit,924},{total_used,2},{sockets_limit,829},{sockets_used,0}]},
{processes,[{limit,1048576},{used,351}]},
{run_queue,0},
{uptime,34537},
{kernel,{net_ticktime,60}}]

scrapyd daemon on Instance1 (192.168.0.101) running status:

scrapyd

2017-09-11T06:16:07+0600 [-] Loading /home/mtaziz/.virtualenvs/onescience_dist_env/local/lib/python2.7/site-packages/scrapyd/txapp.py...
2017-09-11T06:16:07+0600 [-] Scrapyd web console available at http://0.0.0.0:6800/
2017-09-11T06:16:07+0600 [-] Loaded.
2017-09-11T06:16:07+0600 [twisted.scripts._twistd_unix.UnixAppLogger#info] twistd 17.5.0 (/home/mtaziz/.virtualenvs/onescience_dist_env/bin/python 2.7.6) starting up.
2017-09-11T06:16:07+0600 [twisted.scripts._twistd_unix.UnixAppLogger#info] reactor class: twisted.internet.epollreactor.EPollReactor.
2017-09-11T06:16:07+0600 [-] Site starting on 6800
2017-09-11T06:16:07+0600 [twisted.web.server.Site#info] Starting factory <twisted.web.server.Site instance at 0x7f5e265c77a0>
2017-09-11T06:16:07+0600 [Launcher] Scrapyd 1.2.0 started: max_proc=16, runner='scrapyd.runner'
2017-09-11T06:16:07+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:16:07 +0000] "GET /listprojects.json HTTP/1.1" 200 98 "-" "python-requests/2.18.4"
2017-09-11T06:16:07+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:16:07 +0000] "GET /listversions.json?project=tester2_fda_trial20 HTTP/1.1" 200 80 "-" "python-requests/2.18.4"
2017-09-11T06:16:07+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:16:07 +0000] "GET /listjobs.json?project=tester2_fda_trial20 HTTP/1.1" 200 92 "-" "python-requests/2.18.4"

scrapyd daemon on Instance2 (192.168.0.106) running status:

scrapyd

2017-09-11T06:09:28+0600 [-] Loading /home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/scrapyd/txapp.py...
2017-09-11T06:09:28+0600 [-] Scrapyd web console available at http://0.0.0.0:6800/
2017-09-11T06:09:28+0600 [-] Loaded.
2017-09-11T06:09:28+0600 [twisted.scripts._twistd_unix.UnixAppLogger#info] twistd 17.5.0 (/home/mtaziz/.virtualenvs/scrapydevenv/bin/python 2.7.6) starting up.
2017-09-11T06:09:28+0600 [twisted.scripts._twistd_unix.UnixAppLogger#info] reactor class: twisted.internet.epollreactor.EPollReactor.
2017-09-11T06:09:28+0600 [-] Site starting on 6800
2017-09-11T06:09:28+0600 [twisted.web.server.Site#info] Starting factory <twisted.web.server.Site instance at 0x7fbe6eaeac20>
2017-09-11T06:09:28+0600 [Launcher] Scrapyd 1.2.0 started: max_proc=16, runner='scrapyd.runner'
2017-09-11T06:09:32+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:09:32 +0000] "GET /listprojects.json HTTP/1.1" 200 98 "-" "python-requests/2.18.4"
2017-09-11T06:09:32+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:09:32 +0000] "GET /listversions.json?project=tester2_fda_trial20 HTTP/1.1" 200 80 "-" "python-requests/2.18.4"
2017-09-11T06:09:32+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:09:32 +0000] "GET /listjobs.json?project=tester2_fda_trial20 HTTP/1.1" 200 92 "-" "python-requests/2.18.4"
2017-09-11T06:09:37+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:09:37 +0000] "GET /listprojects.json HTTP/1.1" 200 98 "-" "python-requests/2.18.4"
2017-09-11T06:09:37+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:09:37 +0000] "GET /listversions.json?project=tester2_fda_trial20 HTTP/1.1" 200 80 "-" "python-requests/2.18.4"

worker1 logs

I updated the code for the RabbitMQ server settings following the suggestion made by @Tarun Lalwani.

The suggestion was to use the RabbitMQ server IP (192.168.0.101:5672) instead of 127.0.0.1:5672. After updating as suggested, I got the new problems shown in the logs below.
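
For reference, that connection change can also be verified on its own with a short standalone pika test run from Instance2; this is only a sketch, using the guest/guest credentials from the settings above:

import pika

credentials = pika.PlainCredentials('guest', 'guest')
params = pika.ConnectionParameters(host='192.168.0.101', port=5672,
                                   credentials=credentials)
try:
    connection = pika.BlockingConnection(params)
    print('connected:', connection.is_open)
    connection.close()
except pika.exceptions.AMQPConnectionError as exc:
    print('connection failed:', exc)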

2017-09-11 15:49:18 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: tester2_fda_trial20)
2017-09-11 15:49:18 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tester2_fda_trial20.spiders', 'ROBOTSTXT_OBEY': True, 'LOG_LEVEL': 'INFO', 'SPIDER_MODULES': ['tester2_fda_trial20.spiders'], 'BOT_NAME': 'tester2_fda_trial20', 'FEED_URI': 'file:///var/lib/scrapyd/items/tester2_fda_trial20/tester2_fda_trial20/79b1123a96d611e79276000c29bad697.jl', 'SCHEDULER': 'tester2_fda_trial20.rabbitmq.scheduler.Scheduler', 'TELNETCONSOLE_ENABLED': False, 'LOG_FILE': '/var/log/scrapyd/tester2_fda_trial20/tester2_fda_trial20/79b1123a96d611e79276000c29bad697.log'}
2017-09-11 15:49:18 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.corestats.CoreStats']
2017-09-11 15:49:18 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-09-11 15:49:18 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-09-11 15:49:18 [scrapy.middleware] INFO: Enabled item pipelines:
['tester2_fda_trial20.pipelines.FdaTrial20Pipeline',
 'tester2_fda_trial20.mongodb.scrapy_mongodb.MongoDBPipeline']
2017-09-11 15:49:18 [scrapy.core.engine] INFO: Spider opened
2017-09-11 15:49:18 [pika.adapters.base_connection] INFO: Connecting to 192.168.0.101:5672
2017-09-11 15:49:18 [pika.adapters.blocking_connection] INFO: Created channel=1
2017-09-11 15:49:18 [scrapy.core.engine] INFO: Closing spider (shutdown)
2017-09-11 15:49:18 [pika.adapters.blocking_connection] INFO: Channel.close(0, Normal Shutdown)
2017-09-11 15:49:18 [pika.channel] INFO: Channel.close(0, Normal Shutdown)
2017-09-11 15:49:18 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method ?.close_spider of <scrapy.extensions.feedexport.FeedExporter object at 0x7f94878b8c50>>
Traceback (most recent call last):
  File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 150, in maybeDeferred
    result = f(*args, **kw)
  File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/pydispatch/robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/scrapy/extensions/feedexport.py", line 201, in close_spider
    slot = self.slot
AttributeError: 'FeedExporter' object has no attribute 'slot'
2017-09-11 15:49:18 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method ?.spider_closed of <Tester2Fda_Trial20Spider 'tester2_fda_trial20' at 0x7f9484f897d0>>
Traceback (most recent call last):
  File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 150, in maybeDeferred
    result = f(*args, **kw)
  File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/pydispatch/robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "/tmp/user/1000/tester2_fda_trial20-10-d4Req9.egg/tester2_fda_trial20/spiders/tester2_fda_trial20.py", line 28, in spider_closed
AttributeError: 'Tester2Fda_Trial20Spider' object has no attribute 'statstask'
2017-09-11 15:49:18 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'finish_reason': 'shutdown',
 'finish_time': datetime.datetime(2017, 9, 11, 9, 49, 18, 159896),
 'log_count/ERROR': 2,
 'log_count/INFO': 10}
2017-09-11 15:49:18 [scrapy.core.engine] INFO: Spider closed (shutdown)
2017-09-11 15:49:18 [twisted] CRITICAL: Unhandled error in Deferred:
2017-09-11 15:49:18 [twisted] CRITICAL: 
Traceback (most recent call last):
  File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/scrapy/crawler.py", line 95, in crawl
    six.reraise(*exc_info)
  File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/scrapy/crawler.py", line 79, in crawl
    yield self.engine.open_spider(self.spider, start_requests)
OperationFailure: command SON([('saslStart', 1), ('mechanism', 'SCRAM-SHA-1'), ('payload', Binary('n,,n=tariq,r=MjY5OTQ0OTYwMjA4', 0)), ('autoAuthorize', 1)]) on namespace admin.$cmd failed: Authentication failed.

MongoDBPipeline

# coding:utf-8

import datetime

from pymongo import errors
from pymongo.mongo_client import MongoClient
from pymongo.mongo_replica_set_client import MongoReplicaSetClient
from pymongo.read_preferences import ReadPreference
from scrapy.exporters import BaseItemExporter
try:
    from urllib.parse import quote
except ImportError:
    from urllib import quote

def not_set(string):
    """ Check if a string is None or ''

    :returns: bool - True if the string is empty
    """
    if string is None:
        return True
    elif string == '':
        return True
    return False


class MongoDBPipeline(BaseItemExporter):
    """ MongoDB pipeline class """
    # Default options
    config = {
        'uri': 'mongodb://localhost:27017',
        'fsync': False,
        'write_concern': 0,
        'database': 'scrapy-mongodb',
        'collection': 'items',
        'replica_set': None,
        'buffer': None,
        'append_timestamp': False,
        'sharded': False
    }

    # Needed for sending acknowledgement signals to RabbitMQ for all persisted items
    queue = None
    acked_signals = []

    # Item buffer
    item_buffer = dict()

    def load_spider(self, spider):
        self.crawler = spider.crawler
        self.settings = spider.settings
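        # The RabbitMQ-backed scheduler exposes its queue object here; keep a
        # reference so persisted items can be acknowledged back to RabbitMQ.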
        self.queue = self.crawler.engine.slot.scheduler.queue

    def open_spider(self, spider):
        self.load_spider(spider)

        # Configure the connection
        self.configure()

        self.spidername = spider.name
        self.config['uri'] = 'mongodb://' + self.config['username'] + ':' + quote(self.config['password']) + '@' + self.config['uri'] + '/admin'
        self.shardedcolls = []

        if self.config['replica_set'] is not None:
            self.connection = MongoReplicaSetClient(
                self.config['uri'],
                replicaSet=self.config['replica_set'],
                w=self.config['write_concern'],
                fsync=self.config['fsync'],
                read_preference=ReadPreference.PRIMARY_PREFERRED)
        else:
            # Connecting to a stand alone MongoDB
            self.connection = MongoClient(
                self.config['uri'],
                fsync=self.config['fsync'],
                read_preference=ReadPreference.PRIMARY)

        # Set up the collection
        self.database = self.connection[spider.name]

        # Autoshard the DB
        if self.config['sharded']:
            db_statuses = self.connection['config']['databases'].find({})
            partitioned = []
            notpartitioned = []
            for status in db_statuses:
                if status['partitioned']:
                    partitioned.append(status['_id'])
                else:
                    notpartitioned.append(status['_id'])
            if spider.name in notpartitioned or spider.name not in partitioned:
                try:
                    self.connection.admin.command('enableSharding', spider.name)
                except errors.OperationFailure:
                    pass
            else:
                collections = self.connection['config']['collections'].find({})
                for coll in collections:
                    if (spider.name + '.') in coll['_id']:
                        if coll['dropped'] is not True:
                            if coll['_id'].index(spider.name + '.') == 0:
                                self.shardedcolls.append(coll['_id'][coll['_id'].index('.') + 1:])

    def configure(self):
        """ Configure the MongoDB connection """

        # Set all regular options
        options = [
            ('uri', 'MONGODB_URI'),
            ('fsync', 'MONGODB_FSYNC'),
            ('write_concern', 'MONGODB_REPLICA_SET_W'),
            ('database', 'MONGODB_DATABASE'),
            ('collection', 'MONGODB_COLLECTION'),
            ('replica_set', 'MONGODB_REPLICA_SET'),
            ('buffer', 'MONGODB_BUFFER_DATA'),
            ('append_timestamp', 'MONGODB_ADD_TIMESTAMP'),
            ('sharded', 'MONGODB_SHARDED'),
            ('username', 'MONGODB_USER'),
            ('password', 'MONGODB_PASSWORD')
        ]

        for key, setting in options:
            if not not_set(self.settings[setting]):
                self.config[key] = self.settings[setting]

    def process_item(self, item, spider):
        """ Process the item and add it to MongoDB

        :type item: Item object
        :param item: The item to put into MongoDB
        :type spider: BaseSpider object
        :param spider: The spider running the queries
        :returns: Item object
        """
        item_name = item.__class__.__name__

        # If we are working with a sharded DB, the collection will also be sharded
        if self.config['sharded']:
            if item_name not in self.shardedcolls:
                try:
                    self.connection.admin.command('shardCollection', '%s.%s' % (self.spidername, item_name), key={'_id': "hashed"})
                    self.shardedcolls.append(item_name)
                except errors.OperationFailure:
                    self.shardedcolls.append(item_name)

        itemtoinsert = dict(self._get_serialized_fields(item))

        if self.config['buffer']:
            if item_name not in self.item_buffer:
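                # Per item type, the buffer holds two slots:
                # [list of pending items, pending item count].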
                self.item_buffer[item_name] = []
                self.item_buffer[item_name].append([])
                self.item_buffer[item_name].append(0)

            self.item_buffer[item_name][1] += 1

            if self.config['append_timestamp']:
                itemtoinsert['scrapy-mongodb'] = {'ts': datetime.datetime.utcnow()}

            self.item_buffer[item_name][0].append(itemtoinsert)

            if self.item_buffer[item_name][1] == self.config['buffer']:
                self.item_buffer[item_name][1] = 0
                self.insert_item(self.item_buffer[item_name][0], spider, item_name)

            return item

        self.insert_item(itemtoinsert, spider, item_name)
        return item

    def close_spider(self, spider):
        """ Method called when the spider is closed

        :type spider: BaseSpider object
        :param spider: The spider running the queries
        :returns: None
        """
        for key in self.item_buffer:
            if self.item_buffer[key][0]:
                self.insert_item(self.item_buffer[key][0], spider, key)

    def insert_item(self, item, spider, item_name):
        """ Process the item and add it to MongoDB

        :type item: (Item object) or [(Item object)]
        :param item: The item(s) to put into MongoDB
        :type spider: BaseSpider object
        :param spider: The spider running the queries
        :returns: Item object
        """
        self.collection = self.database[item_name]

        if not isinstance(item, list):

            if self.config['append_timestamp']:
                item['scrapy-mongodb'] = {'ts': datetime.datetime.utcnow()}

            ack_signal = item['ack_signal']
            item.pop('ack_signal', None)
            self.collection.insert(item, continue_on_error=True)
            if ack_signal not in self.acked_signals:
                self.queue.acknowledge(ack_signal)
                self.acked_signals.append(ack_signal)
        else:
            signals = []
            for eachitem in item:
                signals.append(eachitem['ack_signal'])
                eachitem.pop('ack_signal', None)
            self.collection.insert(item, continue_on_error=True)
            del item[:]
            for ack_signal in signals:
                if ack_signal not in self.acked_signals:
                    self.queue.acknowledge(ack_signal)
                    self.acked_signals.append(ack_signal)
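
Since the worker1 traceback above ends in a MongoDB OperationFailure (Authentication failed) raised while the spider was opening, the URI that open_spider() builds can be reproduced and tested in isolation. A minimal sketch using the MONGODB_* values from the settings above (note that the pipeline runs on the worker, so 'localhost' resolves on that machine):

from pymongo import MongoClient
from pymongo.errors import OperationFailure

try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2

# Values copied from the settings file shown earlier.
MONGODB_URI = 'localhost:27017'
MONGODB_USER = 'tariq'
MONGODB_PASSWORD = 'toor'

# Same URI construction as MongoDBPipeline.open_spider().
uri = 'mongodb://%s:%s@%s/admin' % (MONGODB_USER, quote(MONGODB_PASSWORD), MONGODB_URI)

client = MongoClient(uri)
try:
    print(client.admin.command('ping'))  # forces connection and authentication
except OperationFailure as exc:
    print('authentication failed:', exc)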

To sum up, I believe the problem lies with the scrapyd daemons running on both instances: somehow the scraper (worker1) cannot access them. I could not figure it out, and I did not find any similar use cases on Stack Overflow.

Any help is highly appreciated in this regard. Thank you in advance!
