How to get/import the Scrapy item list from items.py to pipelines.py?


In my items.py:

class NewAdsItem(Item):
    AdId        = Field()
    DateR       = Field()
    AdURL       = Field()

In my pipelines.py:

import sqlite3
from scrapy.conf import settings

con = None
class DbPipeline(object):

    def __init__(self):
        self.setupDBCon()
        self.createTables()

    def setupDBCon(self):
        # This is NOT OK!
        # I want to get the items already HERE!
        dbfile = settings.get('SQLITE_FILE')
        self.con = sqlite3.connect(dbfile)
        self.cur = self.con.cursor()

    def createTables(self):
        # OR optionally HERE.
        self.createDbTable()

    ...

    def process_item(self, item, spider):
        self.storeInDb(item)
        return item

    def storeInDb(self, item):
        # This is OK, I CAN get the items in here, using: 
        # item.keys() and/or item.values()
        sql = "INSERT INTO {0} ({1}) VALUES ({2})".format(self.dbtable, ','.join(item.keys()), ','.join(['?'] * len(item.keys())) )
        ...

How can I get the list of item field names (like "AdId", etc.) from items.py, before process_item() in pipelines.py is executed?


I use scrapy runspider myspider.py for execution.

I already tried adding "item" and/or "spider" arguments, as in def setupDBCon(self, item), but that didn't work and resulted in: TypeError: setupDBCon() missing 1 required positional argument: 'item'. (Which makes sense in hindsight: __init__ calls setupDBCon() without arguments, and Scrapy instantiates the pipeline before any item exists.)


UPDATE: 2018-10-08

Result (A):

Partially following the solution from @granitosaurus, I found that I can get the item keys as a list by:

  1. Adding (a): from adbot.items import NewAdsItem to my main spider code.
  2. Adding (b): ikeys = NewAdsItem.fields.keys() inside that spider class (see the combined sketch after this list).
  3. I could then access the keys from my pipelines.py via:
    def open_spider(self, spider):
        self.ikeys = list(spider.ikeys)
        print("Keys in pipelines: \t%s" % ",".join(self.ikeys) )
        #self.createDbTable(ikeys)
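
For reference, a minimal combined sketch of steps 1 and 2 on the spider side (the spider class name is hypothetical; adbot comes from the question's project layout):

from scrapy import Spider
from adbot.items import NewAdsItem   # step (a)

class NewAdsSpider(Spider):
    name = 'myspider'
    # step (b): expose the item's field names on the spider class,
    # so the pipeline can read them via its `spider` argument
    ikeys = NewAdsItem.fields.keys()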

However, there were 2 problems with this method:

  1. I was not able to get the ikeys list into createDbTable(). (I kept getting errors about missing arguments here and there.)

  2. The ikeys list (as retrieved) was re-arranged and did not keep the order of the fields as they appear in items.py, which partially defeated the purpose. I still don't understand why they come out of order, when all the docs say that Python 3 preserves the order of dicts and lists, and when item.keys() inside process_item() does keep the original order intact. (A likely cause: Scrapy's Item metaclass collects the fields by iterating over dir() of the item class, and dir() returns names alphabetically sorted, regardless of dict ordering. A workaround sketch follows below.)
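
Assuming one is free to edit items.py, a hedged workaround (not from the original post) is to keep an explicit column tuple next to the item class and treat it as the single source of truth for column order; the NEWADS_COLUMNS name is made up:

# items.py -- sketch; NEWADS_COLUMNS is a hypothetical helper, not Scrapy API
from scrapy import Item, Field

class NewAdsItem(Item):
    AdId  = Field()
    DateR = Field()
    AdURL = Field()

# Explicit column order for the CREATE TABLE and INSERT statements,
# immune to however Item.fields happens to be ordered:
NEWADS_COLUMNS = ("AdId", "DateR", "AdURL")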

Result (B):

In the end, fixing (A) turned out to be too laborious and complicated, so I just imported the relevant items.py class into my pipelines.py and used the item's field list as a global, like this:

# at the top of pipelines.py: from adbot.items import NewAdsItem
def createDbTable(self):
    self.ikeys = NewAdsItem.fields.keys()
    print("Keys in createDbTable: \t%s" % ",".join(self.ikeys))
    ...
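
For context, a fuller sketch of this Result (B) pipeline (the ads table name and TEXT column type are assumptions; spider.settings is used in place of the deprecated scrapy.conf import):

# pipelines.py -- sketch only
import sqlite3
from adbot.items import NewAdsItem

class DbPipeline(object):
    dbtable = 'ads'  # hypothetical table name

    def open_spider(self, spider):
        # the item's field names become the table's columns
        self.ikeys = list(NewAdsItem.fields.keys())
        self.con = sqlite3.connect(spider.settings.get('SQLITE_FILE'))
        self.cur = self.con.cursor()
        cols = ', '.join('{} TEXT'.format(k) for k in self.ikeys)
        self.cur.execute('CREATE TABLE IF NOT EXISTS {} ({})'.format(self.dbtable, cols))
        self.con.commit()

    def process_item(self, item, spider):
        sql = "INSERT INTO {0} ({1}) VALUES ({2})".format(
            self.dbtable, ','.join(item.keys()), ','.join(['?'] * len(item)))
        self.cur.execute(sql, list(item.values()))
        self.con.commit()
        return item

    def close_spider(self, spider):
        self.con.close()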

In this case I just decided to accept that the retrieved list seems to be alphabetically sorted, and worked around the issue by just changing the key names. (Cheating!)

This is disappointing, because the code is ugly and contorted. Any better suggestions would be much appreciated.

There is 1 answer below.
Scrapy pipelines have 3 connected methods:

process_item(self, item, spider)
This method is called for every item pipeline component. process_item() must either: return a dict with data, return an Item (or any descendant class) object, return a Twisted Deferred or raise DropItem exception. Dropped items are no longer processed by further pipeline components.

open_spider(self, spider)
This method is called when the spider is opened.

close_spider(self, spider)
This method is called when the spider is closed.

https://doc.scrapy.org/en/latest/topics/item-pipeline.html
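
As a quick illustration of where each hook fires (a sketch; the class name is made up):

class ExamplePipeline(object):
    def open_spider(self, spider):
        # runs once when the spider starts -- do setup (e.g. open a DB) here
        pass

    def process_item(self, item, spider):
        # runs once per scraped item
        return item

    def close_spider(self, spider):
        # runs once when the spider finishes -- do teardown here
        pass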

So you only get access to an item inside the process_item() method.

If you want access to the item class, however, you can attach it to the spider class:

class MySpider(Spider):
    # expose the item class so pipelines can reach it via the spider
    item_cls = MyItem

class MyPipeline:
    def open_spider(self, spider):
        fields = spider.item_cls.fields
        # fields is a dictionary of field name -> Field metadata
        self.setup_table(fields)
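
Applied to the classes from the question, that would look something like this (the spider name is hypothetical):

from scrapy import Spider
from adbot.items import NewAdsItem

class NewAdsSpider(Spider):
    name = 'myspider'
    item_cls = NewAdsItem

class DbPipeline(object):
    def open_spider(self, spider):
        # NewAdsItem's field names, available before any item is scraped
        self.ikeys = list(spider.item_cls.fields.keys())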

Alternatively, you can lazy-load it during the process_item() method itself:

class MyPipeline:
    item = None

    def process_item(self, item, spider):
        # on the first item seen, remember it and create the table
        if not self.item:
            self.item = item
            self.setup_table(item)
        return item
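
Either way, note that the pipeline only runs if it is enabled in the project's settings.py; a hedged example (the module path and priority value are assumptions):

ITEM_PIPELINES = {
    'adbot.pipelines.DbPipeline': 300,
}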