In my items.py:
from scrapy import Item, Field

class NewAdsItem(Item):
    AdId  = Field()
    DateR = Field()
    AdURL = Field()
In my pipelines.py:
import sqlite3
from scrapy.conf import settings

con = None

class DbPipeline(object):

    def __init__(self):
        self.setupDBCon()
        self.createTables()

    def setupDBCon(self):
        # This is NOT OK!
        # I want to get the items already HERE!
        dbfile = settings.get('SQLITE_FILE')
        self.con = sqlite3.connect(dbfile)
        self.cur = self.con.cursor()

    def createTables(self):
        # OR optionally HERE.
        self.createDbTable()
    ...

    def process_item(self, item, spider):
        self.storeInDb(item)
        return item

    def storeInDb(self, item):
        # This is OK, I CAN get the items in here, using:
        # item.keys() and/or item.values()
        sql = "INSERT INTO {0} ({1}) VALUES ({2})".format(
            self.dbtable,
            ','.join(item.keys()),
            ','.join(['?'] * len(item.keys()))
        )
    ...
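(Side note: from scrapy.conf import settings is deprecated; the documented way to read settings in a pipeline is the from_crawler classmethod. A minimal sketch of that pattern, reusing the SQLITE_FILE setting from above:)

import sqlite3

class DbPipeline(object):
    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this classmethod (if present) to build the pipeline,
        # handing over the crawler and, through it, the project settings.
        return cls(sqlite_file=crawler.settings.get('SQLITE_FILE'))

    def __init__(self, sqlite_file):
        self.con = sqlite3.connect(sqlite_file)
        self.cur = self.con.cursor()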
How can I get the item field names (like "AdId", etc.) from items.py before process_item() (in pipelines.py) is executed?
I use scrapy runspider myspider.py for execution.
I already tried to add "item" and/or "spider" arguments, as in def setupDBCon(self, item), but that didn't work, and resulted in:

TypeError: setupDBCon() missing 1 required positional argument: 'item'
UPDATE: 2018-10-08
Result (A):
Partially following the solution from @granitosaurus, I found that I can get the item keys as a list by:

- Adding (a): from adbot.items import NewAdsItem to my main spider code.
- Adding (b): ikeys = NewAdsItem.fields.keys() within the spider class.
- Then accessing the keys from my pipelines.py via:
def open_spider(self, spider):
    self.ikeys = list(spider.ikeys)
    print("Keys in pipelines: \t%s" % ",".join(self.ikeys))
    #self.createDbTable(ikeys)
However, there were 2 problems with this method:

1. I was not able to pass the ikeys list into createDbTable(). (I kept getting errors about missing arguments here and there.)
2. The ikeys list (as retrieved) was re-arranged and did not keep the order of the fields as they appear in items.py, which partially defeated the purpose. I still don't understand why they come out of order, when all the docs say that Python 3 should preserve the order of dicts, lists, etc., while at the same time, when using process_item() and getting the keys via item.keys(), their order remains intact.
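For what it's worth, the alphabetical order is most likely not a Python 3 dict issue: Scrapy's ItemMeta metaclass collects the Field attributes with dir(), and dir() returns names sorted alphabetically, so Item.fields loses the declaration order (this is my reading of scrapy/item.py; verify against your Scrapy version). A sketch that works around both problems, where the explicit COLUMN_ORDER list is a hypothetical helper of my own, not part of the Scrapy API:

# items.py -- keep an explicit, ordered list of the column names,
# since NewAdsItem.fields.keys() may come back alphabetically sorted.
COLUMN_ORDER = ["AdId", "DateR", "AdURL"]  # hypothetical helper

# pipelines.py -- pass the keys into createDbTable() as a plain argument.
from adbot.items import COLUMN_ORDER

class DbPipeline(object):
    def open_spider(self, spider):
        self.createDbTable(COLUMN_ORDER)

    def createDbTable(self, ikeys):
        # ikeys arrives as a regular list, in declaration order.
        print("Keys in createDbTable: \t%s" % ",".join(ikeys))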
Result (B):
At the end of the day, it turned out too laborious and complicated to fix (A), so I just imported the relevant items.py class into my pipelines.py and used the item key list as a global variable, like this:
from adbot.items import NewAdsItem  # at the top of pipelines.py

def createDbTable(self):
    self.ikeys = NewAdsItem.fields.keys()
    print("Keys in createDbTable: \t%s" % ",".join(self.ikeys))
    ...
In this case I just decided to accept that the retrieved list seems to be alphabetically sorted, and worked around the issue by simply changing the key names. (Cheating!)
This is disappointing, because the code is ugly and contorted. Any better suggestions would be much appreciated.
Scrapy pipelines have 3 connected methods: open_spider(self, spider), close_spider(self, spider) and process_item(self, item, spider); see https://doc.scrapy.org/en/latest/topics/item-pipeline.html. So you can only get access to the item in the process_item method. If you want to get the item class, however, you can attach it to the spider class:
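For example (a minimal sketch; the spider name and module path are assumptions based on the question):

# myspider.py
import scrapy
from adbot.items import NewAdsItem  # path taken from the question

class NewAdsSpider(scrapy.Spider):  # hypothetical spider name
    name = "newads"
    item_cls = NewAdsItem  # expose the Item class on the spider

# pipelines.py
class DbPipeline(object):
    def open_spider(self, spider):
        # open_spider() runs before any process_item() call,
        # so the field names are available in time to create the table.
        self.ikeys = list(spider.item_cls.fields.keys())
        self.createDbTable(self.ikeys)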
Alternatively, you can lazy-load it during the process_item method itself:
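Something along these lines (a sketch; createDbTable() is assumed to take the key list as an argument):

class DbPipeline(object):
    def __init__(self):
        self.table_created = False

    def process_item(self, item, spider):
        if not self.table_created:
            # First item seen: create the table from this item's fields.
            self.createDbTable(list(item.fields.keys()))
            self.table_created = True
        self.storeInDb(item)
        return item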