I have a PostgreSQL table, brands, with columns such as id and name (among others).
My (simplified) code - MySpider.py:
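    import scrapy

    import DB
    from items import Car               # adjust to your project's package path
    from itemsloaders import CarLoader  # adjust to your project's package path


    class MySpider(scrapy.Spider):
        db = DB.connect()

        def start_requests(self):
            urls = ['https://www.website.com']
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
            cars = response.css('...')
            for car in cars:
                loader = CarLoader(item=Car(), selector=car)
                # At this point the value is still the brand name, not its ID
                loader.add_value('brand_id', car.css('...').get())
                ...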
items.py:
    import scrapy


    class Car(scrapy.Item):
        name = scrapy.Field()
        brand_id = scrapy.Field()
        established = scrapy.Field()
        ...
itemsloaders.py:
    from itemloaders.processors import TakeFirst, MapCompose
    from scrapy.loader import ItemLoader


    class CarLoader(ItemLoader):
        default_output_processor = TakeFirst()
When I save a new item to the database (this happens in pipeline.py), I don't want cars.brand_id to hold the car's brand name (BMW, Audi, etc.) but the brand's ID, which is stored in brands.id.
What's the proper way to do that? I need to look up the brand name in the brands table and save the matching ID to cars.brand_id, but where should this operation live so that it's logical and Scrapy-correct?
I have tried doing the lookup in MySpider.py as well as in pipeline.py, but it feels a bit dirty in both places, as if it doesn't belong there.
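To make the lookup concrete, here is a minimal sketch of the pipeline variant. It is an illustration under assumptions, not the real project code: the class name BrandIdPipeline and the method lookup_brand_id are hypothetical, the item is a plain dict, and an in-memory sqlite3 database stands in for PostgreSQL so the snippet runs on its own (with psycopg2 the query would go through a cursor and use %s placeholders).

```python
import sqlite3


class BrandIdPipeline:
    """Hypothetical pipeline that swaps a scraped brand name for brands.id."""

    def __init__(self, connection):
        self.connection = connection
        self.cache = {}  # brand name -> id, so each brand is queried only once

    def lookup_brand_id(self, brand_name):
        # Hypothetical helper: returns brands.id, or None for an unknown brand.
        if brand_name not in self.cache:
            row = self.connection.execute(
                "SELECT id FROM brands WHERE name = ?", (brand_name,)
            ).fetchone()
            self.cache[brand_name] = row[0] if row else None
        return self.cache[brand_name]

    def process_item(self, item, spider):
        # Scrapy calls process_item(item, spider) for every scraped item.
        item["brand_id"] = self.lookup_brand_id(item["brand_id"])
        return item


# Stand-in database (assumption: only the id and name columns matter here).
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE brands (id INTEGER PRIMARY KEY, name TEXT)")
connection.executemany(
    "INSERT INTO brands (id, name) VALUES (?, ?)", [(1, "BMW"), (2, "Audi")]
)

pipeline = BrandIdPipeline(connection)
car = pipeline.process_item({"name": "A4", "brand_id": "Audi"}, spider=None)
print(car)  # {'name': 'A4', 'brand_id': 2}
```

The cache is there only so that a crawl with many cars of the same brand does not hit the database once per item.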
It seems this functionality should go in itemsloaders.py, but the purpose of that file is a bit of a mystery to me. How do I resolve this?