My Spider looks like this:
class ExampleSpider(scrapy.Spider):
    name = 'example'

    custom_settings = {
        'ITEM_PIPELINES': {'img_clear.pipelines.DuplicatesPipeline': 100,},
        'FEEDS': {
            'feeds/example/tags.csv': {
                'format': 'csv',
                'fields': ["tag_id", "url", "title"],
                'item_export_kwargs': {
                    'include_headers_line': False,
                },
                'item_classes': [ExampleTagItem],
                'overwrite': False
            },
            'feeds/example/galleries.csv': {
                'format': 'csv',
                'fields': ["id", "url", "tag_ids"],
                'item_export_kwargs': {
                    'include_headers_line': False,
                },
                'item_classes': [ExampleGalleryItem],
                'overwrite': False,
            }
        }
    }
This is the `img_clear.pipelines.DuplicatesPipeline`:
from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem

# assuming the item classes live in the project's items module
from img_clear.items import ExampleTagItem, ExampleGalleryItem


class DuplicatesPipeline:
    def open_spider(self, spider):
        if spider.name == "example":
            # pre-load the ids already written to the feeds so re-runs skip them
            with open("feeds/example/galleries.csv", "r") as rf:
                csv = rf.readlines()
                self.ids_seen = set([str(line.split(",")[0]) for line in csv])
            with open("feeds/example/tags.csv", "r") as rf:
                tags_csv = rf.readlines()
                self.tag_ids_seen = set([str(line.split(",")[0]) for line in tags_csv])

    def process_item(self, item, spider):
        if isinstance(item, ExampleTagItem):
            self.process_example_tag_item(item, spider)
        elif isinstance(item, ExampleGalleryItem):
            self.process_example_gallery_item(item, spider)

    def process_example_tag_item(self, item, spider):
        adapter = ItemAdapter(item)
        if adapter['tag_id'] in self.tag_ids_seen:
            raise DropItem(f"Duplicate item found: {item!r}")
        else:
            self.tag_ids_seen.add(adapter['tag_id'])
            return item

    def process_example_gallery_item(self, item, spider):
        adapter = ItemAdapter(item)
        if adapter['id'] in self.ids_seen:
            raise DropItem(f"Duplicate item found: {item!r}")
        else:
            self.ids_seen.add(adapter['id'])
            return item
With the item pipeline activated it will drop some items (logging: `[scrapy.core.scraper] WARNING: Dropped: Duplicate item found: {'tag_id': '4',...`) and return others (logging: `[scrapy.core.scraper] DEBUG: Scraped from <200 https://www.example.com/10232335/>`), but nothing is written to the files.
Somehow the returned items don't seem to reach the feed exports extension. What am I missing?
- When commenting out the `'ITEM_PIPELINES': {'img_clear.pipelines.DuplicatesPipeline': 100,},` in the `custom_settings`, items are saved in the right csv files.
- Using `scrapy crawl example -o test.csv` will create an empty csv when the pipeline is activated as well. So it seems that the issue is with the pipeline.
- Printing the items right before they should be returned did print correct item information.
- The pipeline is derived from the scrapy docs.
Thanks for the response! I'm not sure if this would actually have fixed it, since the feed was working perfectly with relative paths when the pipeline was deactivated. I might test that anyway at some point.
However, I figured out another mistake in my code that fixed it without changing the paths: the docs state that the `process_item` function must return an item object, return a twisted `Deferred`, or raise a `DropItem` exception. My code was derived from here, but I missed the return statements in the lines calling the `process_..._item` functions. Tbh, I discovered the solution by accident while trying to replicate my issue in a less complex spider; I wrote up something like this and it worked:
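For reference, a minimal sketch of what the corrected dispatch looks like, based on the fix described above (the `open_spider` and `process_..._item` helpers stay exactly as posted earlier; only `process_item` changes):

class DuplicatesPipeline:
    # ... open_spider and the process_..._item helpers unchanged ...

    def process_item(self, item, spider):
        # return the result of the helper calls so the item (or the
        # DropItem exception) actually reaches the feed exports extension
        if isinstance(item, ExampleTagItem):
            return self.process_example_tag_item(item, spider)
        elif isinstance(item, ExampleGalleryItem):
            return self.process_example_gallery_item(item, spider)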
Since I'm very new to coding: any suggestions on how to reduce the repetition in this code? I could use "id" in both Item objects, but I would still need to differentiate between the two sets, so I have no idea how to do this... One possible approach is sketched below.
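One way to cut the repetition (just a sketch, not a drop-in answer): map each item class to its feed file and the field used for de-duplication, and keep one "seen" set per class, so a single `process_item` handles both item types. The `img_clear.items` import path and the `FileNotFoundError` handling are assumptions, not part of the original code.

from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem

# assumed import path for the item classes
from img_clear.items import ExampleTagItem, ExampleGalleryItem


class DuplicatesPipeline:
    # which feed file and which field identify each item class
    FEED_KEYS = {
        ExampleTagItem: ("feeds/example/tags.csv", "tag_id"),
        ExampleGalleryItem: ("feeds/example/galleries.csv", "id"),
    }

    def open_spider(self, spider):
        self.seen = {}
        for cls, (path, _field) in self.FEED_KEYS.items():
            try:
                with open(path) as f:
                    # first csv column holds the previously exported id
                    self.seen[cls] = {line.split(",")[0] for line in f}
            except FileNotFoundError:
                # assumption: start empty if the feed doesn't exist yet
                self.seen[cls] = set()

    def process_item(self, item, spider):
        for cls, (_path, field) in self.FEED_KEYS.items():
            if isinstance(item, cls):
                value = ItemAdapter(item)[field]
                if value in self.seen[cls]:
                    raise DropItem(f"Duplicate item found: {item!r}")
                self.seen[cls].add(value)
                return item
        return item  # anything else passes through untouched

The `if spider.name == "example"` guard from the original `open_spider` is dropped here for brevity; it could be kept if the pipeline is shared between several spiders.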