Scrapy: how to make an async request from FSFilesStore/media pipeline?

What I need is to stat the file based on a HEAD request instead of downloading the entire file. The file is a video (it can be big) and always has the same name. I can determine whether it was updated from the size and last-modified fields in the HEAD response.

Code:

import treq
from scrapy.pipelines.files import FilesPipeline
from scrapy.pipelines.files import FSFilesStore

class FSFilesStoreDBCacheExtended(FSFilesStore):

    async def stat_file(self, path, info):
        try:
            # treq.head() returns a Deferred, awaited here inside the coroutine
            head_resp = await treq.head('https://google.com')
        except Exception as e:
            print(e)  # ConnectError here
            raise

        # treq responses expose twisted.web.http_headers.Headers,
        # so use getRawHeaders() instead of dict-style access
        size = int(head_resp.headers.getRawHeaders('content-length')[0])
        mod_timestamp = head_resp.headers.getRawHeaders('last-modified')[0]
        # ... do some stuff

        return super().stat_file(path, info)


class ImageDownload(FilesPipeline):

    def __init__(self, store_uri, download_func=None, settings=None):
        # register the extended store before the parent __init__
        # resolves the scheme from STORE_SCHEMES
        self.STORE_SCHEMES[''] = FSFilesStoreDBCacheExtended
        self.STORE_SCHEMES['file'] = FSFilesStoreDBCacheExtended
        super().__init__(store_uri, download_func, settings)

treq fails with the following error:

[<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]

Most probably an async coroutine is not allowed here.
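For what it's worth, FilesPipeline wraps the stat_file call in twisted.internet.defer.maybeDeferred (see media_to_download in scrapy/pipelines/files.py), and Scrapy's own S3FilesStore.stat_file returns a Deferred, so returning a Deferred from a plain (non-async) method should stay within the contract. A minimal sketch of that approach using treq's Deferred API; the URL and the _check_headers helper name are placeholders I made up:

import treq
from scrapy.pipelines.files import FSFilesStore

class FSFilesStoreDBCacheExtended(FSFilesStore):

    def stat_file(self, path, info):
        # treq.head() returns a Deferred; chain callbacks instead of awaiting
        dfd = treq.head('https://example.com/video.mp4')  # placeholder URL
        dfd.addCallback(self._check_headers, path, info)
        return dfd

    def _check_headers(self, head_resp, path, info):
        size = int(head_resp.headers.getRawHeaders('content-length', ['0'])[0])
        mod_timestamp = head_resp.headers.getRawHeaders('last-modified', [''])[0]
        # ... compare size/mod_timestamp against the cached values here
        return super().stat_file(path, info)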

There is this answer, but there is no access to the crawler or the spider from the store, and it looks very hacky: I don't want media-processing logic to be in the spider.

Question: how can I make a non-blocking HTTP call from the stat_file method, with treq, any other library, or Scrapy's own tooling?
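One possible workaround, assuming the problem is that maybeDeferred receives a raw coroutine object rather than a Deferred (older Twisted versions do not convert coroutines): keep the coroutine but hand the pipeline a real Deferred via twisted.internet.defer.ensureDeferred. The _stat_file_async name and the URL are placeholders for this sketch:

import treq
from twisted.internet.defer import ensureDeferred
from scrapy.pipelines.files import FSFilesStore

class FSFilesStoreDBCacheExtended(FSFilesStore):

    def stat_file(self, path, info):
        # stat_file itself stays synchronous and returns a Deferred,
        # which maybeDeferred in FilesPipeline handles natively
        return ensureDeferred(self._stat_file_async(path, info))

    async def _stat_file_async(self, path, info):
        head_resp = await treq.head('https://example.com/video.mp4')  # placeholder URL
        size = int(head_resp.headers.getRawHeaders('content-length', ['0'])[0])
        # ... do some stuff
        return super().stat_file(path, info)

This does not by itself explain the ConnectionLost error (which may be environmental, e.g. TLS setup), but it removes the coroutine-vs-Deferred mismatch as a variable.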
