What I need is to stat the file based on a HEAD request instead of downloading the entire file. The file is a video (it can be big) and always has the same name, so I can detect changes from the Content-Length and Last-Modified headers of the HEAD response.
Code:
import treq
from scrapy.pipelines.files import FilesPipeline
from scrapy.pipelines.files import FSFilesStore


class FSFilesStoreDBCacheExtended(FSFilesStore):

    async def stat_file(self, path, info):
        try:
            head_resp = await treq.head('https://google.com')
        except Exception as e:
            print(e)  # ConnectError here
            raise e
        # head_resp.headers is a twisted.web.http_headers.Headers,
        # so values are read with getRawHeaders(), not subscripting
        size = int(head_resp.headers.getRawHeaders('content-length')[0])
        mod_timestamp = head_resp.headers.getRawHeaders('last-modified')[0]
        # ... do some stuff
        return super().stat_file(path, info)


class ImageDownload(FilesPipeline):

    def __init__(self, store_uri, download_func=None, settings=None):
        # Register the extended store for local filesystem URIs
        self.STORE_SCHEMES[''] = FSFilesStoreDBCacheExtended
        self.STORE_SCHEMES['file'] = FSFilesStoreDBCacheExtended
        super().__init__(store_uri, download_func, settings)
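For completeness, the pipeline is enabled the usual way in settings.py (the module path below is a placeholder for my project layout):

ITEM_PIPELINES = {
    'myproject.pipelines.ImageDownload': 1,  # hypothetical module path
}
FILES_STORE = 'files'  # local store URI, so the '' / 'file' schemes above apply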
treq fails with the following error:
[<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
Most probably async def coroutines are simply not supported at this point in the pipeline.
There is this answer, but stat_file has no access to the crawler or the spider, and it looks very hacky: I don't want media-processing logic to live in the spider.
Question: how can I make a non-blocking HTTP call from the stat_file method, with treq, any other library, or Scrapy itself?
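Would returning a plain Deferred instead of a coroutine work? As far as I can tell from the Scrapy source, FilesPipeline calls stat_file through defer.maybeDeferred, which passes a returned Deferred through unchanged. A sketch of what I mean (the URL is a placeholder, and I have not verified that this avoids the error):

import treq
from scrapy.pipelines.files import FSFilesStore


class FSFilesStoreDBCacheExtended(FSFilesStore):

    def stat_file(self, path, info):
        # Plain def returning a Deferred; maybeDeferred should chain on it
        d = treq.head('https://example.com/video.mp4')  # placeholder URL
        d.addCallback(self._on_head_response, path, info)
        return d

    def _on_head_response(self, head_resp, path, info):
        size = int(head_resp.headers.getRawHeaders('content-length')[0])
        mod_timestamp = head_resp.headers.getRawHeaders('last-modified')[0]
        # ... compare against the cached size/timestamp here
        return super().stat_file(path, info)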