Accessing a Google Cloud bucket via FSCrawler (Elasticsearch)

The project I am currently working on needs a search engine for a few tens of thousands of PDF files. When a user searches the website for a keyword, the search engine should return snippets of the PDF files matching the search terms. The user then has the option to click a button to view the entire PDF file.
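
For the snippet part, Elasticsearch's highlighting feature can return the matching fragments directly, so no extra work is needed beyond the query. A minimal sketch with the Python client (8.x style), assuming FSCrawler's default mapping (the extracted text lives in the content field) and a hypothetical index name pdf_docs:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# FSCrawler puts the extracted text in the "content" field;
# "pdf_docs" is a placeholder for your FSCrawler job/index name.
resp = es.search(
    index="pdf_docs",
    query={"match": {"content": "some keyword"}},
    highlight={"fields": {"content": {"fragment_size": 150}}},
)

for hit in resp["hits"]["hits"]:
    # FSCrawler stores file metadata under the "file" object
    print(hit["_source"].get("file", {}).get("filename"))
    for fragment in hit.get("highlight", {}).get("content", []):
        print("  ...", fragment, "...")
```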

I figured that the best way to do this was Elasticsearch + FSCrawler (https://fscrawler.readthedocs.io/en/fscrawler-2.7/). I ran some tests today and was able to crawl a folder on my local machine.
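
For reference, the local test boils down to a small job settings file, which in FSCrawler 2.7 lives at ~/.fscrawler/<job_name>/_settings.yaml. A sketch (the job name, path, and update rate are placeholders):

```yaml
name: "pdf_docs"
fs:
  url: "/path/to/local/pdfs"   # folder FSCrawler watches
  update_rate: "15m"
  includes:
    - "*/*.pdf"
elasticsearch:
  nodes:
    - url: "http://127.0.0.1:9200"
```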

For serving the PDF files via the website, I figured I could store them in a Google Cloud Storage bucket and let users view them through links to that bucket. However, FSCrawler does not seem to be able to access the bucket. Any tips or ideas on how to solve this? Feel free to criticize the approach described above; if there are better ways to let users of the website access the PDF files, I would love to hear them.
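
If the bucket is not meant to be public, one common pattern is to have the website hand out short-lived signed URLs instead of raw bucket links. A sketch using the google-cloud-storage Python client; the bucket and object names are placeholders:

```python
from datetime import timedelta

from google.cloud import storage


def pdf_view_url(bucket_name: str, object_name: str) -> str:
    """Return a short-lived download link for one PDF in the bucket."""
    client = storage.Client()  # uses application default credentials
    blob = client.bucket(bucket_name).blob(object_name)
    return blob.generate_signed_url(
        version="v4",
        expiration=timedelta(minutes=15),
        method="GET",
    )


# e.g. pdf_view_url("my-pdf-bucket", "reports/annual-2020.pdf")
```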

Thanks in advance and kind regards!

1 Answer

You can use s3fs-fuse to mount the bucket into your file system and then use the normal local FS crawler. This works with Google Cloud Storage as well, because GCS exposes an S3-compatible XML API through its interoperability mode.
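
Roughly, the mount could look like this (a sketch: the bucket name, mount point, and HMAC interoperability keys are placeholders, and the exact s3fs options may need tuning for your setup):

```sh
# HMAC key pair generated under "Interoperability" in the GCS settings
echo ACCESS_KEY:SECRET_KEY > ~/.passwd-s3fs
chmod 600 ~/.passwd-s3fs

# Mount the bucket against the GCS endpoint, then point FSCrawler's
# fs.url at /mnt/pdfs as if it were a local folder
mkdir -p /mnt/pdfs
s3fs my-pdf-bucket /mnt/pdfs \
    -o passwd_file=${HOME}/.passwd-s3fs \
    -o url=https://storage.googleapis.com \
    -o sigv2
```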

Alternatively, you can fork FSCrawler and implement a crawler for S3, similar to the existing crawler-ftp module.