I'm trying to specify a Haystack pipeline using the YAML declarative syntax. I want to run a pipeline with two "lanes" whose answers will be merged - one using an EmbeddingRetriever to fetch answers from a query embedding, and one using a (sparse) BM25Retriever. I want each retriever to use the same Elastic instance, accessed via two ElasticsearchDocumentStore instances. Example:
components:
  - name: DenseStore
    type: ElasticsearchDocumentStore
    params:
      embedding_dim: 384  # This parameter is required for the embedding_model
      index: dense_index
  - name: SparseStore
    type: ElasticsearchDocumentStore
    params:
      index: sparse_index
At first I thought the problem was with trying to specify multiple DocumentStore instances, but discovered this wasn't it. The problem seems to be that one must use the name DocumentStore for a document store in the YAML file, which precludes specifying two DocumentStore instances wrapping the same Elastic instance.
My first attempt was to build the pipeline in Python on Colab as described above, using two InMemoryDocumentStore instances. This worked as expected. But when trying to move to a production setting I wanted to use the Haystack Docker image (run with an Elastic instance under Docker compose) and simply read in the YAML to specify the pipeline. When I did this, I would get an error that the Haystack DocumentStores could not connect to Elastic on localhost:9200. Running a test with a simplified YAML pipeline using the name DocumentStore for the document store component does connect successfully to Elastic. Obviously this isn't a solution because I want two DocumentStore instances and they can't have the same name in the pipeline YAML.
You need two separate Retrievers, but one DocumentStore is enough. Each Retriever has a parameter document_store where you specify the name of the DocumentStore it connects to. The DocumentStore can be the same for your sparse Retriever and your dense Retriever. By the way, there are no restrictions on the names, contrary to what you suggest in your question. Here is a pipeline.yaml example with two Retrievers using the same DocumentStore:
As you mentioned the Colab examples, you can try out the Haystack tutorial about pipelines and start Elasticsearch in there as shown in the tutorial on Elasticsearch. You will need to install Haystack with the elasticsearch extra first:
Then install Elasticsearch:
Start an Elasticsearch instance:
Wait until it's running and use it:
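As a rough sketch of that last step, the snippet below polls Elasticsearch until it responds and then builds a pipeline in which a sparse and a dense Retriever share one DocumentStore, as described above. It assumes Haystack 1.x (the farm-haystack package) and an unsecured Elasticsearch on localhost:9200. The helper names wait_for_elasticsearch and build_hybrid_pipeline, the node names, the index name "document", and the embedding model are illustrative choices, not taken from the tutorial.

```python
import time
import urllib.error
import urllib.request


def wait_for_elasticsearch(url="http://localhost:9200", timeout=60.0, interval=1.0):
    """Poll `url` until it answers with HTTP 200 or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not reachable yet, retry after a short pause
        time.sleep(interval)
    return False


def build_hybrid_pipeline(host="localhost", port=9200, index="document"):
    """Build a pipeline with two Retrievers that share one DocumentStore."""
    # Imported lazily so the helper above also works without Haystack installed.
    from haystack import Pipeline
    from haystack.document_stores import ElasticsearchDocumentStore
    from haystack.nodes import BM25Retriever, EmbeddingRetriever, JoinDocuments

    # One store serves both lanes; embedding_dim must match the embedding model.
    store = ElasticsearchDocumentStore(
        host=host, port=port, index=index, embedding_dim=384
    )
    sparse = BM25Retriever(document_store=store)
    dense = EmbeddingRetriever(
        document_store=store,
        # Assumed model with 384-dimensional embeddings; swap in your own.
        embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    )

    pipeline = Pipeline()
    pipeline.add_node(component=sparse, name="SparseRetriever", inputs=["Query"])
    pipeline.add_node(component=dense, name="DenseRetriever", inputs=["Query"])
    pipeline.add_node(
        component=JoinDocuments(join_mode="concatenate"),
        name="JoinResults",
        inputs=["SparseRetriever", "DenseRetriever"],
    )
    return pipeline


# Example usage (not executed here):
#   if wait_for_elasticsearch():
#       pipeline = build_hybrid_pipeline()
#       result = pipeline.run(query="your question")
```

Polling the HTTP endpoint is usually more reliable than a fixed sleep, since startup time varies between machines. Under Docker Compose the host would typically be the Elasticsearch service name rather than localhost, which is worth checking if the connection is refused.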