Haystack pipeline YAML with multiple ElasticsearchDocumentStore components seems impossible


I'm trying to specify a Haystack pipeline using the YAML declarative syntax. I want to run a pipeline with two "lanes" whose answers will be merged - one using an EmbeddingRetriever to fetch answers from a query embedding, and one using a (sparse) BM25Retriever. I want each retriever to use the same Elastic instance, accessed via two ElasticsearchDocumentStore instances. Example:

components:
  - name: DenseStore
    type: ElasticsearchDocumentStore 
    params: 
      embedding_dim: 384 # This parameter is required for the embedding_model
      index: dense_index
  - name: SparseStore
    type: ElasticsearchDocumentStore 
    params: 
      index: sparse_index

At first I thought the problem was with trying to specify multiple DocumentStore instances, but that turned out not to be the issue. The problem seems to be that the document store component must be given the name DocumentStore in the YAML file, which precludes specifying two DocumentStore instances wrapping the same Elastic instance.

My first attempt was to build the pipeline in Python on Colab (roughly as sketched below), using two InMemoryDocumentStore instances. This worked as expected. But when moving to a production setting, I wanted to use the Haystack Docker image (run alongside an Elastic instance under Docker Compose) and simply read in the YAML to specify the pipeline. When I did this, I got an error that the Haystack DocumentStores could not connect to Elastic on localhost:9200. A test with a simplified YAML pipeline that names the document store component DocumentStore does connect to Elastic successfully. Obviously this isn't a solution, because I want two DocumentStore instances and they can't share the same name in the pipeline YAML.
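For reference, the Colab version looked roughly like the sketch below (the store/retriever names and the embedding model are illustrative placeholders, not necessarily the exact ones I used):

from haystack import Pipeline
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import BM25Retriever, EmbeddingRetriever, JoinDocuments

# Two separate in-memory stores, one per "lane"
dense_store = InMemoryDocumentStore(embedding_dim=384)
sparse_store = InMemoryDocumentStore(use_bm25=True)

dense_retriever = EmbeddingRetriever(
    document_store=dense_store,
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",  # a 384-dim model, chosen as an example
)
sparse_retriever = BM25Retriever(document_store=sparse_store)

pipeline = Pipeline()
pipeline.add_node(component=sparse_retriever, name="SparseRetriever", inputs=["Query"])
pipeline.add_node(component=dense_retriever, name="DenseRetriever", inputs=["Query"])
pipeline.add_node(component=JoinDocuments(), name="JoinResults",
                  inputs=["SparseRetriever", "DenseRetriever"])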

There is 1 answer below.

Julian Risch

You need two separate Retrievers, but one DocumentStore is enough. Each Retriever has a document_store parameter where you specify the name of the DocumentStore it connects to, and that DocumentStore can be the same for your sparse Retriever and your dense EmbeddingRetriever. By the way, there is no restriction on component names as you suggest in your question. Here is a pipeline.yaml example with two Retrievers using the same DocumentStore:

components:
- name: MyElasticsearchDocumentStore
  params: {}
  type: ElasticsearchDocumentStore
- name: BM25Retriever
  params:
    document_store: MyElasticsearchDocumentStore
  type: BM25Retriever
- name: EmbeddingRetriever
  params:
    document_store: MyElasticsearchDocumentStore
    embedding_model: sentence-transformers/multi-qa-mpnet-base-dot-v1
  type: EmbeddingRetriever
- name: JoinResults
  params: {}
  type: JoinDocuments
- name: Reader
  params:
    model_name_or_path: deepset/roberta-base-squad2
  type: FARMReader
pipelines:
- name: query
  nodes:
  - inputs:
    - Query
    name: BM25Retriever
  - inputs:
    - Query
    name: EmbeddingRetriever
  - inputs:
    - BM25Retriever
    - EmbeddingRetriever
    name: JoinResults
  - inputs:
    - JoinResults
    name: Reader
version: 1.19.0
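If you save this as, for example, pipeline.yaml (the filename is just an assumption here), you can load and query it from Python roughly like this:

from pathlib import Path

from haystack import Pipeline

# Load the "query" pipeline defined in the YAML above
pipeline = Pipeline.load_from_yaml(Path("pipeline.yaml"), pipeline_name="query")

# Run a query; params are keyed by the node names from the YAML
result = pipeline.run(
    query="What does Haystack do?",
    params={"BM25Retriever": {"top_k": 10}, "EmbeddingRetriever": {"top_k": 10}, "Reader": {"top_k": 5}},
)
print(result["answers"])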

Since you mentioned the Colab examples: you can try out the Haystack tutorial on pipelines and start Elasticsearch in there as shown in the Elasticsearch tutorial. You will need to install Haystack with the elasticsearch extra first:

pip install farm-haystack[colab,inference,elasticsearch]

Then download and extract Elasticsearch:

%%bash

# Download Elasticsearch 7.9.2, unpack it, and hand ownership to the daemon user
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
chown -R daemon:daemon elasticsearch-7.9.2

Start an Elasticsearch instance in the background:

%%bash --bg

# Elasticsearch refuses to run as root, so run it as the daemon user
sudo -u daemon -- elasticsearch-7.9.2/bin/elasticsearch

Wait until it's running and use it:

import os
import time

from haystack.document_stores import ElasticsearchDocumentStore

# Give Elasticsearch some time to finish starting up
time.sleep(30)

# Get the host where Elasticsearch is running, default to localhost
host = os.environ.get("ELASTICSEARCH_HOST", "localhost")

document_store = ElasticsearchDocumentStore(host=host, username="", password="", index="document")
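From there, a single shared store with two Retrievers, mirroring the YAML above, could be assembled in Python along these lines (a sketch reusing the same model names as in the YAML):

from haystack import Pipeline
from haystack.nodes import BM25Retriever, EmbeddingRetriever, FARMReader, JoinDocuments

# Both retrievers share the single ElasticsearchDocumentStore created above
bm25_retriever = BM25Retriever(document_store=document_store)
embedding_retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1",
)
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")

pipeline = Pipeline()
pipeline.add_node(component=bm25_retriever, name="BM25Retriever", inputs=["Query"])
pipeline.add_node(component=embedding_retriever, name="EmbeddingRetriever", inputs=["Query"])
pipeline.add_node(component=JoinDocuments(), name="JoinResults", inputs=["BM25Retriever", "EmbeddingRetriever"])
pipeline.add_node(component=reader, name="Reader", inputs=["JoinResults"])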