I have a set of 1000 documents (plain text) and one user query. I want to retrieve the top k documents that are most relevant to the query using the Python library Haystack with FAISS. Specifically, I want the system to identify the top k sentences that are the closest match to the user query, and then return the documents that contain those sentences. How can I do so?
The following code identifies the top k documents that are the closest match to the user query. How can I change it so that the code instead identifies the top k sentences that are the closest match, and returns the documents that contain them?
# Note: Most of the code is from https://haystack.deepset.ai/tutorials/07_rag_generator
import logging
logging.basicConfig(format="%(levelname)s - %(name)s - %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)
import pandas as pd
from haystack.utils import fetch_archive_from_http
# Download sample
doc_dir = "data/tutorial7/"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/small_generator_dataset.csv.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)
# Create dataframe with columns "title" and "text" (only the first 10 rows, for a quick test)
df = pd.read_csv(f"{doc_dir}/small_generator_dataset.csv", sep=",", nrows=10)
# Minimal cleaning
df.fillna(value="", inplace=True)
print(df.head())
from haystack import Document
# Use data to initialize Document objects
titles = list(df["title"].values)
texts = list(df["text"].values)
documents = []
for title, text in zip(titles, texts):
    documents.append(Document(content=text, meta={"name": title or ""}))
from haystack.document_stores import FAISSDocumentStore
document_store = FAISSDocumentStore(faiss_index_factory_str="Flat", return_embedding=True)
from haystack.nodes import DensePassageRetriever
retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
    use_gpu=True,
    embed_title=True,
)
# Delete existing documents in documents store
document_store.delete_documents()
# Write documents to document store
document_store.write_documents(documents)
# Add documents embeddings to index
document_store.update_embeddings(retriever=retriever)
from haystack import Pipeline
pipeline = Pipeline()
pipeline.add_node(component=retriever, name='Retriever', inputs=['Query'])
from haystack.utils import print_answers
QUESTIONS = [
"who got the first nobel prize in physics",
"when is the next deadpool movie being released",
]
for question in QUESTIONS:
    res = pipeline.run(query=question, params={"Retriever": {"top_k": 5}})
    print(res)
    #print_answers(res, details="all")
To run the code:
conda create -y --name haystacktest python=3.9
conda activate haystacktest
pip install --upgrade pip
conda install pytorch -c pytorch
pip install sentence-transformers
pip install farm-haystack[colab,faiss]==1.17.2
For example, I wonder whether there is a way to amend the FAISS indexing strategy.
As Stefano Fiorucci - anakin87 and bilge suggested, one can attach metadata to the documents being indexed in the vector database. Therefore, one can index each sentence individually and use the metadata to link each sentence back to its original document.
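A minimal, library-agnostic sketch of that idea (plain Python, no Haystack; the naive regex splitter and the helper names here are illustrative, not Haystack APIs — in practice you might split with nltk.sent_tokenize or spaCy, and the "records" would be Haystack Document objects with a meta dict):

```python
import re

def split_into_sentences(text):
    # Naive splitter on sentence-ending punctuation; good enough for a demo.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def build_sentence_records(docs):
    """docs: list of {"id": ..., "text": ...}. Returns one record per sentence,
    each carrying the parent document id in its metadata."""
    records = []
    for doc in docs:
        for sent in split_into_sentences(doc["text"]):
            records.append({"content": sent, "meta": {"parent_id": doc["id"]}})
    return records

def docs_for_top_sentences(top_sentences, docs_by_id):
    """Map retrieved sentence records back to their parent documents,
    preserving rank order and dropping duplicates."""
    seen, parents = set(), []
    for rec in top_sentences:
        pid = rec["meta"]["parent_id"]
        if pid not in seen:
            seen.add(pid)
            parents.append(docs_by_id[pid])
    return parents

docs = [
    {"id": "d1", "text": "Roentgen won the first Nobel Prize in Physics. He discovered X-rays."},
    {"id": "d2", "text": "Deadpool is a film series. Release dates vary."},
]
records = build_sentence_records(docs)
# Pretend the retriever ranked these two sentence records highest:
top = [records[0], records[2]]
parents = docs_for_top_sentences(top, {d["id"]: d for d in docs})
print([d["id"] for d in parents])
```

The same flow applies with Haystack: write one Document per sentence (with the parent reference in its meta), retrieve the top k sentence Documents, then deduplicate by the parent reference.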
Here is bilge's full answer:
Here is an example of code where a vector store is created using metadata with langchain (not haystack, but the same principle applies):
Tested with Python 3.11 with:
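Since the referenced code block is not reproduced above, here is a self-contained sketch of the same principle without any specific vector-store library: embed each sentence, store metadata alongside each vector, search by inner product (a stand-in for a FAISS Flat index), and read the parent document off the best hit's metadata. The toy_embed function is a deterministic placeholder, not a real encoder:

```python
import hashlib
import numpy as np

def toy_embed(text, dim=16):
    # Deterministic stand-in for a real sentence encoder such as DPR
    # (illustrative only; real embeddings capture semantic similarity).
    seed = int.from_bytes(hashlib.md5(text.encode()).digest()[:4], "big")
    rng = np.random.default_rng(seed)
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

# Each vector is stored next to metadata linking it to its source document.
entries = [
    {"content": "Roentgen received the first Nobel Prize in Physics.",
     "meta": {"doc_id": "d1"}},
    {"content": "X-rays were discovered in 1895.",
     "meta": {"doc_id": "d1"}},
    {"content": "Deadpool is a superhero film series.",
     "meta": {"doc_id": "d2"}},
]
index = np.stack([toy_embed(e["content"]) for e in entries])  # rows are unit vectors

query = "Roentgen received the first Nobel Prize in Physics."
scores = index @ toy_embed(query)   # inner product == cosine for unit vectors
order = np.argsort(-scores)         # best match first
best = entries[order[0]]
print(best["meta"]["doc_id"])       # the document containing the best sentence
```

With a real embedding model and FAISS the search step changes, but the key point is unchanged: the metadata travels with each sentence vector, so the parent document is recoverable from any hit.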