Summarization and Topic Extraction with LLMs (private) and LangChain or LlamaIndex using flan-t5-small


Has anyone used LangChain or LlamaIndex to handle single documents longer than 512 tokens? Yes, I know there are other approaches, but it is difficult to find documentation that explains exactly how to use LangChain with a private LLM accessible via an API call; most of the documentation covers the commercial LLMs. If you have, I would appreciate some strategies or sample code explaining how to handle the LLM wrapper with LangChain, specifically for summarization and topic extraction.


There is 1 answer below.

Answer by j3ffyang:

Here's sample code that uses LangChain to orchestrate an open-source LLM, for both embeddings and text-to-text generation. It doesn't matter if a document has more than 512 tokens: you can use the loader.load_and_split() function to load a large document and split it into smaller chunks (reference for the PDF document loader: https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf).
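To make the splitting idea concrete, here is a minimal, dependency-free sketch of what load_and_split accomplishes: fixed-size chunking with overlap. The chunk size and overlap values are arbitrary illustrative choices, and whitespace-separated words are used as a rough stand-in for tokens.

```python
def split_text(text, chunk_size=512, overlap=64):
    """Split text into overlapping chunks so each stays under a model's
    input limit. "Size" here counts whitespace-separated words as a
    rough proxy for tokens."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap  # overlap preserves context across chunk boundaries
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

doc = ("word " * 1200).strip()  # a document well over 512 "tokens"
chunks = split_text(doc)
print(len(chunks))                                  # → 3
print(all(len(c.split()) <= 512 for c in chunks))   # → True
```

In LangChain itself you would normally let a loader or a text splitter (e.g. RecursiveCharacterTextSplitter) do this, but the principle is the same: each chunk individually fits within the model's 512-token limit.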

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import HuggingFaceHub
from langchain.prompts import PromptTemplate
from langchain.chains.retrieval_qa.base import RetrievalQA

# Embeddings from a sentence-transformers model; any HF embedding model
# name works here (e.g. 'bert-base-uncased').
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

# Build a FAISS index. For a large document, pass the chunks from
# loader.load_and_split() to FAISS.from_documents(texts, embeddings) instead.
docsearch = FAISS.from_texts(
    ["harry potter's owl is in the castle. The book is about 'To Kill A Mocking Swan'. There is another monkey"],
    embeddings)

# Text-to-text generation via the Hugging Face Hub inference API
# (requires the HUGGINGFACEHUB_API_TOKEN environment variable to be set).
llm = HuggingFaceHub(
    repo_id="google/flan-t5-base",
    model_kwargs={"temperature": 0.6, "max_length": 500, "max_new_tokens": 200})

prompt_template = """
Compare the book given in question with others in the retriever based on genre and description.
Return a complete sentence with the full title of the book and describe the similarities between the books.

question: {question}
context: {context}
"""

prompt = PromptTemplate(template=prompt_template,
                        input_variables=["context", "question"])
retriever = docsearch.as_retriever()

# chain_type="stuff" concatenates the retrieved chunks into the prompt's
# {context} slot before a single LLM call.
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff",
                                 retriever=retriever,
                                 chain_type_kwargs={"prompt": prompt})
print(qa.run("Which book except 'To Kill A Mocking Bird' is similar to it?"))
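Since the question asks specifically about summarization of documents longer than 512 tokens, the usual LangChain approach is load_summarize_chain with chain_type="map_reduce", which summarizes each chunk independently and then summarizes the combined partial summaries. Here is a dependency-free sketch of that map-reduce strategy, with a stub summarize() standing in for the actual LLM call (the stub just keeps the first sentence; in real use you would call the HuggingFaceHub llm instead):

```python
def summarize(text):
    """Stub for an LLM summarization call: keep the first sentence.
    In real use, replace the body with a call to your LLM wrapper."""
    return text.split(". ")[0] + "."

def map_reduce_summarize(chunks):
    """Map: summarize each chunk independently (each fits in 512 tokens).
    Reduce: join the partial summaries and summarize the result once more."""
    partial = [summarize(c) for c in chunks]
    combined = " ".join(partial)
    return summarize(combined)

chunks = [
    "Chapter one introduces the owl. The castle is described at length.",
    "Chapter two covers the monkey. Several subplots begin here.",
]
print(map_reduce_summarize(chunks))
```

With LangChain proper (in the version the answer's imports suggest), the equivalent is roughly: from langchain.chains.summarize import load_summarize_chain; chain = load_summarize_chain(llm, chain_type="map_reduce"); chain.run(docs), where docs are the chunks returned by loader.load_and_split(). Topic extraction can reuse the same map-reduce pattern with an "extract the main topics" prompt per chunk.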