Running LLM from local disk


I am trying to load an LLM from the local disk of my laptop, which is not working. When I try to load it with the following approach it works as expected and I get a response to my query.

def load_llm():
    # Load the model by its Hugging Face repo id (pulls from the Hub cache)
    llm = CTransformers(
        model="TheBloke/Llama-2-7B-Chat-GGML",
        model_type="llama",
        config={'max_new_tokens': 3000,
                'temperature': 0.01,
                'context_length': 3000}
    )
    return llm

If I change the above method as below, I do not get any response.

def load_llm():
    # Local CTransformers model
    MODEL_BIN_PATH = 'models/llama-2-7b-chat.ggmlv3.q8_0.bin'
    MODEL_TYPE =  'llama'
    MAX_NEW_TOKENS = 3000
    TEMPERATURE = 0.01
    CONTEXT_LENGTH = 3000
    llm = CTransformers(model=MODEL_BIN_PATH,
                        model_type=MODEL_TYPE,
                        config={'max_new_tokens': MAX_NEW_TOKENS,
                                'temperature': TEMPERATURE,
                                'context_length': CONTEXT_LENGTH}
                        )

    return llm

I want to make sure the model is loaded from the local disk instead of being fetched over the Internet.
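For reference, a minimal sanity check along these lines (just a sketch, using the same path as in my snippet) would confirm whether the file even resolves from the current working directory before the model is constructed:

import os
from langchain.llms import CTransformers

# Resolve the relative path against the current working directory and
# fail early if the file is not where the script expects it to be.
MODEL_BIN_PATH = os.path.abspath('models/llama-2-7b-chat.ggmlv3.q8_0.bin')
assert os.path.isfile(MODEL_BIN_PATH), f"Model file not found: {MODEL_BIN_PATH}"

llm = CTransformers(model=MODEL_BIN_PATH,
                    model_type='llama',
                    config={'max_new_tokens': 3000,
                            'temperature': 0.01,
                            'context_length': 3000})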

Below are my import statements.

from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain import PromptTemplate
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import CTransformers
from langchain.chains import RetrievalQA
import chainlit as cl

Any leads are appreciated.


2 Answers

Danielle França:

I don't usually use CTransformers, but I know the latest format is GGUF and GGML has been discontinued, so if you are on the latest versions of CTransformers it likely will not run GGML anymore.

I don't know if this helps you, but if you don't mind changing your approach, I have always used the llama-cpp-python bindings and they have always worked for me for running models locally.

To do this, download the GGUF version of the model you want from TheBloke.

Then run pip install llama-cpp-python (it's possible it will ask for PyTorch to be installed already). After it is installed you can run any GGUF model using:

from llama_cpp import Llama

# n_ctx sets the context window size in tokens
llm = Llama(model_path="./models/7B/llama-model.gguf", n_ctx=3000)
output = llm("Q: Name the planets in the solar system? A: ", max_tokens=32, temperature=0.1)
print(output)

Full documentation is at https://github.com/abetlen/llama-cpp-python

It also has an integration with LangChain.
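For reference, a minimal sketch of that integration (the model path below is just a placeholder, and I'm assuming a recent langchain_community):

from langchain_community.llms import LlamaCpp

# Point LlamaCpp at a local GGUF file; nothing is fetched from the Internet.
llm = LlamaCpp(
    model_path="./models/7B/llama-model.gguf",  # placeholder path to your GGUF file
    n_ctx=3000,
    temperature=0.01,
    max_tokens=512,
)
print(llm.invoke("Q: Name the planets in the solar system? A: "))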

But just switching to a GGUF model may be enough to solve your problem with CTransformers.
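Something along these lines, for example (untested on my side; the filename is just a placeholder for whichever GGUF quantization you download, and I'm assuming a recent langchain_community):

from langchain_community.llms import CTransformers

# Same wrapper as in the question, but pointed at a local GGUF file.
llm = CTransformers(model='models/llama-2-7b-chat.Q8_0.gguf',  # placeholder local GGUF path
                    model_type='llama',
                    config={'max_new_tokens': 3000,
                            'temperature': 0.01,
                            'context_length': 3000})
print(llm.invoke("Tell me a joke"))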

j3ffyang:

Since you're trying to use LangChain to orchestrate your LLM, I would suggest one of the simplest ways: Ollama. Try this:

from langchain_community.llms import Ollama

# llm = Ollama(model="llama2")
llm = Ollama(model="mistral")

print(llm.invoke("Tell me a joke"))

To install Ollama, see https://ollama.com/