Fact retrieval using Mistral-7B-v0.1 (base model)


Context

I pre-trained Mistral-7B-v0.1 (the base model) using the pretrain_chinese_llama_lora.ipynb script provided in the Chinese-LLaMA-Alpaca repository on GitHub.

I trained the base model on a text-completion task using 50 lines of text containing facts: places, persons, and historical & geographical facts.

The lines representing the facts about a single entity (a place, a person, ...) are not contiguous. For example:

<fact #1 about New York>

<fact #1 about John Doe>

<fact #2 about John Doe>

<fact #1 about a river and geography>

<fact #2 about New York>

... 
... 
...

<fact #3 about New York>

Now my goal is to retrieve all the facts about New York using a text-completion prompt, after pre-training the model on this text-completion task.
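
For reference, model_input in the snippet further below is built roughly along these lines (the literal prompt string here is only an illustration, and tokenizer / pt_model are assumed to be the already-loaded tokenizer and pre-trained model):

# Sketch only: build a completion-style prompt for the entity of interest.
# The exact prompt wording is an assumption, not my actual prompt.
prompt = "New York"
model_input = tokenizer(prompt, return_tensors="pt").to(pt_model.device)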

My Observation

I see that even after using diverse beam search decoding, the model is not able to retrieve all the facts/context related to New York.

What did I try?

The snippet for the inference is as follows:

import torch

with torch.no_grad():
    # Diverse beam search: 15 beams split into 15 groups, with a diversity
    # penalty between groups, returning one sequence per group.
    outputs = pt_model.generate(**model_input, max_new_tokens=100, repetition_penalty=1.15,
                                num_beams=15, num_beam_groups=15, diversity_penalty=2.0,
                                num_return_sequences=15)
    model_output = tokenizer.batch_decode(outputs, skip_special_tokens=True)
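
The 15 decoded sequences can then be collapsed into a set of unique candidate lines before inspecting them; a rough post-processing sketch (the substring filter on "New York" is just one possible way to keep the relevant completions):

# Rough sketch, not part of the snippet above: deduplicate the returned
# sequences line by line and keep only lines mentioning the target entity.
candidates = set()
for text in model_output:
    for line in text.splitlines():
        line = line.strip()
        if "New York" in line:  # entity filter is an assumption
            candidates.add(line)
print(len(candidates), "unique candidate facts about New York")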

Why I didn't use a database / RAG:

  1. I have thousands of PDFs of data, and I obviously cannot go through them manually and create DB scripts to store the facts for the entities I am interested in.
  2. RAG will possibly surface similar (not exact) facts based on similarity search, which I want to avoid.
  3. I want to further fine-tune this model for Q&A on my specific use case, hence I need to retrieve as many facts as possible.