I have a User data object with multiple properties (Name, Age, SkillsList, HobbiesList, DiaryEntryList, etc.), and I want to ask questions about that data ("What's your favourite hobby?") without resending the data to my LLM (currently ChatGPT) every time. From my research, I found that I could give the LLM long-term memory with RAG (Retrieval-Augmented Generation).

I know I have to break my data down for query granularity and create an embedding for each chunk, while also embedding the full User object so that relations between properties are captured. So I intend to chunk on several levels, from bottom (each property of a property) to middle (each User property) to top (the full User object). But some nuances are still not clear to me.

Using Pinecone, I know I can index the embeddings, and I understand I can perform a nearest-neighbour similarity search (cosine similarity, dot product, or Euclidean distance) between the question and the relevant pieces of user data. But according to Pinecone's QA bot (on their website), I would then have to retrieve the relevant piece of user data back from Pinecone and send it to the LLM each time. If that piece of data has to be a huge text for the given granularity's response to make sense, this is bad: I would always be resending data to my model for each question asked, which costs a lot of tokens per query and potentially inflates the price of both the query and the response. I've asked Pinecone's bot how to solve this, but it always refers back to resending the context every time.
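To make the setup concrete, here's a rough sketch of the indexing side as I currently picture it. It calls the OpenAI and Pinecone REST endpoints directly instead of going through Pinecone.Net (whose exact API I'm still learning); the environment variable names, chunk IDs, and the `level`/`text` metadata fields are placeholders of my own:

```csharp
// Sketch of the indexing side: embed each chunk with the Ada model and
// upsert it to Pinecone, keeping the source text in metadata so it can be
// retrieved and pasted into prompts later.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

public record Chunk(string Id, string Level, string Text);

public static class RagIndexer
{
    static readonly HttpClient Http = new();

    public static async Task<float[]> EmbedAsync(string text)
    {
        var req = new HttpRequestMessage(HttpMethod.Post,
            "https://api.openai.com/v1/embeddings");
        req.Headers.Authorization = new AuthenticationHeaderValue("Bearer",
            Environment.GetEnvironmentVariable("OPENAI_API_KEY"));
        req.Content = new StringContent(
            JsonSerializer.Serialize(new { model = "text-embedding-ada-002", input = text }),
            Encoding.UTF8, "application/json");

        var resp = await Http.SendAsync(req);
        resp.EnsureSuccessStatusCode();
        using var doc = JsonDocument.Parse(await resp.Content.ReadAsStringAsync());
        return doc.RootElement.GetProperty("data")[0].GetProperty("embedding")
                  .EnumerateArray().Select(e => e.GetSingle()).ToArray();
    }

    public static async Task UpsertAsync(IEnumerable<Chunk> chunks)
    {
        var vectors = new List<object>();
        foreach (var c in chunks)
            vectors.Add(new
            {
                id = c.Id,
                values = await EmbedAsync(c.Text),
                // Source text stored as metadata so it can come back at query time.
                metadata = new { level = c.Level, text = c.Text }
            });

        // PINECONE_INDEX_URL is a placeholder for the index endpoint shown
        // in the Pinecone console.
        var req = new HttpRequestMessage(HttpMethod.Post,
            $"{Environment.GetEnvironmentVariable("PINECONE_INDEX_URL")}/vectors/upsert");
        req.Headers.Add("Api-Key", Environment.GetEnvironmentVariable("PINECONE_API_KEY"));
        req.Content = new StringContent(JsonSerializer.Serialize(new { vectors }),
            Encoding.UTF8, "application/json");
        (await Http.SendAsync(req)).EnsureSuccessStatusCode();
    }
}
```

A single user would then yield chunks at all three levels, e.g. `new Chunk("u1-hobby-0", "bottom", "Hobby: chess")` at the bottom, one chunk per property in the middle, and one big chunk with the whole serialized User object at the top.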
What is the most efficient way to do this? If this approach is somehow wrong, why, and what's the alternative? Please note that I'm using Pinecone.Net (a .NET wrapper for Pinecone) and OpenAI's Embeddings API with the Ada embeddings model (text-embedding-ada-002).
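For reference, here's how I picture the query side, written as another method on the sketch class above (the `topK` value and metadata field names are again my own assumptions):

```csharp
// Query side: embed the question, fetch the nearest chunks from Pinecone,
// and build the prompt from only those chunks.
public static async Task<string> BuildPromptAsync(string question)
{
    var qVec = await EmbedAsync(question);

    var req = new HttpRequestMessage(HttpMethod.Post,
        $"{Environment.GetEnvironmentVariable("PINECONE_INDEX_URL")}/query");
    req.Headers.Add("Api-Key", Environment.GetEnvironmentVariable("PINECONE_API_KEY"));
    req.Content = new StringContent(JsonSerializer.Serialize(new
    {
        vector = qVec,
        topK = 3,               // only the few most similar chunks
        includeMetadata = true  // so the stored source text comes back
    }), Encoding.UTF8, "application/json");

    var resp = await Http.SendAsync(req);
    resp.EnsureSuccessStatusCode();
    using var doc = JsonDocument.Parse(await resp.Content.ReadAsStringAsync());

    var context = string.Join("\n", doc.RootElement.GetProperty("matches")
        .EnumerateArray()
        .Select(m => m.GetProperty("metadata").GetProperty("text").GetString()));

    // This context travels to the LLM with every single question, which is
    // the token cost I'm worried about.
    return $"Context:\n{context}\n\nQuestion: {question}";
}
```

If my understanding is correct, this retrieval step is where the resending happens on every call, no matter how finely I chunk the data.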
What tools (vector DB + LLM) would you recommend for doing this for free while maintaining effectiveness?
Why don't any LLMs support long-term memory natively?
Any help is much appreciated!
(My question is an exploration of the correct way to do this)