Adding Context to a RAG-Based Chatbot Using Python and FAISS (Part 2)

In Part 1, I built a lightweight Retrieval-Augmented Generation (RAG) chatbot using Python, FAISS, and OpenAI — trained on articles from my site, PointsCrowd. The result: a functional chatbot that could answer domain-specific questions based on a custom knowledge base.
But while the responses were accurate, the chatbot had one major flaw: it was stateless. It couldn’t remember anything from one question to the next.
In Part 2, I set out to fix that — and teach the bot how to maintain context across turns. In this post (and in the video), I’ll walk through what worked, what didn’t, and how to build a more natural, context-aware chatbot that doesn’t require any extra models or hosted memory systems.
Why Context Matters
A basic bot might respond well to:
User: How can I get the Ritz-Carlton credit card?
Bot: [Gives helpful answer]
But completely break when followed with:
User: What did I ask about before?
Bot: I don’t know
That’s what happens when each prompt is treated in isolation. To fix this, we need the bot to retain and understand conversation history.
First Attempt: Appending History to the Prompt
Initially, I tried simply appending all past exchanges to the prompt:
prompt = "\n".join(past_turns) + "\nUser: " + new_question
It worked briefly, but it quickly confused retrieval: with RAG, the similarity search degraded as soon as the prompt ballooned with unrelated context (hotels → lounges → upgrades).
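Roughly, the naive version looked like this (a reconstruction; the sample turns are illustrative and db is the FAISS vector store built in Part 1):

# Naive version (reconstructed): history and the new question fused into one string
past_turns = [
    "User: How can I get the Ritz-Carlton credit card?",
    "Bot: You must first hold a Chase Bonvoy card, then upgrade...",
    "User: Which lounges can I visit with Priority Pass?",
    "Bot: Priority Pass gets you into...",
]
new_question = "Can I use points to upgrade my room?"

prompt = "\n".join(past_turns) + "\nUser: " + new_question

# The same growing blob doubled as the retrieval query, so the search
# embedding ends up mixing cards, lounges, and upgrades
docs = db.similarity_search(prompt, k=3)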
What Worked: Structured Prompt + Isolated Context
Instead of jamming everything together, I moved to a template-based prompt using ChatPromptTemplate, with four clear parts:
- Instructions
- Retrieved context
- Conversation history
- Current query
Here’s a simplified version:
from langchain_core.prompts import ChatPromptTemplate

template = ChatPromptTemplate.from_messages([
    ("system", system_instructions),
    ("user", "Context:\n{context}"),
    ("user", "Conversation history:\n{history}"),
    ("user", "Current question:\n{query}"),
])
This helps the LLM treat context and chat history separately, avoiding the problem of confusing or vague inputs.
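As a quick sanity check (illustrative values; this assumes system_instructions has no placeholders of its own), filling the template shows the four parts arriving as separate chat messages rather than one merged blob:

messages = template.format_messages(
    context="The Ritz-Carlton card is only available as an upgrade from a Chase Bonvoy card...",
    history="user: How can I get the Ritz-Carlton credit card?",
    query="What did I ask about before?",
)

# One SystemMessage followed by three HumanMessage objects,
# each keeping its own section intact
for message in messages:
    print(type(message).__name__, message.content[:50])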
Instead of using a retriever chain, I now manually retrieve the top k documents from FAISS, filter them by a score threshold, and feed them into the prompt.
docs = db.similarity_search_with_score(
    query=query,
    k=3,
    score_threshold=0.5,
)
Why the threshold?
Without it, even totally unrelated queries return something, simply because k=3, and that risks misleading the model with weak context. The threshold keeps the retrieved context relevant and gives the model something genuinely useful to work with.
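The flip side, which the snippet above doesn't show, is that the result list can now come back empty. A small guard (just a sketch) keeps the prompt honest in that case:

docs = db.similarity_search_with_score(query=query, k=3, score_threshold=0.5)

if docs:
    context = "\n\n".join(doc.page_content for doc, _ in docs)
else:
    # Nothing cleared the threshold: say so instead of padding
    # the prompt with loosely related passages
    context = "No relevant articles were found for this question."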
To validate what we get, I also log the results:
log_line = "\n".join([
    f"[{score:.2f}] {doc.metadata['source']}" for doc, score in docs
])
This helps ensure retrieval is high-quality before generation kicks in.
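The post doesn't show the logger setup itself; a minimal standard-library version (the logger name is arbitrary) could emit that line once per request:

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("rag-chatbot")

# One entry per request: similarity score plus source article for each hit
logger.info("Retrieval for %r:\n%s", query, log_line)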
Multi-User Memory Isolation
Once our chatbot had context, it needed memory — and memory introduced state.
I created a ConversationHistory class to handle per-user chat history, complete with token truncation:
from collections import defaultdict


class ConversationHistory:
    def __init__(self):
        self.history = defaultdict(list)
        self.max_tokens = 1024
        self.token_buffer = 500  # Reserved for prompt/response

    def add(self, user_id, role, content):
        self.history[user_id].append({"role": role, "content": content})
        self.truncate(user_id)

    def get(self, user_id):
        return self.history[user_id]

    def truncate(self, user_id):
        # Ensure total token usage stays within budget by dropping the oldest turns
        while self._token_count(self.history[user_id]) > self.max_tokens:
            self.history[user_id].pop(0)

    def _token_count(self, messages):
        # Rough approximation (word count); swap in tiktoken for exact counts
        return sum(len(m["content"].split()) for m in messages)
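A quick standalone check of the class (user IDs and messages are made up) confirms that histories never bleed across users:

store = ConversationHistory()

store.add("user-123", "user", "How can I get the Ritz-Carlton credit card?")
store.add("user-123", "assistant", "You must first hold a Chase Bonvoy card...")
store.add("user-456", "user", "Which lounges accept Priority Pass?")

print(len(store.get("user-123")))  # 2, only this user's turns
print(len(store.get("user-456")))  # 1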
Then, in the main API flow:
from pydantic import BaseModel

class AskRequest(BaseModel):
    user_id: str
    query: str

# One shared, in-memory history store per process
history_store = ConversationHistory()

@app.post("/ask")
def ask(request: AskRequest):
    user_id = request.user_id
    query = request.query
    # Retrieve docs that clear the score threshold
    docs = db.similarity_search_with_score(query, k=3, score_threshold=0.5)
    # Fetch this user's conversation history and flatten it for the prompt
    history_text = "\n".join(
        f"{turn['role']}: {turn['content']}" for turn in history_store.get(user_id)
    )
    # Format context from retrieved docs
    context = "\n\n".join(doc.page_content for doc, _ in docs)
    # Fill the template
    messages = template.format_messages(
        context=context,
        history=history_text,
        query=query,
    )
    # Run the LLM
    response = llm.invoke(messages)
    # Record the exchange
    history_store.add(user_id, "user", query)
    history_store.add(user_id, "assistant", response.content)
    return {"response": response.content}
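To see the memory working end to end, FastAPI's TestClient can drive two turns for the same user (the request shape matches the AskRequest model above; the exact answers will of course depend on your index and model):

from fastapi.testclient import TestClient

client = TestClient(app)

# Two turns for the same user: the second relies on stored history
first = client.post(
    "/ask", json={"user_id": "demo", "query": "How can I get the Ritz-Carlton credit card?"}
)
second = client.post(
    "/ask", json={"user_id": "demo", "query": "What did I ask about before?"}
)

print(first.json()["response"])
print(second.json()["response"])  # should reference the Ritz-Carlton question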
A Real Example
Turn 1
User: How can I get the Ritz-Carlton credit card?
Bot: You must first hold a Chase Bonvoy card, then you can upgrade…
Turn 2
User: What did I ask about before?
Bot: You asked how to get the Ritz-Carlton credit card from Chase.
Without conversation history, Turn 2 would get an “I don’t know.” With it, the bot nails it.
Try It Live
I’ve deployed this updated version to PointsCrowd. You can interact with the live bot — and it’ll remember what you asked (within the session).
Stack Summary
| Component | Tool |
| --- | --- |
| Vector DB | FAISS |
| LLM base model | OpenAI (GPT-4o-mini) |
| Backend | Python, FastAPI |
| Prompting | LangChain templates |
| Memory handling | Custom (in-memory dictionary, per user) |
What We Achieved
We now have:
- A stateful RAG-based chatbot
- True context-awareness across turns
- Clean user separation and prompt safety
- Transparent logs with retrieval scores to debug flow
- Clear source attribution in answers
Watch the Full Video
Want to see it all in action, with code walkthroughs and debugging tips?
Watch Part 2 on YouTube
Get the code on GitHub: repo link
If you’re building your own AI assistant or just curious how these LLMs work under the hood, I hope this post helped. Always open to questions, suggestions, or “you should’ve done it this way” feedback — feel free to reach out.