Adding Context to a RAG-Based Chatbot Using Python and FAISS (Part 2)

In Part 1, I built a lightweight Retrieval-Augmented Generation (RAG) chatbot using Python, FAISS, and OpenAI — trained on articles from my site, PointsCrowd. The result: a functional chatbot that could answer domain-specific questions based on a custom knowledge base.
But while the responses were accurate, the chatbot had one major flaw: it was stateless. It couldn’t remember anything from one question to the next.
In Part 2, I set out to fix that — and teach the bot how to maintain context across turns. In this post (and in the video), I’ll walk through what worked, what didn’t, and how to build a more natural, context-aware chatbot that doesn’t require any extra models or hosted memory systems.
Why Context Matters
A basic bot might respond well to:
User: How can I get the Ritz-Carlton credit card?
Bot: [Gives helpful answer]
But completely break when followed with:
User: What did I ask about before?
Bot: I don’t know
That’s what happens when each prompt is treated in isolation. To fix this, we need the bot to retain and understand conversation history.
First Attempt: Appending History to the Prompt
Initially, I tried simply appending all past exchanges to the prompt:
prompt = "\n".join(past_turns) + "\nUser: " + new_question
It worked briefly, but it quickly confused retrieval: with RAG, the similarity search degraded as soon as the prompt ballooned with unrelated context (hotels → lounges → upgrades).
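Roughly, the naive version looked like this (a reconstruction; the sample turns are illustrative and db is the FAISS vector store built in Part 1):

# Naive version (reconstructed): history and the new question fused into one string
past_turns = [
    "User: How can I get the Ritz-Carlton credit card?",
    "Bot: You must first hold a Chase Bonvoy card, then upgrade...",
    "User: Which lounges can I visit with Priority Pass?",
    "Bot: Priority Pass gets you into...",
]
new_question = "Can I use points to upgrade my room?"

prompt = "\n".join(past_turns) + "\nUser: " + new_question

# The same growing blob doubled as the retrieval query, so the search
# embedding ends up mixing cards, lounges, and upgrades
docs = db.similarity_search(prompt, k=3)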
What Worked: Structured Prompt + Isolated Context
Instead of jamming everything together, I moved to a template-based prompt using ChatPromptTemplate, with four clear parts:
- Instructions
- Retrieved context
- Conversation history
- Current query
Here’s a simplified version:
from langchain_core.prompts import ChatPromptTemplate

template = ChatPromptTemplate.from_messages([
    ("system", system_instructions),
    ("user", "Context:\n{context}"),
    ("user", "Conversation history:\n{history}"),
    ("user", "Current question:\n{query}"),
])
This helps the LLM treat context and chat history separately, avoiding the problem of confusing or vague inputs.
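As a quick sanity check (illustrative values; this assumes system_instructions has no placeholders of its own), filling the template shows the four parts arriving as separate chat messages rather than one merged blob:

messages = template.format_messages(
    context="The Ritz-Carlton card is only available as an upgrade from a Chase Bonvoy card...",
    history="user: How can I get the Ritz-Carlton credit card?",
    query="What did I ask about before?",
)

# One SystemMessage followed by three HumanMessage objects,
# each keeping its own section intact
for message in messages:
    print(type(message).__name__, message.content[:50])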
Instead of using a retriever chain, I now manually retrieve the top k documents from FAISS, filter them by a score threshold, and feed them into the prompt.
docs = db.similarity_search_with_score(
    query=query,
    k=3,
    score_threshold=0.5,
)
Why the threshold?
Without it, even totally unrelated queries return something, simply because k=3, and that risks misleading the model with weak context. The threshold keeps the retrieved context relevant and gives the model something genuinely useful to work with.
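The flip side, which the snippet above doesn't show, is that the result list can now come back empty. A small guard (just a sketch) keeps the prompt honest in that case:

docs = db.similarity_search_with_score(query=query, k=3, score_threshold=0.5)

if docs:
    context = "\n\n".join(doc.page_content for doc, _ in docs)
else:
    # Nothing cleared the threshold: say so instead of padding
    # the prompt with loosely related passages
    context = "No relevant articles were found for this question."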
To validate what we get, I also log the results:
log_line = "\n".join([
    f"[{score:.2f}] {doc.metadata['source']}" for doc, score in docs
])
This helps ensure retrieval is high-quality before generation kicks in.
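The post doesn't show the logger setup itself; a minimal standard-library version (the logger name is arbitrary) could emit that line once per request:

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("rag-chatbot")

# One entry per request: similarity score plus source article for each hit
logger.info("Retrieval for %r:\n%s", query, log_line)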
Multi-User Memory Isolation
Once our chatbot had context, it needed memory — and memory introduced state.
I created a ConversationHistory class to handle per-user chat history, complete with token truncation:
from collections import defaultdict


class ConversationHistory:
    def __init__(self):
        self.history = defaultdict(list)
        self.max_tokens = 1024
        self.token_buffer = 500  # Reserved for prompt/response

    def add(self, user_id, role, content):
        self.history[user_id].append({"role": role, "content": content})
        self.truncate(user_id)

    def get(self, user_id):
        return self.history[user_id]

    def truncate(self, user_id):
        # Ensure total token usage stays within budget by dropping the oldest turns
        while self._token_count(self.history[user_id]) > self.max_tokens:
            self.history[user_id].pop(0)

    def _token_count(self, messages):
        # Rough approximation (word count); swap in tiktoken for exact counts
        return sum(len(m["content"].split()) for m in messages)
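A quick standalone check of the class (user IDs and messages are made up) confirms that histories never bleed across users:

store = ConversationHistory()

store.add("user-123", "user", "How can I get the Ritz-Carlton credit card?")
store.add("user-123", "assistant", "You must first hold a Chase Bonvoy card...")
store.add("user-456", "user", "Which lounges accept Priority Pass?")

print(len(store.get("user-123")))  # 2, only this user's turns
print(len(store.get("user-456")))  # 1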
Then, in the main API flow:
from pydantic import BaseModel

class AskRequest(BaseModel):
    user_id: str
    query: str

# One shared, in-memory history store per process
history_store = ConversationHistory()

@app.post("/ask")
def ask(request: AskRequest):
    user_id = request.user_id
    query = request.query
    # Retrieve docs that clear the score threshold
    docs = db.similarity_search_with_score(query, k=3, score_threshold=0.5)
    # Fetch this user's conversation history and flatten it for the prompt
    history_text = "\n".join(
        f"{turn['role']}: {turn['content']}" for turn in history_store.get(user_id)
    )
    # Format context from retrieved docs
    context = "\n\n".join(doc.page_content for doc, _ in docs)
    # Fill the template
    messages = template.format_messages(
        context=context,
        history=history_text,
        query=query,
    )
    # Run the LLM
    response = llm.invoke(messages)
    # Record the exchange
    history_store.add(user_id, "user", query)
    history_store.add(user_id, "assistant", response.content)
    return {"response": response.content}
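To see the memory working end to end, FastAPI's TestClient can drive two turns for the same user (the request shape matches the AskRequest model above; the exact answers will of course depend on your index and model):

from fastapi.testclient import TestClient

client = TestClient(app)

# Two turns for the same user: the second relies on stored history
first = client.post(
    "/ask", json={"user_id": "demo", "query": "How can I get the Ritz-Carlton credit card?"}
)
second = client.post(
    "/ask", json={"user_id": "demo", "query": "What did I ask about before?"}
)

print(first.json()["response"])
print(second.json()["response"])  # should reference the Ritz-Carlton question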
A Real Example
Turn 1
User: How can I get the Ritz-Carlton credit card?
Bot: You must first hold a Chase Bonvoy card, then you can upgrade…
Turn 2
User: What did I ask about before?
Bot: You asked how to get the Ritz-Carlton credit card from Chase.
Without conversation history, Turn 2 would get an “I don’t know.” With it, the bot nails it.
Try It Live
I’ve deployed this updated version to PointsCrowd. You can interact with the live bot — and it’ll remember what you asked (within the session).
Stack Summary
| Component | Tool |
| --- | --- |
| Vector DB | FAISS |
| LLM base model | OpenAI (GPT-4o-mini) |
| Backend | Python, FastAPI |
| Prompting | LangChain templates |
| Memory handling | Custom (in-memory dictionary, per user) |
What We Achieved
We now have:
- A stateful RAG-based chatbot
- True context-awareness across turns
- Clean user separation and prompt safety
- Transparent logs with retrieval scores to debug flow
- Clear source attribution in answers
Watch the Full Video
Want to see it all in action, with code walkthroughs and debugging tips?
Watch Part 2 on YouTube
Get the code on GitHub: repo link
If you’re building your own AI assistant or just curious how these LLMs work under the hood, I hope this post helped. Always open to questions, suggestions, or “you should’ve done it this way” feedback — feel free to reach out.