Adding Context to a RAG-Based Chatbot Using Python and FAISS (Part 2)

In Part 1, I built a lightweight Retrieval-Augmented Generation (RAG) chatbot using Python, FAISS, and OpenAI — trained on articles from my site, PointsCrowd. The result: a functional chatbot that could answer domain-specific questions based on a custom knowledge base.

But while the responses were accurate, the chatbot had one major flaw: it was stateless. It couldn’t remember anything from one question to the next.

In Part 2, I set out to fix that — and teach the bot how to maintain context across turns. In this post (and in the video), I’ll walk through what worked, what didn’t, and how to build a more natural, context-aware chatbot that doesn’t require any extra models or hosted memory systems.

Why Context Matters

A basic bot might respond well to:

User: How can I get the Ritz-Carlton credit card?
Bot: [Gives helpful answer]

But completely break when followed with:

User: What did I ask about before?
Bot: I don’t know

That’s what happens when each prompt is treated in isolation. To fix this, we need the bot to retain and understand conversation history.

First Attempt: Appending History to the Prompt

Initially, I tried simply appending all past exchanges to the prompt:

prompt = "\n".join(past_turns) + "\nUser: " + new_question

It worked briefly — but quickly confused the retrieval process. Because we’re using RAG (retrieval-augmented generation), our similarity search failed once the prompt ballooned with unrelated context (hotels → lounges → upgrades).
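To make the failure concrete, here’s a minimal sketch (my own variable names, assuming the FAISS index db from Part 1) of why this hurts retrieval: the similarity search ends up running over the whole concatenated prompt instead of just the new question.

# Hypothetical turns; the point is that the concatenated prompt mixes topics.
past_turns = [
    "User: Which hotel lounges are worth visiting in Tokyo?",
    "Assistant: ...",
]
new_question = "How can I get the Ritz-Carlton credit card?"

bloated_prompt = "\n".join(past_turns) + "\nUser: " + new_question

# Embedding the full prompt dilutes the query across unrelated topics...
docs_from_prompt = db.similarity_search(bloated_prompt, k=3)

# ...while embedding only the latest question keeps retrieval focused.
docs_from_question = db.similarity_search(new_question, k=3)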

What Worked: Structured Prompt + Isolated Context

Instead of jamming everything together, I moved to a template-based prompt using ChatPromptTemplate, with four clear parts:

  1. Instructions
  2. Retrieved context
  3. Conversation history
  4. Current query

Here’s a simplified version:

from langchain_core.prompts import ChatPromptTemplate

template = ChatPromptTemplate.from_messages([
    ("system", system_instructions),
    ("user", "Context:\n{context}"),
    ("user", "Conversation history:\n{history}"),
    ("user", "Current question:\n{query}")
])

This helps the LLM treat context and chat history separately, avoiding the problem of confusing or vague inputs.
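The system_instructions string isn’t shown above, so here’s a hypothetical version (the wording used on PointsCrowd differs), along with how the filled template renders into chat messages:

# Hypothetical instructions; the production wording is different.
system_instructions = (
    "You are a PointsCrowd assistant. Answer using only the provided context. "
    "If the context does not cover the question, say so instead of guessing."
)

# Rendering the template produces the list of chat messages sent to the LLM.
messages = template.format_messages(
    context="(retrieved article excerpts)",
    history="(previous turns)",
    query="How can I get the Ritz-Carlton credit card?"
)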

Instead of using a retriever chain, I now manually retrieve the top k documents from FAISS, filter them by a score threshold, and feed them into the prompt.

docs = db.similarity_search_with_score(
    query=query,
    k=3,
    score_threshold=0.5
)

Why the threshold?

Without it, even totally unrelated queries return something, just because k=3. That risks misleading the model with weak context. This keeps the retrieved context relevant, avoids low-quality matches, and gives the model something useful to work with.
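One side effect worth handling: with a threshold, the result list can come back empty. A small guard (my own addition, not shown in the post) keeps the prompt from being built with a blank context:

docs = db.similarity_search_with_score(query=query, k=3, score_threshold=0.5)

if docs:
    context = "\n\n".join(doc.page_content for doc, _ in docs)
else:
    # Nothing similar enough in the knowledge base; say so explicitly
    # instead of leaving the context slot empty.
    context = "No relevant articles were found in the knowledge base."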

To validate what we get, I also log the results:

log_line = "\n".join([
    f"[{score:.2f}] {doc.metadata['source']}" for doc, score in docs
])

This helps ensure retrieval is high-quality before generation kicks in.
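If you want those scores to show up alongside your server logs, a minimal wiring with Python’s standard logging module (my own sketch) looks like this:

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("rag-chatbot")

# Emit the retrieved sources and scores before the generation step.
logger.info("Retrieved documents:\n%s", log_line)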

Multi-User Memory Isolation

Once our chatbot had context, it needed memory — and memory introduced state.

I created a ConversationHistory class to handle per-user chat history, complete with token truncation:

from collections import defaultdict

class ConversationHistory:
    def __init__(self):
        self.history = defaultdict(list)
        self.max_tokens = 1024
        self.token_buffer = 500  # Reserved for prompt/response

    def add(self, user_id, role, content):
        self.history[user_id].append({"role": role, "content": content})
        self.truncate(user_id)

    def get(self, user_id):
        return self.history[user_id]

    def truncate(self, user_id):
        # Ensure total token usage stays within budget by dropping oldest turns
        while self._token_count(self.history[user_id]) > self.max_tokens:
            self.history[user_id].pop(0)

    def _token_count(self, messages):
        # Rough estimate (this helper isn't spelled out in the post):
        # treat each whitespace-separated word as roughly one token.
        return sum(len(m["content"].split()) for m in messages)
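The _token_count helper above uses a rough word-count estimate. For counts that match the model’s tokenizer, you could swap in tiktoken (an extra dependency I’m assuming here, not mentioned in the post):

import tiktoken

# o200k_base is the encoding used by the GPT-4o family of models.
_encoding = tiktoken.get_encoding("o200k_base")

def token_count(messages):
    # Exact token count of the message contents (roles excluded for simplicity).
    return sum(len(_encoding.encode(m["content"])) for m in messages)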

Then, in the main API flow:

# A single shared instance keeps per-user history in memory
conversation_history = ConversationHistory()

@app.post("/ask")
async def ask(request: Request):
    body = await request.json()
    user_id = body["user_id"]
    query = body["query"]
    # Retrieve docs
    docs = db.similarity_search_with_score(query, k=3, score_threshold=0.5)
    # Fetch this user's conversation history
    history = conversation_history.get(user_id)
    # Format context from retrieved docs
    context = "\n\n".join([doc.page_content for doc, _ in docs])
    # Fill the template
    filled_prompt = template.format(
        context=context,
        history=history,
        query=query
    )
    # Run the LLM
    response = llm.invoke(filled_prompt)
    # Record the exchange
    conversation_history.add(user_id, "user", query)
    conversation_history.add(user_id, "assistant", response.content)
    return {"response": response.content}
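To exercise the endpoint locally, a quick client-side test (assuming the API runs on localhost:8000 and the requests package is installed) can look like this:

import requests

resp = requests.post(
    "http://localhost:8000/ask",
    json={"user_id": "demo-user", "query": "How can I get the Ritz-Carlton credit card?"},
)
print(resp.json()["response"])

# Follow-up with the same user_id, which exercises the per-user memory:
resp = requests.post(
    "http://localhost:8000/ask",
    json={"user_id": "demo-user", "query": "What did I ask about before?"},
)
print(resp.json()["response"])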

A Real Example

Turn 1

User: How can I get the Ritz-Carlton credit card?
Bot: You must first hold a Chase Bonvoy card, then you can upgrade…

Turn 2

User: What did I ask about before?
Bot: You asked how to get the Ritz-Carlton credit card from Chase.

Without context, Turn 2 would get the same “I don’t know” we saw earlier. With it, the bot nails it.

Try It Live

I’ve deployed this updated version to PointsCrowd. You can interact with the live bot — and it’ll remember what you asked (within the session).

Stack Summary

Component          Tool
Vector DB          FAISS
LLM base model     OpenAI (GPT-4o-mini)
Backend            Python, FastAPI
Prompting          LangChain templates
Memory handling    Custom (in-memory dictionary, per user)

What We Achieved

We now have:

  • A stateful RAG-based chatbot
  • True context-awareness across turns
  • Clean user separation and prompt safety
  • Transparent logs with retrieval scores to debug flow
  • Clear source attribution in answers

Watch the Full Video

Want to see it all in action, with code walkthroughs and debugging tips?

Watch Part 2 on YouTube

Get the code on GitHub: repo link

If you’re building your own AI assistant or just curious how these LLMs work under the hood, I hope this post helped. Always open to questions, suggestions, or “you should’ve done it this way” feedback — feel free to reach out.
