Build Your Own LLM Agent with RAG from Scratch (Python, no Agent Frameworks)

The Why

... or rather Why not use something like LangChain?

I find that frameworks like LangChain make building agentic LLM applications "feel" a lot more complicated than it actually is. To better understand the agentic underpinnings (tool calling, context, and agent loops), we will forgo these frameworks and build the primitives ourselves.

In this post, we will build Trivi-Al, a tiny chatbot that can ingest a document of your choosing and answer questions about it.

What you will need

Python 3.10 or newer (the code uses type hints such as dict[str, Any] | None)
An OpenAI API Key

Setting up

1. Creating a virtual environment

From the project folder, run:

python -m venv venv

Then activate it (on Windows, run venv\Scripts\activate instead):

source venv/bin/activate

You should now see something like (venv) at the start of your terminal prompt. This means any packages you install will go into this project environment.

2. Installing dependencies

Create a requirements.txt file and add these packages:

openai
python-dotenv
pytest

Here is what each package is for:

openai lets us call the LLM
python-dotenv lets us load secrets from a .env file
pytest lets us write a few tests as the project grows

Now install the dependencies:

python -m pip install -r requirements.txt

3. Setting the API key

Create a .env file in the project root:

OPENAI_API_KEY=your_key_here

The load_dotenv() call in main.py will load this automatically. We will get to this later.
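
Before going further, it can help to confirm that the key is actually visible to Python. The sketch below uses only the standard library and assumes the variable has already been loaded into the process environment, which is what load_dotenv() does. require_env is a hypothetical helper, not part of the project, and DEMO_API_KEY is a throwaway name used so the sketch runs without a real key.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; check your .env file")
    return value


# Demo with a throwaway variable so the sketch runs without a real key.
os.environ["DEMO_API_KEY"] = "sk-demo-123"
key = require_env("DEMO_API_KEY")

# Print only a short prefix so the secret never lands in logs.
print(f"Key loaded: {key[:3]}...")
```

In the real project you would call require_env("OPENAI_API_KEY") right after load_dotenv().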

Prompts

Create the prompt as a markdown file. This keeps the prompt easy to read and edit without touching the code.

Create a file called prompts/system_prompt.md:

Answer the user's question using only information retrieved from the indexed document.

Return only valid JSON.
Do not wrap it in markdown.
Do not include any explanation outside the JSON.

Your response must contain role, content, and tool_call.
Use role='assistant'.
Use content for the natural-language answer to the user.

Before answering any document-related question, you must request exactly one DocStore tool call.

Instructions for searching the document:
1. Extract exactly 1 keyword from the user's query that can be used for search.
2. Use search_document to retrieve k related chunks.
3. Answer based on these retrieved chunks.

Request the tool call by setting tool_call.tool_name to the tool name and tool_call.args to a JSON object containing the tool arguments.

Do not answer document-related questions from memory or prior knowledge.
Before receiving a tool result, leave content empty when requesting a tool call.

When the conversation contains a message beginning with 'Tool result from', use that tool result to answer the user's original question.
After receiving a tool result, answer only from the tool result.

If the tool result does not contain enough information, say that the indexed document does not contain enough information.

Do not request another tool call after receiving a tool result.
After receiving a tool result, set tool_call.tool_name to an empty string and tool_call.args to an empty object.

The JSON must match this schema:

{{ response_format }}

Tools:

{{ tools_registry }}

This prompt lays out the 'protocol' the LLM must follow when communicating with the user and with the rest of the code we are about to write. The {{ response_format }} and {{ tools_registry }} placeholders are filled in at runtime.
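
To make the protocol concrete, here is a sketch of the two reply shapes the prompt asks for, along with a small check that tells them apart. The example values and the is_tool_request helper are illustrative assumptions, not output captured from a real model.

```python
import json

# A reply that requests a tool call: content is empty, tool_call is filled in.
tool_request = json.loads("""
{
  "role": "assistant",
  "content": "",
  "tool_call": {"tool_name": "search_document", "args": {"query": "trunk", "k": 5}}
}
""")

# A final reply: content carries the answer, tool_call is blanked out.
final_answer = json.loads("""
{
  "role": "assistant",
  "content": "Their trunks contain over 40,000 muscles.",
  "tool_call": {"tool_name": "", "args": {}}
}
""")


def is_tool_request(reply: dict) -> bool:
    """True when the reply names a tool to run (hypothetical helper)."""
    tool_call = reply.get("tool_call") or {}
    return bool(tool_call.get("tool_name"))


print(is_tool_request(tool_request))  # True
print(is_tool_request(final_answer))  # False
```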

Prompt Management

Next, create src/prompt_manager.py, which lets us render our prompts dynamically at runtime.

from __future__ import annotations
from pathlib import Path


class PromptManager:
    def __init__(self, template: str):
        self.template = template

    @classmethod
    def from_file(cls, path: str | Path):
        return cls(Path(path).read_text())

    def render(self, **variables: str) -> str:
        rendered = self.template

        for key, value in variables.items():
            # In an f-string, "{{{{" is a literal "{{", so this matches "{{ key }}".
            rendered = rendered.replace(f"{{{{ {key} }}}}", value)

        return rendered

The render method lets us interpolate variables at runtime, as shown below.

>>> sample_template="Hello {{ name }}"
>>> pm = PromptManager(sample_template)
>>> pm.render(name="Bob")
'Hello Bob'
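
One thing plain string replacement will not do is warn you about placeholders you forgot to fill in. A minimal sketch of such a check follows. find_unrendered is a hypothetical helper, and the regex assumes placeholder names consist of word characters only.

```python
import re

# Matches "{{ name }}" and captures the name; assumes word-character names.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def find_unrendered(text: str) -> list[str]:
    """Return placeholder names still present in a rendered prompt."""
    return PLACEHOLDER.findall(text)


rendered = "Hello Bob, your tools are: {{ tools_registry }}"
print(find_unrendered(rendered))  # ['tools_registry']
print(find_unrendered("Hello Bob"))  # []
```

A check like this could run right after render so that a typo in a variable name fails loudly instead of reaching the LLM.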

Testing the PromptManager

Let us write a test to ensure that prompts can be rendered correctly.

1. Writing the tests

Create tests/test_prompt_manager.py:

import pytest
from src.prompt_manager import PromptManager


@pytest.fixture
def sample_template():
    return """Replace {{ this }} with {{ that }}"""


def test_render(sample_template):
    pm = PromptManager(template=sample_template)
    rendered = pm.render(this="that", that="this")
    assert rendered == """Replace that with this"""

2. Running the tests

python -m pytest -v

Retrieval - building a simple DocStore

Out in the wild, most RAG systems retrieve context from a vector database, but for our purposes we will build something basic: a simple document store that lets us index a document and search within it.

Create a file called src/doc_store.py:

class DocStore:
    def __init__(self):
        self.chunks = []

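    def index(self, filepath: str) -> int:
        with open(filepath, "r", encoding="utf-8") as f:
            content = f.read()

        # Split into paragraphs on blank lines; drop empty chunks (e.g. from trailing newlines)
        paragraphs = content.lower().split("\n\n")
        self.chunks = [p.strip() for p in paragraphs if p.strip()]
        return len(self.chunks)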

    def search(self, query: str, k: int = 5) -> list[str]:
        matches = []
        terms = query.lower().split()

        for chunk in self.chunks:
            if any(term in chunk for term in terms):
                matches.append(chunk)

        return matches[:k]

The DocStore provides two methods:

index splits the document into paragraph chunks and lowercases them.
search does a case-insensitive substring search on the chunks and returns the first k that contain any of the query terms.
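
Because the search checks whether each term appears anywhere in a chunk, it matches substrings and not whole words. The sketch below reproduces the matching logic on an in-memory list to show the effect. search_chunks and the sample chunks are illustrative stand-ins for DocStore.search and a real document.

```python
def search_chunks(chunks: list[str], query: str, k: int = 5) -> list[str]:
    """Return the first k chunks containing any query term as a substring."""
    terms = query.lower().split()
    matches = [chunk for chunk in chunks if any(term in chunk for term in terms)]
    return matches[:k]


chunks = [
    "elephants are herbivores.",
    "ants live in colonies.",
    "whales live in the ocean.",
]

# "ant" is a substring of both "elephants" and "ants".
print(search_chunks(chunks, "ant"))

# Results come back in document order, not ranked by relevance.
print(search_chunks(chunks, "live ocean", k=1))
```

This is good enough for a small document, but it is one reason real systems tend to use tokenization or embeddings.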

Testing the DocStore

As usual, we write a few tests to ensure everything works as expected. We will test against a real file so our assertions are grounded in known content.

1. Creating a test fixture

Create data/elephants.txt with the following content.

The African elephant is the largest land animal on Earth.
Adult males can weigh up to 6,350 kilograms and stand 3 to 4 meters tall at the shoulder.
Elephants have large ears that help them regulate body temperature in hot climates.
Their trunks contain over 40,000 muscles and can lift objects weighing up to 350 kilograms.

Elephants are herbivores that consume between 150 to 300 kilograms of food per day.
They spend up to 16 hours daily eating grasses, leaves, bark, and fruit.
Due to their massive size, elephants require vast amounts of water and can drink up to 190 liters in a single day.

African Elephants live in matriarchal family groups led by the oldest female.
These herds typically consist of related females and their offspring.
Male elephants leave the herd when they reach puberty and either live alone or form loose bachelor groups.

Elephants communicate using low-frequency sounds called infrasound that travel several kilometers.
They also use body language, touch, and scent signals to communicate with each other.
Their exceptional memory helps them remember water sources and recognize other elephants after years of separation.

This gives us 4 chunks (4 paragraphs), with "elephant" in every chunk and "matriarchal" in exactly one — useful for precise assertions.

2. Writing the tests

Create tests/test_doc_store.py:

import pytest
from src.doc_store import DocStore


@pytest.fixture
def db():
    db = DocStore()
    db.index("./data/elephants.txt")
    return db


def test_index():
    db = DocStore()
    # elephants.txt has 4 paragraphs
    assert db.index("./data/elephants.txt") == 4
    assert len(db.chunks) == 4


def test_search(db):
    # "matriarchal" appears in exactly 1 paragraph
    search_results = db.search(query="matriarchal", k=3)
    assert len(search_results) == 1

    # search is case-insensitive
    search_results = db.search(query="Matriarchal", k=3)
    assert len(search_results) == 1

    # "elephant" appears in all 4 paragraphs but k=1 should limit it to 1
    search_results = db.search(query="elephant", k=1)
    assert len(search_results) == 1

    search_results = db.search(query="nonexistent", k=3)
    assert len(search_results) == 0

    # multi-term query matches chunks containing any term
    search_results = db.search(query="elephant matriarchal", k=10)
    assert len(search_results) == 4

3. Running the tests

python -m pytest -v

You should see all tests pass.

Tools

Create src/tools.py to define the available tool and its signature:

from src.doc_store import DocStore


TOOLS_REGISTRY = {
    "search_document": {"query": "", "k": 5},
}


def search_document(store: DocStore, query: str, k: int = 5) -> list[str]:
    """Return up to k chunks that contain any of the query terms."""
    return store.search(query, k)

TOOLS_REGISTRY describes the tools and their expected arguments. This gets injected into the system prompt so the LLM knows what it can ask for.
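
Since the model fills in the arguments, it may omit one or invent an extra. Below is a sketch of how the registry could double as a source of defaults and a filter for unknown keys. normalize_args is a hypothetical helper, not part of the code in this post.

```python
TOOLS_REGISTRY = {
    "search_document": {"query": "", "k": 5},
}


def normalize_args(tool_name: str, args: dict) -> dict:
    """Fill missing arguments from the registry defaults and drop unknown keys."""
    if tool_name not in TOOLS_REGISTRY:
        raise ValueError(f"Unknown tool: {tool_name}")
    defaults = TOOLS_REGISTRY[tool_name]
    # Keep only keys the registry knows about; fall back to the default otherwise.
    return {key: args.get(key, default) for key, default in defaults.items()}


print(normalize_args("search_document", {"query": "trunk"}))
# {'query': 'trunk', 'k': 5}
print(normalize_args("search_document", {"query": "ears", "k": 2, "lang": "en"}))
# {'query': 'ears', 'k': 2}
```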

LLM Client

Next we create a src/llm_client.py to handle all communication with the LLM:

import json
from dataclasses import dataclass
from typing import Any, Literal

from openai import OpenAI


@dataclass
class Message:
    role: Literal["user", "assistant"]
    content: str
    tool_call: dict[str, Any] | None = None


class LLMClient:
    def __init__(self, client: OpenAI, system_prompt: str):
        self.client = client
        self.system_prompt = system_prompt

    def invoke(self, conversation: list[Message]) -> Message:
        messages = [{"role": "system", "content": self.system_prompt}]

        for message in conversation:
            messages.append({"role": message.role, "content": message.content})

        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            response_format={"type": "json_object"},
        )
        result = response.choices[0].message.content
        try:
            # content may be None; json.loads(None) raises TypeError
            parsed_result = json.loads(result)
        except (TypeError, json.JSONDecodeError) as err:
            raise ValueError(f"Expected valid JSON from LLM, got: {result}") from err
        return Message(
            role=parsed_result.get("role", "assistant"),
            content=parsed_result.get("content") or "",
            tool_call=parsed_result.get("tool_call"),
        )

There are a few things to pay attention to here.

The Message dataclass encapsulates each conversation turn.

Both the user and the assistant exchange Messages, each setting the appropriate role attribute.
A Message also carries a tool_call attribute that the assistant can use to signal that it wants to use a tool.

The invoke method generates a response from the LLM. Note that the API call receives three things:

The model to use to generate the response.
The system prompt, followed by the entire "history" of Messages from previous conversation turns.
The response format, which asks the API for a JSON object.

The LLMClient formats the conversation, calls the API, and parses the JSON response back into a Message ready for the next conversation turn.
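
Since LLMClient only touches client.chat.completions.create and reads response.choices[0].message.content, it can be exercised without a network call by passing in a stand-in object. The sketch below builds such a stand-in with types.SimpleNamespace. FakeCompletions and make_fake_client are hypothetical names, and the canned reply is made up for illustration.

```python
import json
from types import SimpleNamespace


class FakeCompletions:
    """Mimics client.chat.completions: records calls and returns a canned reply."""

    def __init__(self, reply: dict):
        self.reply = reply
        self.calls = []

    def create(self, **kwargs):
        self.calls.append(kwargs)
        message = SimpleNamespace(content=json.dumps(self.reply))
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])


def make_fake_client(reply: dict) -> SimpleNamespace:
    """Build an object shaped like the parts of OpenAI() that LLMClient uses."""
    return SimpleNamespace(chat=SimpleNamespace(completions=FakeCompletions(reply)))


canned = {"role": "assistant", "content": "Hello!", "tool_call": None}
client = make_fake_client(canned)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hi"}],
    response_format={"type": "json_object"},
)
print(json.loads(response.choices[0].message.content)["content"])  # Hello!
print(len(client.chat.completions.calls))  # 1
```

Passing make_fake_client(...) as the client argument of LLMClient should let a test assert on the parsed Message and on the recorded messages list without spending any API credits.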

Agentic Loop

Now we put it all together in main.py.

1. Imports

Let's add all the imports first and get it out of the way.

# main.py

import json
from pathlib import Path

from dotenv import load_dotenv
from openai import OpenAI

from src.doc_store import DocStore
from src.llm_client import LLMClient, Message
from src.prompt_manager import PromptManager
from src.tools import TOOLS_REGISTRY, search_document

2. System Prompt

Create the System Prompt by combining the prompt, response format and tools registry that we defined earlier.

# main.py

RESPONSE_FORMAT = {
    "role": "assistant",
    "content": "",
    "tool_call": {"tool_name": "", "args": {}},
}

PROMPT_FILE = Path(__file__).parent / "prompts" / "system_prompt.md"

system_prompt = PromptManager.from_file(PROMPT_FILE).render(
    response_format=json.dumps(RESPONSE_FORMAT),
    tools_registry=json.dumps(TOOLS_REGISTRY),
)

3. The LLM Wrapper

Next, we initialize the DocStore and the LLM client. The load_dotenv function loads the OPENAI_API_KEY from the .env file.

# main.py

docs = DocStore()
_ = docs.index("data/elephants.txt")

_ = load_dotenv()
client = OpenAI()

llm = LLMClient(client, system_prompt)

4. handle_turn

This is where we create two important "abilities" of our chatbot: memory and tool calling.

The handle_turn function handles one "full" turn of the conversation:

It appends the user message to the conversation list, which serves as our agent's memory.
It calls the LLM, runs any requested tool, then calls the LLM again with the result.

The conversation list is a mutable object shared with the caller, so every turn accumulates messages in the caller's list, giving the LLM the full history on each call.
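
The shared-list behaviour is plain Python: a function that appends to a list it was given changes the caller's list too. Here is a tiny sketch, with add_turn as a hypothetical stand-in for handle_turn:

```python
def add_turn(conversation: list[dict], text: str) -> None:
    """Append a user message; the caller's list is mutated in place."""
    conversation.append({"role": "user", "content": text})


history: list[dict] = []
add_turn(history, "Hi, my name is Al")
add_turn(history, "What is my name?")

# Both messages are visible here because add_turn never copied the list.
print(len(history))  # 2
print(history[0]["content"])  # Hi, my name is Al
```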

# main.py

def handle_turn(
    query: str,
    conversation: list[Message],
    llm_client: LLMClient,
    doc_store: DocStore,
) -> Message:
    conversation.append(Message(role="user", content=query))
    ai_msg = llm_client.invoke(conversation)
    conversation.append(ai_msg)

    # An empty tool_name means the model answered directly, so no tool is run.
    if ai_msg.tool_call and ai_msg.tool_call.get("tool_name"):
        tool_name = ai_msg.tool_call["tool_name"]
        tool_args = ai_msg.tool_call.get("args", {})

        if tool_name == "search_document":
            tool_result = search_document(doc_store, **tool_args)
        else:
            tool_result = f"Unknown tool: {tool_name}"

        # Feed the tool output back as a user message, using the prefix the system prompt expects.
        tool_msg = Message(role="user", content=f"Tool result from {tool_name}: {tool_result}")
        conversation.append(tool_msg)

        # Second LLM call: answer the original question from the tool result.
        final_msg = llm_client.invoke(conversation)
        conversation.append(final_msg)
        return final_msg

    return ai_msg
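
With a single tool, the if/else above is fine. If more tools get added, one common alternative is a dictionary that maps tool names to functions. The sketch below shows the idea with a stubbed search function. TOOL_HANDLERS, run_tool, and fake_search are hypothetical names and not part of the code in this post.

```python
from typing import Callable


def fake_search(query: str, k: int = 5) -> list[str]:
    """Stub standing in for a real document search."""
    corpus = ["elephants have large ears.", "elephants are herbivores."]
    return [chunk for chunk in corpus if query.lower() in chunk][:k]


# Map each tool name to the callable that implements it.
TOOL_HANDLERS: dict[str, Callable[..., list[str]]] = {
    "search_document": fake_search,
}


def run_tool(tool_name: str, args: dict) -> str:
    """Run a tool and always return text that can go back to the LLM."""
    handler = TOOL_HANDLERS.get(tool_name)
    if handler is None:
        return f"Unknown tool: {tool_name}"
    try:
        return str(handler(**args))
    except TypeError as err:
        # Raised when the model sends missing or unexpected arguments.
        return f"Bad arguments for {tool_name}: {err}"


print(run_tool("search_document", {"query": "ears"}))
print(run_tool("delete_document", {}))  # Unknown tool: delete_document
```

Returning an error string instead of raising keeps the loop alive and gives the model a chance to explain what went wrong.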

5. Agent Loop

Our Agent Loop is a simple while loop.

# main.py

conversation = []

while True:
    user_input = input("User: ")

    if user_input.strip().lower() == "exit":
        break

    response = handle_turn(user_input, conversation, llm, docs)
    print(response.content)

Voila! We have a working chatbot.

Running the App

python main.py

Type exit to quit.

Here is Trivi-Al chatting with Al, who likes to learn about elephants:

User: Hi, my name is Al
Hello Al! How can I assist you today?
User: How strong are elephants?
Elephants are incredibly strong animals. Their trunks contain over 40,000 muscles and are capable of lifting objects weighing up to 350 kilograms. Adult male African elephants can weigh up to 6,350 kilograms, highlighting their massive build and strength.
User: What is my name, I seem to have forgotten.
Your name is Al.
User: exit

What We Built

In this post we built an LLM agent from scratch — no LangChain, no agent frameworks, just Python.

Let's do a quick recap of the various pieces we built:

DocStore — a retrieval system that chunks a document and searches it using keyword matching, giving the LLM grounded context instead of relying on its training data.
PromptManager — a lightweight template engine that injects the tool registry and response schema into the system prompt at runtime.
Custom tool-calling protocol — rather than using OpenAI's built-in function calling, we defined our own JSON schema so the LLM can express tool requests. This makes the protocol explicit and easy to inspect.
Agentic loop — a handle_turn function that gives the agent memory (the conversation list) and the ability to act (fetch context, then answer).

The result is a chatbot that can answer questions about any document you give it, remember the conversation, and tell you when it doesn't know something.

Hopefully, this exercise has given you a deeper understanding of how these Agents actually work under the hood.

If you want to find a fuller implementation of Trivi-Al, check out the repo.