Skip to content

LukeSantossz/sb100_agents

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

348 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Python FastAPI CI License

SmartB100 — Agriculture RAG Agent

Self-hostable RAG assistant for agricultural technical support: it answers questions grounded in your own PDF manuals, adapts the response to the reader's expertise, and tags every answer with a continuous 0.0–1.0 semantic-entropy hallucination score so users know when to double-check.


What It Does

SmartB100 turns a folder of agricultural PDFs into a question-answering service backed by a local LLM, grounding every answer in retrieved content.

  • Grounded Q&A — indexes PDF manuals into a vector database and answers questions from the retrieved chunks, not from model memory.
  • Expertise-adaptive answers — the same RAG context is rendered for beginner, intermediate, or expert readers via profile-aware system prompts.
  • Hallucination scoring — semantic entropy over multiple candidate answers produces a continuous 0.0–1.0 score flagging low-confidence responses.
  • Authenticated API — bcrypt password hashing + JWT-gated /chat, with per-IP rate limiting on login and registration.
  • Runs fully local — Ollama serves both chat and embeddings; no paid API key is required to operate the core pipeline.

What It Is

SmartB100 is a REST API (FastAPI) with an optional Gradio web UI that converts a corpus of agricultural PDFs into a source-grounded chat service. It targets agricultural extension workers and agronomists who need fast, reliable answers about crop management, soil treatment, pest control, and planting schedules — without manually searching dense technical manuals.

Tech Stack

Layer Technology
Language Python 3.12+
API / Runtime FastAPI, Uvicorn
UI Gradio
Vector DB Qdrant (archives_v2, 768-dim embeddings)
Inference Ollama — llama3.2:3b (chat) + nomic-embed-text (embeddings)
Verification Multi-provider semantic entropy (Groq / Ollama / OpenRouter)
Persistence SQLite (auth + conversation history)
Auth bcrypt + JWT (passlib, slowapi rate limiting)
Testing / CI pytest, ruff, mypy --strict, GitHub Actions
Packaging uv, Docker (multi-stage Dockerfile.api)

Architecture

Architectural Style

SmartB100 is a modular monolith with composed deployment:

  • One application process. api/main.py loads every domain module (api/routes/*, core/*, retrieval/*, memory/*, generation/*, verification/*, database/*) into a single FastAPI runtime. Inter-module communication is function calls inside the same Python interpreter — no RPC, no message broker, no queue.
  • Eight internal layers, one binary. The folder boundary is a convention for testability and review; it is not a network boundary.
  • External processes are limited to genuine third-party services. No domain code lives outside the API process.

External components (each runs in its own process):

Component Role Containerized? Protocol
Qdrant Vector DB (archives_v2 collection, 768-dim embeddings) Yes — docker compose --profile infra HTTP REST :6333 + gRPC :6334
Ollama LLM chat (llama3.2:3b) + embeddings (nomic-embed-text) No — runs on the host HTTP REST :11434 via OLLAMA_HOST
SQLite Auth + conversation history No (filesystem) Bind-mount ./smartb100_v2.db:/app/smartb100_v2.db

Client tier (two paths):

  • Gradio UI (ui/chat_ui.py) — stateless HTTP client containerized via docker compose --profile app. Calls only POST /chat. Does not import any domain module — it is a UI shell, not a microservice.
  • Direct HTTPcurl, scripts, future mobile clients. Same endpoint, same JSON contract.

Why not microservices. The RAG pipeline (embed → search → generate → verify) shares the same ChatRequest/ChatResponse model and runs synchronously within a single request. Splitting any step into its own service would add network latency between calls that are currently in-process, plus contract-versioning overhead, without delivering independent scaling benefit at current load.

When to reconsider. If verification/ (entropy sampling, the slowest step) needs to scale independently of generation/, or if the workload grows beyond ~500 req/s, the verification gate is the natural extraction point — it already has a clean async-friendly interface (evaluate(question, context, answer)).

flowchart TD
    subgraph CLIENT["Client"]
        GRADIO["Gradio UI\n:7860"]
        CURL["curl / HTTP"]
    end

    subgraph API["API Layer"]
        ENDPOINT["POST /chat"]
        AUTH["POST /auth/*"]
        HEALTH["GET /health"]
    end

    subgraph PIPELINE["RAG Pipeline"]
        EMBED["Embedder\nOllama nomic-embed-text\n768 dims"]
        SEARCH["Vector Search\nCosine Similarity"]
        MEMORY["ConversationBuffer\nFIFO deque (maxlen=10)"]
        PROFILE["Profiling\nbeginner | intermediate | expert"]
        LLM["LLM Generator\nOllama llama3.2:3b"]
    end

    subgraph VERIFY["Verification"]
        ENTROPY["Semantic Entropy\nMulti-provider (Groq/Ollama/OpenRouter)"]
        GATE["Verification Gate\nRetry + Fallback"]
    end

    subgraph DATA["Data Layer"]
        QDRANT[("Qdrant\n:6333\narchives_v2")]
        SQLITE[("SQLite\nusers / conversations")]
    end

    GRADIO -->|HTTP JSON| ENDPOINT
    CURL -->|HTTP JSON| ENDPOINT

    ENDPOINT --> EMBED
    EMBED --> SEARCH
    SEARCH --> QDRANT

    ENDPOINT --> MEMORY
    MEMORY -.->|history| LLM
    SEARCH -->|context| PROFILE
    PROFILE --> LLM

    LLM --> GATE
    GATE -->|verification_enabled| ENTROPY
    ENTROPY -->|score| GATE
    GATE -->|retry if high entropy| LLM

    GATE --> RESPONSE["ChatResponse\n{answer, hallucination_score}"]

    AUTH --> SQLITE
Loading

RAG Pipeline Flow:

sequenceDiagram
    participant C as Client
    participant A as API /chat
    participant E as Embedder
    participant Q as Qdrant
    participant G as LLM Generator
    participant V as Verification Gate

    C->>A: POST /chat {session_id, question, profile}
    A->>E: generate_embedding(question)
    E-->>A: vector[768]
    A->>Q: search_context(vector, top_k=3)
    Q-->>A: chunks[]
    A->>G: generate(question, context, history, profile)
    G-->>A: answer
    alt verification_enabled
        A->>V: evaluate(question, context, answer)
        V-->>A: {answer, hallucination_score}
    end
    A-->>C: ChatResponse {answer, hallucination_score}
Loading

Deployment Topology:

flowchart LR
    subgraph CLIENTS["Clients"]
        direction TB
        BROWSER["Browser"]
        SCRIPTS["curl / scripts"]
    end

    subgraph HOST["Developer host"]
        OLLAMA["Ollama :11434<br/>llama3.2:3b + nomic-embed-text"]
    end

    subgraph COMPOSE["docker-compose stack"]
        direction TB
        subgraph INFRA["profile: infra"]
            QDRANT[("Qdrant<br/>:6333 REST / :6334 gRPC")]
        end
        subgraph APP["profile: app"]
            API["FastAPI :8000<br/>monolith binary"]
            GRADIO["Gradio :7860"]
            SQLITE[("SQLite<br/>bind-mount")]
        end
    end

    BROWSER -->|HTTP| GRADIO
    SCRIPTS -->|HTTP /chat| API
    GRADIO -->|HTTP /chat| API
    API -->|HTTP REST| QDRANT
    API -->|HTTP /api/chat,<br/>/api/embeddings| OLLAMA
    API -. SQLAlchemy .-> SQLITE
Loading

The first two diagrams are logical (what runs); the last is topological (where it runs). They complement, not duplicate.

Engineering Decisions

A curated index of the most significant decisions; each row links the ADR that holds the full rationale, alternatives, and consequences.

Decision Alternative considered Rationale
Modular monolith Microservice per RAG step Shared request model, synchronous pipeline — ADR-0001
Semantic entropy for the hallucination score Binary classifier / LLM-as-judge Continuous 0.0–1.0 score with no labeled data — ADR-0002
Local-first inference via Ollama Hosted embeddings / larger hosted model Offline, free, stable embedding space — ADR-0003
Multi-provider verification dispatch OpenAI-only verification Removes the hard paid dependency — ADR-0004
Synchronous /chat handler async def handler Threadpool keeps the event loop free — ADR-0005
bcrypt + JWT gate on /chat Session cookies / static API keys Stateless, instantly revocable auth — ADR-0006
SQLite for persistence PostgreSQL Zero-ops at single-node scale — ADR-0007
Deepagents on LangGraph as the agent substrate Raw LangGraph / hand-rolled loop Built-in planning, sub-agents, filesystem; isolated behind agent/ADR-0008
Hosted Groq (GPT-OSS) for the agent reasoning tier Larger local model / Claude No local GPU; reuses the default verification provider; reliable tool-calling — ADR-0009

Getting Started

Prerequisites

Installation

git clone https://github.com/LukeSantossz/sb100_agents.git
cd sb100_agents

# Pull inference models
ollama pull llama3.2:3b && ollama pull nomic-embed-text

# Install dependencies
uv sync                            # or: python -m venv .venv && .venv/bin/pip install -e .

# Configure environment (defaults work for local dev)
cp .env.example .env

Running

# 1. Start Qdrant
docker compose --profile infra up -d

# 2. Index documents (first run only)
.venv/bin/python database/semantic_chunker.py index ./archives/

# 3. Start API
.venv/bin/python -m uvicorn api.main:app --reload

# 4. (Optional) Start Gradio UI
.venv/bin/python ui/chat_ui.py

Windows users: replace .venv/bin/python with .venv\Scripts\python.exe, or run .\start.bat / .\start.ps1 after installation.

Full Docker deployment: docker compose --profile infra --profile app up -d. The compose stack uses a multi-stage Dockerfile.api (no build-essential in the final image), healthchecks that gate depends_on ordering, and log rotation (max-size: 10m, max-file: 3). On Linux the OLLAMA_HOST override is required — see SETUP.md §9.1. See SETUP.md for remote Qdrant configuration.

Verify the stack is up:

curl http://localhost:6333/healthz           # Qdrant: "healthz check passed"
curl http://localhost:8000/health            # API: {"status":"ok"}

Tests

pytest tests/ -m "not requires_infra"   # full suite, infra-bound tests excluded (CI default)
ruff check .                                           # lint
mypy retrieval/ generation/ memory/ --strict          # type check

API Reference

Endpoint Description
POST /chat RAG query (requires JWT); returns answer with hallucination score
POST /auth/register Creates new user (rate-limit 3/hour per IP)
POST /auth/token OAuth2 login; returns JWT (rate-limit 5 / 15min per IP)
GET /health API health status

POST /chat:

TOKEN=$(curl -s -X POST "http://localhost:8000/auth/token" \
  -d "username=demo&password=long-enough-pw" | jq -r .access_token)

curl -X POST "http://localhost:8000/chat" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "session_id": "demo-session",
    "question": "Qual a epoca ideal de plantio da soja?",
    "profile": {"name": "User", "expertise": "beginner"}
  }'
# {"answer": "...", "hallucination_score": 0.18}

Without the Authorization header the API returns 401 Unauthorized.

Request Field Type Description
session_id string UUID for conversation continuity
question string User query
profile.expertise enum beginner | intermediate | expert
Response Field Type Description
answer string Generated response adapted to expertise level
hallucination_score float 0.0 (grounded) to 1.0 (likely hallucinated)

Project Structure

sb100_agents/
├── api/                            # FastAPI backend
│   ├── main.py                     # App entry (CORS + routers + lifespan)
│   └── routes/                     # chat.py, auth.py, health.py
├── core/                           # Pydantic schemas & configuration
├── retrieval/                      # Embeddings + Qdrant vector search
├── generation/                     # LLM response generation
├── memory/                         # Conversation buffer (FIFO)
├── verification/                   # Semantic entropy + verification gate
├── database/                       # SQLite + PDF semantic chunking
├── eval/                           # 5-step evaluation pipeline
├── ui/                             # Gradio chat interface
├── tests/                          # Unit + integration tests
├── .github/workflows/              # CI + Claude Code automation
├── Dockerfile.api                  # Multi-stage build (builder + runtime)
├── docker-compose.yml              # Qdrant (infra) + API+Gradio (app) with healthchecks
└── pyproject.toml

Project Status

Status: MVP complete — actively hardened.

Done

  • PDF indexing pipeline (semantic chunking → Qdrant)
  • RAG chat with expertise-adaptive responses
  • Semantic-entropy hallucination scoring (multi-provider)
  • bcrypt + JWT auth with per-IP rate limiting
  • Dockerized deployment (infra + app profiles, healthchecks, log rotation)
  • 5-step offline evaluation pipeline (eval/)
  • Test suite (205 tests, ~83% coverage) with CI: ruff + mypy --strict + pytest

Pending

  • Raise critical-module coverage to a 70% CI gate
  • Optional Langfuse tracing for the RAG pipeline
  • Hybrid search (dense + sparse vectors, RRF fusion)
  • LangGraph migration (ReAct agent + agricultural intent filter)
  • Claim verification (atomic decomposition + RAG fact-checking)
  • Streaming responses (SSE)

The pending work is sequenced into delivery Waves in the agentic migration roadmap.

Known Issues & Limitations

  • CPU inference latencyllama3.2:3b with RAG context can take minutes per answer on CPU-only hosts. A configurable CHAT_TIMEOUT (default 600s) plus transient-error retries exist for this reason; the limitation disappears with a GPU or a hosted provider.
  • Single-node persistence — SQLite is single-writer. It fits one API process but does not support horizontal scaling; PostgreSQL is the migration path once writes contend.
  • Windows + Docker bind mount — if ./smartb100_v2.db does not already exist as a file, Docker Desktop may create it as a directory. Create the empty file before docker compose --profile app up; the API raises an explicit RuntimeError if it finds a directory.
  • Coverage gate is conservative — the CI coverage threshold is currently below the 70% target on critical modules. Raising it is in progress (see Project Status).
  • Breaking auth change — users created before the bcrypt + JWT gate (SHA-256 hashes) must be re-registered.
  • Verification adds latency — entropy sampling generates multiple candidate answers. It is opt-in via VERIFICATION_ENABLED and falls back to a neutral score on failure rather than blocking the answer.

Contributing

See CONTRIBUTING.md. Quick summary: fork, branch (type/NNN-short-description, NNN = issue number), tests, Conventional Commits, PR.

License

MIT License

About

RAG-based agricultural Q&A system using FastAPI, Qdrant vector database, and Ollama LLM for semantic document retrieval and response generation from PDF knowledge bases.

Topics

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages