Flowgent

A complete solution for the Aivar Innovations Discovery Agent challenge, covering all three levels with verified results and 58 passing unit tests.

Companies often lack a clear picture of which software tools they actually use, and which of those tools should be exchanging data but are not. Flowgent reads a company's existing documents and produces that picture in three steps: a verified system inventory, a prioritized list of missing integrations, and runnable Python connector code for the highest-impact gaps. Every output is grounded in source text with no hallucinations.

This repository includes samples/apex_digital/, a nine-document sample corpus for a fictional company. The results below came from running the full pipeline against that dataset:

  DEMO RESULTS  (samples/apex_digital, included in this repo)
  ────────────────────────────────────────────────────────────
  Level 1  |  27 systems found   |  0 hallucinations  |  9/9 docs
  Level 2  |  14 gaps identified |  7 workflows mapped
  Level 3  |  3 connectors built |  3/3 pass unit tests
  ────────────────────────────────────────────────────────────

What This Project Does

Most companies use dozens of software tools: Slack, QuickBooks, Shopify, and so on. Nobody has a clean list of what is actually installed, and nobody knows which tools should be talking to each other but are not, or how many hours of manual work that gap costs each week.

Flowgent solves this in three steps:

Level 1: Systems Discovery reads company documents (PDFs, spreadsheets, Word files, images, HTML exports) and builds a clean, verified list of every software tool mentioned. If a tool cannot be cited with a verbatim quote from a source file, it is not included.
Level 2: Gap Analysis compares the tool inventory against automation goals, then identifies every missing connection: which integrations do not exist, how hard they would be to build, and how much manual time they cost.
Level 3: Code Generation writes actual, runnable Python code to connect the highest-priority tool pairs. Not boilerplate: real auth handling, pagination, retry logic, typed errors, and unit tests.

The Hallucination Problem and How It Is Solved

The core design rule: the LLM proposes, code disposes.

AI models make things up. They will confidently name a software system that never appeared in any document, or invent an API endpoint that does not exist. Every level of this pipeline treats the LLM as the only untrusted component and passes every LLM output through a deterministic code guard rail before accepting it:

Level	LLM proposes	Code verifies by
1	A system name and a verbatim quote from the source document	Substring-searching the actual source text. Rejects anything not physically present in the file.
2	Which systems are involved in a workflow and whether they are connected	Cross-referencing against the verified Level 1 inventory. Marks a link live only if a stored evidence quote proves it.
3	What entities and fields a connector should map	Rendering code from hand-written, pre-tested templates. The LLM never writes auth, retry, or pagination logic.

Confidence scores and effort estimates are computed by code, never by the model. The full reasoning trail is written to pipeline_log.jsonl and every decision is auditable.
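
The Level 1 row of the table above comes down to a literal substring check. The sketch below is a minimal illustration of that idea, not the project's actual verification.py: the `Finding` shape and the `verify_findings` name are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One LLM-proposed system mention (hypothetical shape for this sketch)."""
    system_name: str
    quote: str


def verify_findings(source_text: str, findings: list) -> list:
    """Keep only findings whose quote is a literal substring of the source."""
    verified = []
    for finding in findings:
        quote = finding.quote.strip()
        # An empty quote would match any text, so it is rejected explicitly.
        if quote and quote in source_text:
            verified.append(finding)
    return verified


source = "Invoices are exported from QuickBooks every Friday."
proposed = [
    Finding("QuickBooks", "exported from QuickBooks every Friday"),
    Finding("NetSuite", "synced nightly to NetSuite"),  # never appears in source
]
kept = verify_findings(source, proposed)
print([f.system_name for f in kept])
```

The check is deliberately strict: a paraphrased or lightly reworded quote fails, which trades some recall for the guarantee that every accepted finding can be located in the file.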

System Architecture

Pipeline Overview

Level 1: Systems Discovery

Input: Any folder of business documents (PDF, DOCX, XLSX, HTML, PNG, CSV, and more)

What happens inside:

Format Router detects the real file type using magic bytes, not the file extension. A .pdf that is actually a Word document gets caught and routed correctly.
Text Chunker splits long documents into overlapping segments so no mention of a system is lost at a page boundary.
LLM Extraction asks the model to identify software systems and return a verbatim quote from the chunk as evidence. This output is untrusted.
Quote Guard Rail checks that the returned quote is a literal substring of the source chunk. If it is not, the finding is rejected. This is the hallucination firewall.
Taxonomy Resolver maps 200+ known aliases ("SFDC", "Salesforce CRM", "salesforce.com") to canonical system names so duplicates across documents collapse correctly.
Deterministic Scorer calculates a confidence score in code based on how many independent documents mention the system, how strong the evidence quote is, and what category of system it belongs to. The model never assigns its own confidence.
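
The overlapping split performed by the Text Chunker can be sketched as a sliding window. This is an illustrative version only: the function name, the character-based sizes, and the default values are assumptions, and the real chunking.py may split on tokens or sentences instead.

```python
def chunk_text(text: str, chunk_size: int = 2000, overlap: int = 200) -> list:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window reaches the end, so no redundant tail chunk is emitted.
        if start + chunk_size >= len(text):
            break
    return chunks


# A mention that straddles a window boundary still appears whole in one chunk,
# provided the overlap is at least as long as the mention.
sample = "x" * 7 + "Slack" + "y" * 8
pieces = chunk_text(sample, chunk_size=10, overlap=5)
print(pieces)
```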

Output: system_inventory.json and report.md

Level 2: Gap Analysis

Input: system_inventory.json from Level 1, and optionally a use_cases.json file listing the automation workflows that matter to the business.

Note on use_cases.json: This file is completely optional. Without it, the agent infers generic workflow goals from the discovered system inventory. Providing it focuses the analysis on the specific integrations the business actually cares about and generally produces more relevant gap rankings. See samples/apex_digital/use_cases_apex.json for the expected format.

What happens inside:

Workflow Mapper asks the LLM which systems are involved in each use case and what data flows between them. Output is untrusted.
Inventory Guard Rail cross-references every system the LLM named against the verified Level 1 inventory. Any system not in the inventory is rejected, preventing the model from inventing connections to systems that do not exist.
Connection Checker marks each connection as live, missing, or blocked. A connection is marked live only if an evidence quote in the inventory proves it. Blocked means a required prerequisite system is absent.
Effort and ROI Scorer estimates implementation effort in hours and computes a business impact multiplier using a deterministic formula. The model never estimates effort.
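
The Connection Checker rules above can be expressed as a small pure function. The sketch assumes both endpoints have already passed the Inventory Guard Rail, models evidence as a set of directed (source, target) pairs, and uses hypothetical names throughout; the example systems are illustrative only.

```python
from enum import Enum


class ConnectionStatus(str, Enum):
    LIVE = "live"
    MISSING = "missing"
    BLOCKED = "blocked"


def classify_connection(source, target, prerequisites, inventory, evidenced_links):
    """Classify one proposed link using inventory evidence only, never model output.

    inventory:        set of verified system names from Level 1
    evidenced_links:  set of (source, target) pairs backed by a stored quote
    prerequisites:    systems that must exist before the link can be built
    """
    # Blocked: a required prerequisite system is absent from the inventory.
    if any(system not in inventory for system in prerequisites):
        return ConnectionStatus.BLOCKED
    # Live: only when stored evidence proves the link already exists.
    if (source, target) in evidenced_links:
        return ConnectionStatus.LIVE
    return ConnectionStatus.MISSING


inventory = {"Shopify", "QuickBooks", "Slack"}
evidenced = {("Shopify", "Slack")}

print(classify_connection("Shopify", "Slack", [], inventory, evidenced).value)
print(classify_connection("Shopify", "QuickBooks", [], inventory, evidenced).value)
print(classify_connection("Shopify", "QuickBooks", ["Stripe"], inventory, evidenced).value)
```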

Output: gap_analysis.json and gap_analysis_report.md

Level 3: Code Generation

Input: gap_analysis.json from Level 2, specifically the top-N highest-priority gaps.

A naive implementation would ask the LLM to write a Shopify-to-QuickBooks connector. It would confidently invent endpoint paths, authentication headers, and pagination schemes that look plausible but are wrong. This fails the zero-hallucination requirement.

What happens inside:

Spec Registry Lookup (spec_registry.py) checks a curated database of real, verifiable API facts: base URLs, authentication methods (OAuth2, API key, Bearer token), and pagination styles (cursor, page-number, Link header). For any system in the registry, these facts are treated as ground truth.
LLM Spec Resolution (spec_resolver.py) asks the model to propose which entities to connect and how fields map between them. For registry systems, the model's proposed API facts are overridden by registry data. For unknown systems, the output is flagged as unverified.
Template Renderer (templates.py) generates the actual Python code from hand-written, unit-tested templates. The auth handshake, retry loop with exponential backoff, rate limiter, pagination iterator, and typed error taxonomy are written by a human engineer. The LLM never touches this logic.
Validation Guard Rail (validator.py) parses the generated code as an AST to catch syntax errors, imports the module to verify it loads, then runs the generated pytest suite in a sandboxed subprocess with a strict wall-clock timeout.
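
The "registry wins" rule from the Spec Registry Lookup and LLM Spec Resolution steps can be sketched as a merge in which registry facts overwrite whatever the model proposed. Everything named here is an assumption for illustration: `ExampleCRM`, its placeholder URL, the key names, and the `resolve_spec` function are not taken from spec_registry.py or spec_resolver.py.

```python
# Hypothetical registry entry; the system and its facts are placeholders.
REGISTRY = {
    "examplecrm": {
        "base_url": "https://api.example.com/v1",
        "auth": "oauth2",
        "pagination": "cursor",
    },
}

API_FACT_KEYS = ("base_url", "auth", "pagination")


def resolve_spec(system: str, llm_proposal: dict) -> dict:
    """Merge an LLM proposal with registry facts; registry values always win."""
    spec = dict(llm_proposal)  # copy so the caller's dict is not mutated
    facts = REGISTRY.get(system.strip().lower())
    if facts is None:
        # Unknown system: keep the model's guesses but flag them for review.
        spec["verified"] = False
        return spec
    for key in API_FACT_KEYS:
        spec[key] = facts[key]
    spec["verified"] = True
    return spec


proposal = {
    "base_url": "https://made-up.invalid/api",  # model guess, to be overridden
    "auth": "basic",
    "pagination": "page_number",
    "field_map": {"email": "contact_email"},
}
known = resolve_spec("ExampleCRM", proposal)
unknown = resolve_spec("ObscureTool", proposal)
print(known["base_url"], known["verified"], unknown["verified"])
```

Only the API facts are overwritten; the field mapping stays model-proposed in both cases, which matches the limitation noted later in this document.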

Output: outputs/connectors/<gap_id>/ per connector, plus generation_manifest.json and generation_report.md

What Each Generated Connector Contains

Each connector package lands in outputs/connectors/<gap_id>/:

File	What is inside
`connector.py`	Environment-variable auth, full CRUD per entity, transparent pagination iterator, rate limiter, retries with exponential backoff, typed error classes, and a `run()` workflow method.
`test_connector.py`	Offline unit tests covering auth headers, pagination behavior, 429 rate-limit retries, HTTP error handling, create operations, and field mapping. Minimum 7 tests per connector.
`agent.yaml`	Agent definition with system prompt, tool bindings wired to the real connector methods, workflow steps, and guardrails.
`README.md`	Per-connector setup instructions, credential environment variable list, and usage examples.
`requirements.txt`	Dependency manifest for the connector package.

Readiness scores are computed per connector:

Registry API with ground-truth facts available: approximately 85% ready
AI-inferred API with facts flagged as unverified: approximately 60% ready
Any connector that fails its own generated tests: capped at 45%
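
As a rough sketch, the three rules above reduce to a base score plus a cap. The figures are stated as approximate, so treating them as exact constants is an assumption, as are the function and parameter names.

```python
def readiness_score(registry_backed: bool, tests_passed: bool) -> int:
    """Return a readiness percentage from the two deterministic signals."""
    base = 85 if registry_backed else 60   # registry facts vs AI-inferred facts
    if not tests_passed:
        return min(base, 45)               # failing tests cap the score
    return base


for registry_backed in (True, False):
    for tests_passed in (True, False):
        print(registry_backed, tests_passed, readiness_score(registry_backed, tests_passed))
```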

Technology Stack

Component	Choice	Why
Language	Python 3.11+	Best LLM and document tooling ecosystem.
LLM provider	Groq	Low-latency inference on LPU hardware, OpenAI-compatible API.
Primary model	`llama-3.3-70b-versatile`	Strong structured extraction, reliable JSON mode.
Fallback chain	`gpt-oss-120b` then `qwen3-32b`	Independent rate-limit pools: if primary is throttled, the fallback picks up automatically.
Vision model	`llama-4-scout-17b`	Multimodal, handles scanned documents and image-only PDFs.
Schema validation	Pydantic v2	Strict type enforcement on every LLM response: malformed output fails fast.
HTTP client (connectors)	`requests`	Ubiquitous, human-reviewable, no magic.
Document parsers	PyMuPDF, python-docx, openpyxl, BeautifulSoup4	Native, battle-tested parsers for PDF, Word, Excel, and HTML.
Rate limiting	Token-bucket with exponential backoff and jitter	Handles Groq per-model rate limits gracefully at scale.
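
For readers unfamiliar with the rate-limiting row, the sketch below shows one common way to combine a token bucket with exponential backoff and jitter. It is not the code in llm.py: the class, the function, the "full jitter" variant, and all default values are assumptions chosen for illustration.

```python
import random
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock            # injectable so the bucket can be tested
        self.tokens = capacity        # start with a full bucket
        self.updated = clock()

    def try_acquire(self, cost=1.0):
        """Take `cost` tokens if available; return False instead of blocking."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    """Full-jitter backoff: a random delay below min(cap, base * 2**attempt)."""
    return min(cap, base * (2 ** attempt)) * rng()


bucket = TokenBucket(rate_per_sec=1.0, capacity=2.0)
print(bucket.try_acquire(), bucket.try_acquire(), bucket.try_acquire())
```

The jitter matters when several workers are throttled at once: without it they would all retry at the same instant and trip the limit again.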

Project Layout

FLOWGENT_CODEBASE/
│
├── main.py                          # CLI entry point: parses flags and routes to the correct pipeline
├── requirements.txt                 # All pinned Python dependencies
├── pytest.ini                       # Test runner configuration (rootdir, test discovery paths)
├── .env.example                     # Template showing every required environment variable
├── writeup.md                       # Architecture decisions and trade-offs
├── discovery_agent/                 # Core engine: all pipeline logic lives here
│   │
│   ├── config.py                    # Loads .env, validates required keys, exposes settings object
│   ├── models.py                    # Pydantic schemas for all Level 1 data structures
│   ├── observability.py             # Structured JSON logging to pipeline_log.jsonl
│   ├── llm.py                       # Resilient Groq API client: rate limiter, backoff, fallback chain, JSON repair
│   ├── utils.py                     # Shared helpers: JSON recovery, text cleaning, retry wrappers
│   ├── taxonomy.py                  # Alias resolution: maps 200+ system name variants to canonical names
│   ├── chunking.py                  # Splits document text into overlapping segments for LLM processing
│   ├── extraction.py                # Level 1 LLM prompts: asks model to identify systems and return evidence quotes
│   ├── verification.py              # Anti-hallucination guard rail: substring-checks every LLM quote against source
│   ├── aggregation.py               # Deduplicates system findings across multiple documents and chunks
│   ├── scoring.py                   # Deterministic confidence calculator: code-based, never LLM-based
│   ├── pipeline.py                  # Level 1 orchestrator: wires ingest, extract, verify, score, output
│   │
│   ├── ingestion/                   # Document format parsers, one module per file type
│   │   ├── __init__.py              # Format router: dispatches each file to the right parser using magic bytes
│   │   ├── pdf_parser.py            # PDF extraction via PyMuPDF; falls back to vision model for image-only pages
│   │   ├── docx_parser.py           # Word document extraction via python-docx
│   │   ├── xlsx_parser.py           # Excel extraction via openpyxl; handles formulas and multi-sheet files
│   │   ├── html_parser.py           # HTML extraction via BeautifulSoup4; strips tags and navigation boilerplate
│   │   ├── image_parser.py          # PNG/JPG transcription via llama-4-scout-17b vision model
│   │   └── text_parser.py           # Plain text and CSV fallback parser
│   │
│   ├── data/
│   │   └── taxonomy.json            # 200+ curated system name aliases (example: SFDC maps to Salesforce)
│   │
│   ├── gap_models.py                # Pydantic schemas for all Level 2 data structures (gaps, workflows, connections)
│   ├── gap_pipeline.py              # Level 2 orchestrator: wires inventory, map, ground, score, output
│   │
│   ├── gap_analysis/                # Level 2 logic modules
│   │   ├── __init__.py
│   │   ├── mapper.py                # LLM workflow mapper: proposes which systems connect for each use case
│   │   ├── grounder.py              # Guard rail: cross-references every LLM system claim against Level 1 inventory
│   │   ├── scorer.py                # Computes effort hours and business impact multiplier in deterministic code
│   │   └── reporter.py              # Renders gap_analysis_report.md including Mermaid integration diagrams
│   │
│   ├── codegen_models.py            # Pydantic schemas for all Level 3 data structures (connector specs, manifests)
│   ├── codegen_pipeline.py          # Level 3 orchestrator: wires gaps, resolve, render, validate, output
│   │
│   └── code_generation/             # Level 3 logic modules
│       ├── __init__.py
│       ├── spec_registry.py         # Curated database of real API facts (base URLs, auth schemes, pagination styles)
│       ├── spec_resolver.py         # Merges LLM field-mapping proposals with registry facts; registry always wins
│       ├── templates.py             # Deterministic Jinja2 templates for connector.py, tests, agent.yaml, README
│       ├── validator.py             # Guard rail: AST parse, import check, runs generated pytest in a subprocess
│       └── reporter.py              # Renders generation_report.md with readiness scores and validation status
│
├── samples/
│   ├── apex_digital/                # Nine-document demo corpus across five formats, ready to run immediately
│   │   ├── active_clients.csv
│   │   ├── billing_and_invoicing.txt
│   │   ├── client_onboarding_checklist.txt
│   │   ├── monthly_reporting_sop.txt
│   │   ├── ops_notes_slack_export.html
│   │   ├── team_handbook.txt
│   │   ├── tech_audit_notes.md
│   │   ├── tool_subscriptions.csv
│   │   └── use_cases_apex.json      # Example automation goals matching the apex_digital dataset
│   └── .gitkeep
│
├── tests/                           # 58 offline unit tests, no API calls, no network required
│   ├── test_ingestion.py            # Format parser tests across all supported file types
│   ├── test_extraction.py           # LLM extraction prompt construction and response parsing tests
│   ├── test_verification.py         # Anti-hallucination guard rail and substring-matching tests
│   ├── test_gap_analysis.py         # Gap mapper, inventory grounder, and effort scorer tests
│   └── test_codegen.py              # Connector rendering and validation tests (physically renders and executes connectors)
│
├── inputs/                          # PUT YOUR DOCUMENTS HERE before running --input ./inputs
│
└── outputs/                         # All generated artifacts appear here after a run
    ├── system_inventory.json
    ├── report.md
    ├── gap_analysis.json
    ├── gap_analysis_report.md
    ├── generation_manifest.json
    ├── generation_report.md
    ├── pipeline_log.jsonl
    └── connectors/
        └── <gap_id>/
            ├── connector.py
            ├── test_connector.py
            ├── agent.yaml
            ├── README.md
            └── requirements.txt

How to Run It

Step 1: Install

python -m pip install -r requirements.txt

Step 2: Configure API Key

cp .env.example .env

Open .env and fill in your GROQ_API_KEY. Then verify the setup works:

python main.py --check
# Expected output: "Configuration valid. API connection OK."

Step 3: Choose Your Input

Option A: Run on the included sample (no files needed)

The samples/apex_digital/ folder contains nine real-format documents for a fictional example company. Use this to see the full pipeline working immediately, with no setup beyond the API key.

python main.py --input ./samples/apex_digital --output ./outputs

Option B: Run on your own company documents

Before running, copy your documents into the ./inputs/ folder. Supported formats include PDF, Word (.docx), Excel (.xlsx), HTML, PNG, JPG, CSV, and plain text. Formats can be mixed freely: the agent detects each file's type from its content, not its extension.

python main.py --input ./inputs --output ./outputs

Step 4: Run the Full Pipeline (all three levels)

Add --level 3 to run Systems Discovery, Gap Analysis, and Code Generation in a single command. The --use-cases flag is optional but recommended: it tells the agent which workflows your business actually cares about, producing more relevant gap rankings. Without it, the agent infers generic workflow goals from the discovered inventory.

# With a use-cases file (recommended)
python main.py --input ./inputs --output ./outputs --level 3 --use-cases ./samples/apex_digital/use_cases_apex.json --max-connectors 3

# Without a use-cases file (agent infers goals from the inventory)
python main.py --input ./inputs --output ./outputs --level 3 --max-connectors 3

After the run, open report.md, gap_analysis_report.md, and generation_report.md in ./outputs/ for human-readable summaries of each level.

Step 5: Skip Already-Completed Levels

If Level 1 has already been run, there is no need to reprocess all documents to run Level 2. The standalone flags let the pipeline pick up from an existing artifact:

# Run only Level 2 using an existing inventory
python main.py --inventory ./outputs/system_inventory.json --use-cases ./samples/apex_digital/use_cases_apex.json

# Run only Level 3 using an existing gap analysis
python main.py --gaps ./outputs/gap_analysis.json --inventory ./outputs/system_inventory.json

CLI Reference

Flag	Default	What it does
`--input PATH`	None	Folder of documents to process through Level 1.
`--level, -l N`	`1`	Pipeline depth to run: `1`, `2`, or `3`.
`--use-cases PATH`	None	Optional JSON file of target automation workflows. Without it, the agent infers goals from the inventory.
`--inventory PATH`	None	Skip Level 1 and start Level 2 from an existing `system_inventory.json`.
`--gaps PATH`	None	Skip Levels 1 and 2 and start Level 3 from an existing `gap_analysis.json`.
`--max-connectors N`	`3`	Maximum number of connectors to generate in Level 3.
`--output PATH`	`./outputs`	Directory where all generated files are written.
`--verbose, -v`	off	Stream structured pipeline events to stdout as the run progresses.
`--check`	N/A	Validate config and test the API key, then exit without running the pipeline.
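
The flag table above maps naturally onto argparse. The following is a sketch of how such a parser could be declared, using only the flags and defaults listed in the table; the actual main.py is not shown here and may be organized differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented flags; help strings paraphrase the table."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--input", metavar="PATH", default=None,
                        help="Folder of documents to process through Level 1.")
    parser.add_argument("--level", "-l", type=int, choices=(1, 2, 3), default=1,
                        help="Pipeline depth to run.")
    parser.add_argument("--use-cases", metavar="PATH", default=None,
                        help="Optional JSON file of target automation workflows.")
    parser.add_argument("--inventory", metavar="PATH", default=None,
                        help="Start Level 2 from an existing system_inventory.json.")
    parser.add_argument("--gaps", metavar="PATH", default=None,
                        help="Start Level 3 from an existing gap_analysis.json.")
    parser.add_argument("--max-connectors", metavar="N", type=int, default=3,
                        help="Maximum number of connectors to generate.")
    parser.add_argument("--output", metavar="PATH", default="./outputs",
                        help="Directory where generated files are written.")
    parser.add_argument("--verbose", "-v", action="store_true",
                        help="Stream structured pipeline events to stdout.")
    parser.add_argument("--check", action="store_true",
                        help="Validate config and API key, then exit.")
    return parser


args = build_parser().parse_args(
    ["--input", "./inputs", "-l", "3", "--use-cases", "use_cases.json"]
)
print(args.level, args.use_cases, args.max_connectors, args.output)
```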

Output Files

File	Produced by	Who reads it	What is inside
`system_inventory.json`	Level 1	Downstream systems and Level 2	Full inventory with evidence quotes, source file references, and run metadata.
`report.md`	Level 1	Human reviewer	System tables, extracted evidence, confidence tiers, and run summary.
`gap_analysis.json`	Level 2	Downstream systems and Level 3	Prioritized gap list with effort estimates, dependency map, and connection status per link.
`gap_analysis_report.md`	Level 2	Human reviewer	Layered summary and Mermaid integration diagrams showing what is connected versus missing.
`connectors/<gap_id>/`	Level 3	Engineer implementing the integration	Full connector package: code, tests, agent definition, setup instructions.
`generation_manifest.json`	Level 3	Downstream systems	Connector specs, readiness scores, and validation proof (pass/fail plus test output).
`generation_report.md`	Level 3	Human reviewer	Summary of connectors built, readiness scores, and test results.
`pipeline_log.jsonl`	All levels	Auditor or debugger	One structured JSON entry per architectural decision across the full run.
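
Because pipeline_log.jsonl holds one JSON object per line, it can be audited with a few lines of Python. The sketch below assumes nothing about the log schema beyond that; the `event` key used in the tally is a hypothetical field name and should be replaced with whatever the log actually contains.

```python
import json
from collections import Counter


def parse_jsonl(lines):
    """Yield one parsed object per non-empty line (works on any iterable of str)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def tally(entries, key="event"):
    """Count entries by a field; "event" is an assumed, hypothetical key."""
    return Counter(entry.get(key, "<missing>") for entry in entries)


# Usage against a real run would look like:
#     with open("outputs/pipeline_log.jsonl", encoding="utf-8") as handle:
#         print(tally(parse_jsonl(handle)))
demo_lines = ['{"event": "quote_rejected"}\n', "\n", '{"event": "quote_rejected"}\n', "{}\n"]
print(tally(parse_jsonl(demo_lines)))
```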

Edge Case Handling

Document Ingestion (Level 1)

Scenario	How it is handled
File extension does not match actual format	Magic byte detection identifies the real type before parsing begins.
Corrupted, empty, or zero-byte file	Skipped with a log entry; does not crash the run.
Password-protected PDF	Caught and skipped with a clear error message in the log.
Scanned PDF with no embedded text	Routed to the vision model for image transcription.
Excel formula strings	Read as their last cached (displayed) value rather than the raw formula text before extraction.
Uncommon text encodings	UTF-8 attempted first, with fallback to latin-1 and chardet detection.
File too large for a single LLM context	Split into overlapping chunks; findings are merged after verification.
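
Magic-byte detection, the first row in the table above, can be sketched with a handful of well-known signatures. This is an illustration, not the project's format router: the function name and return labels are assumptions, and only a few formats are covered. DOCX and XLSX are both ZIP containers, so they are told apart by the files inside the archive.

```python
import io
import zipfile

# Leading byte signatures for formats that can be identified directly.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
]


def sniff_format(data: bytes) -> str:
    """Guess a file's real type from its content, ignoring the extension."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    if data.startswith(b"PK\x03\x04"):  # ZIP local file header
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as archive:
                names = set(archive.namelist())
        except zipfile.BadZipFile:
            return "unknown"
        if "word/document.xml" in names:
            return "docx"
        if "xl/workbook.xml" in names:
            return "xlsx"
        return "zip"
    return "unknown"


# A Word document renamed to report.pdf is still recognized as DOCX.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", "<w:document/>")
renamed_docx = buffer.getvalue()
print(sniff_format(renamed_docx), sniff_format(b"%PDF-1.7\n"))
```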

LLM Calls (All Levels)

Scenario	How it is handled
Model returns a fabricated evidence quote	Rejected by the substring guard rail before the finding enters the inventory.
Model invents a system not mentioned in any document	Rejected because no quote can be verified against source text.
Model returns invalid or malformed JSON	Bracket-slice recovery attempts repair; if unrecoverable, the call is retried.
API rate limit (HTTP 429)	Exponential backoff with jitter; automatic fallback to the next model in the chain.
API 5xx error or connection timeout	Same backoff chain; event logged to `pipeline_log.jsonl`.
Prompt injection payload in document text	Document content is strictly delimited in every prompt; payload cannot escape to modify instructions.
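
The "bracket-slice recovery" row refers to salvaging a JSON object from a response that wraps it in extra prose. A minimal sketch of that idea follows; the function name is an assumption, and the project's own repair logic may handle more cases, such as arrays or truncated output.

```python
import json


def recover_json(raw: str):
    """Parse raw as JSON, falling back to the outermost {...} slice.

    Returns the parsed value, or None when nothing can be recovered,
    in which case the caller would retry the LLM call.
    """
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass

    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None


wrapped = 'Sure! Here is the JSON:\n{"systems": ["Slack"]}\nHope that helps.'
print(recover_json(wrapped))
```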

Code Generation (Level 3)

Scenario	How it is handled
LLM invents API endpoint paths	Registry facts override the model. Unknown APIs are flagged for human review.
Generated Python has syntax errors	`ast.parse` catches them; connector is written but marked failing, readiness capped at 45%.
Duplicate field mappings in LLM output	Deduplicated by destination field in code before the template renders.
Varied pagination styles across APIs	Unified iterator handles cursor, page-number, and Link-header styles transparently.
Hardcoded credentials in generated code	Prevented by template design: credentials always read from environment variables.
Gap requires a system not in the inventory	Marked as blocked and skipped during code generation.
Generated test suite hangs	Validation subprocess is killed after a strict wall-clock timeout.
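
The syntax-error row relies on `ast.parse`, which compiles source text to a syntax tree without executing it. A small sketch of such a check is shown below; the helper name and the message format are assumptions.

```python
import ast
from typing import Optional


def syntax_error_in(source: str) -> Optional[str]:
    """Return a short description of the first syntax error, or None if valid."""
    try:
        ast.parse(source)  # parses only; the code is never executed
    except SyntaxError as exc:
        return f"line {exc.lineno}: {exc.msg}"
    return None


good = "def run():\n    return 42\n"
bad = "def run(:\n    return 42\n"
print(syntax_error_in(good), "|", syntax_error_in(bad))
```

Parsing is a cheap first gate: it cannot prove the connector works, which is why the import check and the generated test run still follow it.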

Challenge Acceptance Criteria

Level 1: Systems Discovery

Criterion	Status	Detail
Process mixed document formats	VERIFIED	13+ formats supported via magic-byte routing.
Extract the majority of systems	VERIFIED	27 systems found on the demo corpus.
Capture specific metadata fields	VERIFIED	Name, category, auth type, entities, processes, and criticality per system.
Tiered confidence scoring	VERIFIED	Deterministic and code-computed, never asked of the model.
Note inferences and missing data	VERIFIED	Inference flags appended to every sub-HIGH confidence finding.
JSON output with source references	VERIFIED	Source file, page number, and verified verbatim quote per finding.
Zero hallucinations	VERIFIED	Substring guard rail rejects any claim the source text does not support.

Level 2: Gap Analysis

Criterion	Status	Detail
Map use cases to systems	VERIFIED	Every workflow grounded against the Level 1 inventory.
Trace data flows	VERIFIED	Source system, destination system, entity type, and trigger captured per flow.
Mark connection status	VERIFIED	Live, Missing, or Blocked: determined by inventory evidence, not model output.
Estimate implementation effort	VERIFIED	Hours computed by code formula, not LLM-estimated.
Prioritize by business impact	VERIFIED	Ranked by a code-computed multiplier combining time saved and system criticality.
Dependency mapping	VERIFIED	Prerequisite blocking logic prevents impossible gaps from being recommended.

Level 3: Code Generation

Criterion	Status	Detail
Working Python code	VERIFIED	Every generated connector imports cleanly and executes without error.
Core API logic implemented	VERIFIED	Auth, CRUD, pagination, rate limiting, retries, and typed error handling present in every connector.
Unit tests included	VERIFIED	Minimum 7 tests per connector, auto-executed during the validation guard rail.
Agent definition included	VERIFIED	`agent.yaml` generated with real tool bindings to the connector methods.
Setup instructions	VERIFIED	Per-connector `README.md` and `requirements.txt` generated automatically.
70 to 80 percent production-ready	VERIFIED	Readiness score computed from registry versus AI-inferred logic split.

Test Suite

python -m pytest

58 offline unit tests. No API calls are made during the test run: all LLM interactions are mocked at the boundary. The Level 3 tests go one step beyond mocked assertions: they render connectors from the real templates and execute the generated pytest suite in a sandbox, proving the output is runnable code and not just syntactically valid text.

Test file coverage:

File	What it tests
`test_ingestion.py`	Format router, each parser type, magic byte detection, malformed file handling
`test_extraction.py`	LLM prompt construction, response parsing, Pydantic schema validation
`test_verification.py`	Quote guard rail, substring matching, rejection of fabricated quotes
`test_gap_analysis.py`	Workflow mapper, inventory grounder, effort scorer, ROI calculation
`test_codegen.py`	Spec registry lookup, template rendering, AST validation, generated test execution

Security

Control	Implementation
No hardcoded secrets	API keys loaded exclusively from `.env`; `.env` is in `.gitignore` and excluded from every generated artifact.
File type validation	Magic byte check runs before any parser is invoked: a renamed malicious file cannot reach the parsing layer.
Prompt injection defense	Document text is wrapped in strict delimiters inside every prompt. Content inside the delimiters cannot escape to modify the instruction.
No eval or exec	LLM output is always parsed as JSON via Pydantic. It is never passed to `eval()`, `exec()`, or `subprocess` with `shell=True`.
Generated connector safety	Connector templates read credentials from environment variables. The LLM cannot produce hardcoded secrets through the template system.
Subprocess sandboxing	The validation subprocess that runs generated tests has a wall-clock timeout and no access to the parent process environment.
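
The subprocess sandboxing row can be approximated with the standard library alone. The sketch below runs a child Python process with a wall-clock timeout and an allowlisted environment, so secrets such as GROQ_API_KEY are not inherited. The function name, the allowlist contents, and the return shape are assumptions; this is not the project's validator, and it is not a full security sandbox.

```python
import os
import subprocess
import sys

# Variables the child may inherit; everything else, including API keys, is dropped.
ENV_ALLOWLIST = ("PATH", "SYSTEMROOT", "LD_LIBRARY_PATH")


def run_isolated(args, timeout_s=60.0):
    """Run `python <args>` in a child process; return (succeeded, output)."""
    child_env = {k: os.environ[k] for k in ENV_ALLOWLIST if k in os.environ}
    try:
        completed = subprocess.run(
            [sys.executable, *args],
            capture_output=True,
            text=True,
            timeout=timeout_s,   # the child is killed when this elapses
            env=child_env,
        )
    except subprocess.TimeoutExpired:
        return False, "timed out"
    return completed.returncode == 0, completed.stdout + completed.stderr


# Running a generated suite might look like this (path is illustrative):
#     ok, output = run_isolated(["-m", "pytest", "outputs/connectors/<gap_id>"])
```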

Known Limitations

Field mappings are LLM-inferred. The connector scaffolding (auth, pagination, retries) is deterministic and reliable. The data field mapping between two APIs is the model's best guess at what corresponds to what. Every generated connector explicitly flags this for human engineering review before production use.

Registry coverage. The curated spec_registry.py covers the most common business SaaS APIs. Systems outside the registry are handled by AI inference: connectors for those systems get lower readiness scores and are explicitly marked as unverified. Adding a new API to the registry is straightforward and documented inside spec_registry.py.

Groq rate limits at scale. Large document corpora (100 or more files) will hit Groq's per-model rate limits. The system handles this gracefully: it backs off, rotates to the fallback model chain, and continues. Total processing time increases for large corpora. Run with --verbose to monitor progress.
