A complete solution for the Aivar Innovations Discovery Agent challenge, covering all three levels with verified results and 58 passing unit tests.
Companies often lack a clear picture of which software tools they actually use, and which of those tools should be exchanging data but are not. Flowgent reads a company's existing documents and produces that picture in three steps: a verified system inventory, a prioritized list of missing integrations, and runnable Python connector code for the highest-impact gaps. Every output is grounded in source text with no hallucinations.
This repository includes samples/apex_digital/, a nine-document sample corpus for a fictional company. The results below came from running the full pipeline against that dataset:
DEMO RESULTS (samples/apex_digital, included in this repo)
────────────────────────────────────────────────────────────
Level 1 | 27 systems found | 0 hallucinations | 9/9 docs
Level 2 | 14 gaps identified | 7 workflows mapped
Level 3 | 3 connectors built | 3/3 pass unit tests
────────────────────────────────────────────────────────────
- What This Project Does
- System Architecture
- Technology Stack
- Project Layout
- How to Run It
- CLI Reference
- Output Files
- Edge Case Handling
- Challenge Acceptance Criteria
- Test Suite
- Security
- Known Limitations
Most companies use dozens of software tools: Slack, QuickBooks, Shopify, and so on. Nobody has a clean list of what is actually installed, and nobody knows which tools should be talking to each other but are not, or how many hours of manual work that gap costs each week.
Flowgent solves this in three steps:
- Level 1: Systems Discovery reads company documents (PDFs, spreadsheets, Word files, images, HTML exports) and builds a clean, verified list of every software tool mentioned. If a tool cannot be cited with a verbatim quote from a source file, it is not included.
- Level 2: Gap Analysis compares the tool inventory against automation goals, then identifies every missing connection: which integrations do not exist, how hard they would be to build, and how much manual time they cost.
- Level 3: Code Generation writes actual, runnable Python code to connect the highest-priority tool pairs. Not boilerplate: real auth handling, pagination, retry logic, typed errors, and unit tests.
The core design rule: the LLM proposes, code disposes.
AI models make things up. They will confidently name a software system that never appeared in any document, or invent an API endpoint that does not exist. Every level of this pipeline treats the LLM as the only untrusted component and passes every LLM output through a deterministic code guard rail before accepting it:
| Level | LLM proposes | Code verifies by |
|---|---|---|
| 1 | A system name and a verbatim quote from the source document | Substring-searching the actual source text. Rejects anything not physically present in the file. |
| 2 | Which systems are involved in a workflow and whether they are connected | Cross-referencing against the verified Level 1 inventory. Marks a link live only if a stored evidence quote proves it. |
| 3 | What entities and fields a connector should map | Rendering code from hand-written, pre-tested templates. The LLM never writes auth, retry, or pagination logic. |
Confidence scores and effort estimates are computed by code, never by the model. The full reasoning trail is written to pipeline_log.jsonl and every decision is auditable.
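As an illustration of the Level 1 row in the table above, the substring check can be sketched in a few lines. This is a minimal sketch, not the project's verification.py: the Finding class and the verify_quote and filter_verified names are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """Hypothetical shape of one LLM proposal: a system name plus its evidence quote."""
    system_name: str
    quote: str


def verify_quote(finding: Finding, source_text: str) -> bool:
    """Accept a finding only if its quote is a literal substring of the source."""
    quote = finding.quote.strip()
    if not quote:
        # An empty quote is trivially "in" any string, so reject it explicitly.
        return False
    return quote in source_text


def filter_verified(findings: list[Finding], source_text: str) -> list[Finding]:
    """Drop every finding whose quote cannot be found verbatim in the source."""
    return [f for f in findings if verify_quote(f, source_text)]
```

A finding whose quote does not occur verbatim in the source text is dropped before it can reach the inventory, regardless of how plausible the system name sounds.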
Input: Any folder of business documents (PDF, DOCX, XLSX, HTML, PNG, CSV, and more)
What happens inside:
- Format Router detects the real file type using magic bytes, not the file extension. A .pdf that is actually a Word document gets caught and routed correctly.
- Text Chunker splits long documents into overlapping segments so no mention of a system is lost at a page boundary.
- LLM Extraction asks the model to identify software systems and return a verbatim quote from the chunk as evidence. This output is untrusted.
- Quote Guard Rail checks that the returned quote is a literal substring of the source chunk. If it is not, the finding is rejected. This is the hallucination firewall.
- Taxonomy Resolver maps 200+ known aliases ("SFDC", "Salesforce CRM", "salesforce.com") to canonical system names so duplicates across documents collapse correctly.
- Deterministic Scorer calculates a confidence score in code based on how many independent documents mention the system, how strong the evidence quote is, and what category of system it belongs to. The model never assigns its own confidence.
Output: system_inventory.json and report.md
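The Text Chunker step above can be sketched as a sliding window over characters. This is an illustrative sketch only: the chunk_text name and the chunk_size and overlap defaults are assumptions, and the real chunking.py may well split on tokens or pages instead.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size windows that share `overlap` characters.

    A mention that straddles the end of one window is still fully contained
    in the next one, as long as the mention is no longer than `overlap`.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

Because each chunk is verified against its own text, the overlap only has to be large enough to hold one complete evidence quote.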
Input: system_inventory.json from Level 1, and optionally a use_cases.json file listing the automation workflows that matter to the business.
Note on use_cases.json: This file is completely optional. Without it, the agent infers generic workflow goals from the discovered system inventory. Providing it focuses the analysis on the specific integrations the business actually cares about and generally produces more relevant gap rankings. See samples/apex_digital/use_cases_apex.json for the expected format.
What happens inside:
- Workflow Mapper asks the LLM which systems are involved in each use case and what data flows between them. Output is untrusted.
- Inventory Guard Rail cross-references every system the LLM named against the verified Level 1 inventory. Any system not in the inventory is rejected, preventing the model from inventing connections to systems that do not exist.
- Connection Checker marks each connection as live, missing, or blocked. A connection is marked live only if an evidence quote in the inventory proves it. Blocked means a required prerequisite system is absent.
- Effort and ROI Scorer estimates implementation effort in hours and computes a business impact multiplier using a deterministic formula. The model never estimates effort.
Output: gap_analysis.json and gap_analysis_report.md
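The Inventory Guard Rail and Connection Checker steps above amount to a small decision rule. The sketch below is one way to express it, under stated assumptions: the classify_connection name and the link_evidence dictionary (keyed by an unordered pair of system names) are hypothetical and are not taken from grounder.py.

```python
def classify_connection(
    source: str,
    target: str,
    inventory: set[str],
    link_evidence: dict[frozenset, str],
) -> str:
    """Return "live", "missing", or "blocked" for one proposed link.

    inventory     -- canonical system names verified in Level 1
    link_evidence -- verified quotes proving a link, keyed by the system pair
    """
    # A link that needs a system absent from the inventory cannot be built yet.
    if source not in inventory or target not in inventory:
        return "blocked"
    # Only a stored evidence quote can prove that the link already exists.
    if link_evidence.get(frozenset((source, target))):
        return "live"
    return "missing"
```

The model's own opinion about whether two systems are connected never appears in this function: the status depends only on the verified inventory and the stored quotes.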
Input: gap_analysis.json from Level 2, specifically the top-N highest-priority gaps.
A naive implementation would ask the LLM to write a Shopify-to-QuickBooks connector. It would confidently invent endpoint paths, authentication headers, and pagination schemes that look plausible but are wrong. This fails the zero-hallucination requirement.
What happens inside:
- Spec Registry Lookup (spec_registry.py) checks a curated database of real, verifiable API facts: base URLs, authentication methods (OAuth2, API key, Bearer token), and pagination styles (cursor, page-number, Link header). For any system in the registry, these facts are treated as ground truth.
- LLM Spec Resolution (spec_resolver.py) asks the model to propose which entities to connect and how fields map between them. For registry systems, the model's proposed API facts are overridden by registry data. For unknown systems, the output is flagged as unverified.
- Template Renderer (templates.py) generates the actual Python code from hand-written, unit-tested templates. The auth handshake, retry loop with exponential backoff, rate limiter, pagination iterator, and typed error taxonomy are written by a human engineer. The LLM never touches this logic.
- Validation Guard Rail (validation.py) parses the generated code as an AST to catch syntax errors, imports the module to verify it loads, then runs the generated pytest suite in a sandboxed subprocess with a strict wall-clock timeout.
Output: outputs/connectors/<gap_id>/ per connector, plus generation_manifest.json and generation_report.md
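Two parts of the Validation Guard Rail described above, the AST parse and the time-limited test run, can be sketched with the standard library alone. The function names and the 60-second default are assumptions for illustration; the import check and any further sandboxing the project applies are left out.

```python
import ast
import subprocess
import sys


def check_syntax(code: str) -> tuple[bool, str]:
    """Parse generated source without executing it; report the first syntax error."""
    try:
        ast.parse(code)
    except SyntaxError as exc:
        return False, f"line {exc.lineno}: {exc.msg}"
    return True, ""


def run_generated_tests(package_dir: str, timeout_s: float = 60.0) -> bool:
    """Run pytest on a connector package in a child process with a wall-clock limit."""
    try:
        result = subprocess.run(
            [sys.executable, "-m", "pytest", "-q", package_dir],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed if it exceeds this limit
        )
    except subprocess.TimeoutExpired:
        return False  # a hanging suite counts as a failure
    return result.returncode == 0
```

Running the suite in a separate interpreter means a crash or hang in generated code cannot take the pipeline process down with it.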
Each connector package lands in outputs/connectors/<gap_id>/:
| File | What is inside |
|---|---|
| connector.py | Environment-variable auth, full CRUD per entity, transparent pagination iterator, rate limiter, retries with exponential backoff, typed error classes, and a run() workflow method. |
| test_connector.py | Offline unit tests covering auth headers, pagination behavior, 429 rate-limit retries, HTTP error handling, create operations, and field mapping. Minimum 7 tests per connector. |
| agent.yaml | Agent definition with system prompt, tool bindings wired to the real connector methods, workflow steps, and guardrails. |
| README.md | Per-connector setup instructions, credential environment variable list, and usage examples. |
| requirements.txt | Dependency manifest for the connector package. |
Readiness scores are computed per connector:
- Registry API with ground-truth facts available: approximately 85% ready
- AI-inferred API with facts flagged as unverified: approximately 60% ready
- Any connector that fails its own generated tests: capped at 45%
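The three rules above can be read as a base score plus a cap. The sketch below treats the approximate percentages as exact constants, which is an assumption made for illustration; the readiness_score name is hypothetical and the project's actual formula may weigh more factors.

```python
REGISTRY_BASE = 85   # assumed constant for "approximately 85%"
INFERRED_BASE = 60   # assumed constant for "approximately 60%"
FAILED_CAP = 45      # cap applied when the generated tests fail


def readiness_score(api_facts_from_registry: bool, tests_passed: bool) -> int:
    """Combine the source of the API facts with the validation outcome."""
    base = REGISTRY_BASE if api_facts_from_registry else INFERRED_BASE
    return base if tests_passed else min(base, FAILED_CAP)
```

Under this reading, a failing test suite always dominates: even a registry-backed connector drops to 45.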
| Component | Choice | Why |
|---|---|---|
| Language | Python 3.11+ | Best LLM and document tooling ecosystem. |
| LLM provider | Groq | Low-latency inference on LPU hardware, OpenAI-compatible API. |
| Primary model | llama-3.3-70b-versatile | Strong structured extraction, reliable JSON mode. |
| Fallback chain | gpt-oss-120b then qwen3-32b | Independent rate-limit pools: if primary is throttled, the fallback picks up automatically. |
| Vision model | llama-4-scout-17b | Multimodal, handles scanned documents and image-only PDFs. |
| Schema validation | Pydantic v2 | Strict type enforcement on every LLM response: malformed output fails fast. |
| HTTP client (connectors) | requests | Ubiquitous, human-reviewable, no magic. |
| Document parsers | PyMuPDF, python-docx, openpyxl, BeautifulSoup4 | Native, battle-tested parsers for PDF, Word, Excel, and HTML. |
| Rate limiting | Token-bucket with exponential backoff and jitter | Handles Groq per-model rate limits gracefully at scale. |
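The rate-limiting row above names two techniques, a token bucket and exponential backoff with jitter. The sketch below shows one common form of each. The class and function names, the injectable clock, and the "full jitter" variant are assumptions for illustration and are not taken from llm.py.

```python
import random
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilled continuously at `rate_per_s`."""

    def __init__(self, rate_per_s: float, capacity: float, clock=time.monotonic):
        self.rate = rate_per_s
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock  # injectable so the bucket can be tested without sleeping
        self.last = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Refill for the elapsed time, then spend `cost` tokens if available."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


def backoff_delay(attempt: int, base_s: float = 1.0, cap_s: float = 30.0,
                  rng=random.random) -> float:
    """Full jitter: a random delay in [0, min(cap_s, base_s * 2**attempt)]."""
    return rng() * min(cap_s, base_s * (2 ** attempt))
```

The bucket spaces out requests before a limit is hit, while the jittered backoff spreads out retries after a 429 so that parallel workers do not retry in lockstep.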
FLOWGENT_CODEBASE/
│
├── main.py # CLI entry point: parses flags and routes to the correct pipeline
├── requirements.txt # All pinned Python dependencies
├── pytest.ini # Test runner configuration (rootdir, test discovery paths)
├── .env.example # Template showing every required environment variable
├── writeup.md # Architecture decisions and trade-offs
├── discovery_agent/ # Core engine: all pipeline logic lives here
│ │
│ ├── config.py # Loads .env, validates required keys, exposes settings object
│ ├── models.py # Pydantic schemas for all Level 1 data structures
│ ├── observability.py # Structured JSON logging to pipeline_log.jsonl
│ ├── llm.py # Resilient Groq API client: rate limiter, backoff, fallback chain, JSON repair
│ ├── utils.py # Shared helpers: JSON recovery, text cleaning, retry wrappers
│ ├── taxonomy.py # Alias resolution: maps 200+ system name variants to canonical names
│ ├── chunking.py # Splits document text into overlapping segments for LLM processing
│ ├── extraction.py # Level 1 LLM prompts: asks model to identify systems and return evidence quotes
│ ├── verification.py # Anti-hallucination guard rail: substring-checks every LLM quote against source
│ ├── aggregation.py # Deduplicates system findings across multiple documents and chunks
│ ├── scoring.py # Deterministic confidence calculator: code-based, never LLM-based
│ ├── pipeline.py # Level 1 orchestrator: wires ingest, extract, verify, score, output
│ │
│ ├── ingestion/ # Document format parsers, one module per file type
│ │ ├── __init__.py # Format router: dispatches each file to the right parser using magic bytes
│ │ ├── pdf_parser.py # PDF extraction via PyMuPDF; falls back to vision model for image-only pages
│ │ ├── docx_parser.py # Word document extraction via python-docx
│ │ ├── xlsx_parser.py # Excel extraction via openpyxl; handles formulas and multi-sheet files
│ │ ├── html_parser.py # HTML extraction via BeautifulSoup4; strips tags and navigation boilerplate
│ │ ├── image_parser.py # PNG/JPG transcription via llama-4-scout-17b vision model
│ │ └── text_parser.py # Plain text and CSV fallback parser
│ │
│ ├── data/
│ │ └── taxonomy.json # 200+ curated system name aliases (example: SFDC maps to Salesforce)
│ │
│ ├── gap_models.py # Pydantic schemas for all Level 2 data structures (gaps, workflows, connections)
│ ├── gap_pipeline.py # Level 2 orchestrator: wires inventory, map, ground, score, output
│ │
│ ├── gap_analysis/ # Level 2 logic modules
│ │ ├── __init__.py
│ │ ├── mapper.py # LLM workflow mapper: proposes which systems connect for each use case
│ │ ├── grounder.py # Guard rail: cross-references every LLM system claim against Level 1 inventory
│ │ ├── scorer.py # Computes effort hours and business impact multiplier in deterministic code
│ │ └── reporter.py # Renders gap_analysis_report.md including Mermaid integration diagrams
│ │
│ ├── codegen_models.py # Pydantic schemas for all Level 3 data structures (connector specs, manifests)
│ ├── codegen_pipeline.py # Level 3 orchestrator: wires gaps, resolve, render, validate, output
│ │
│ └── code_generation/ # Level 3 logic modules
│ ├── __init__.py
│ ├── spec_registry.py # Curated database of real API facts (base URLs, auth schemes, pagination styles)
│ ├── spec_resolver.py # Merges LLM field-mapping proposals with registry facts; registry always wins
│ ├── templates.py # Deterministic Jinja2 templates for connector.py, tests, agent.yaml, README
│ ├── validator.py # Guard rail: AST parse, import check, runs generated pytest in a subprocess
│ └── reporter.py # Renders generation_report.md with readiness scores and validation status
│
├── samples/
│ ├── apex_digital/ # Nine-document demo corpus across five formats, ready to run immediately
│ │ ├── active_clients.csv
│ │ ├── billing_and_invoicing.txt
│ │ ├── client_onboarding_checklist.txt
│ │ ├── monthly_reporting_sop.txt
│ │ ├── ops_notes_slack_export.html
│ │ ├── team_handbook.txt
│ │ ├── tech_audit_notes.md
│ │ ├── tool_subscriptions.csv
│ │ └── use_cases_apex.json # Example automation goals matching the apex_digital dataset
│ └── .gitkeep
│
├── tests/ # 58 offline unit tests, no API calls, no network required
│ ├── test_ingestion.py # Format parser tests across all supported file types
│ ├── test_extraction.py # LLM extraction prompt construction and response parsing tests
│ ├── test_verification.py # Anti-hallucination guard rail and substring-matching tests
│ ├── test_gap_analysis.py # Gap mapper, inventory grounder, and effort scorer tests
│ └── test_codegen.py # Connector rendering and validation tests (physically renders and executes connectors)
│
├── inputs/ # PUT YOUR DOCUMENTS HERE before running --input ./inputs
│
└── outputs/ # All generated artifacts appear here after a run
├── system_inventory.json
├── report.md
├── gap_analysis.json
├── gap_analysis_report.md
├── generation_manifest.json
├── generation_report.md
├── pipeline_log.jsonl
└── connectors/
└── <gap_id>/
├── connector.py
├── test_connector.py
├── agent.yaml
├── README.md
└── requirements.txt
python -m pip install -r requirements.txt
cp .env.example .env
Open .env and fill in your GROQ_API_KEY. Then verify the setup works:
python main.py --check
# Expected output: "Configuration valid. API connection OK."
The samples/apex_digital/ folder contains nine real-format documents for a fictional example company. Use this to see the full pipeline working immediately, with no setup beyond the API key.
python main.py --input ./samples/apex_digital --output ./outputs
Before running, copy your documents into the ./inputs/ folder. Supported formats include PDF, Word (.docx), Excel (.xlsx), HTML, PNG, JPG, CSV, and plain text. Formats can be mixed freely: the agent detects each file's type from its content, not its extension.
python main.py --input ./inputs --output ./outputs
Add --level 3 to run Systems Discovery, Gap Analysis, and Code Generation in a single command. The --use-cases flag is optional but recommended: it tells the agent which workflows your business actually cares about, producing more relevant gap rankings. Without it, the agent infers generic workflow goals from the discovered inventory.
# With a use-cases file (recommended)
python main.py --input ./inputs --output ./outputs --level 3 --use-cases ./samples/apex_digital/use_cases_apex.json --max-connectors 3
# Without a use-cases file (agent infers goals from the inventory)
python main.py --input ./inputs --output ./outputs --level 3 --max-connectors 3
After the run, open report.md, gap_analysis_report.md, and generation_report.md in ./outputs/ for human-readable summaries of each level.
If Level 1 has already been run, there is no need to reprocess all documents to run Level 2. The standalone flags let the pipeline pick up from an existing artifact:
# Run only Level 2 using an existing inventory
python main.py --inventory ./outputs/system_inventory.json --use-cases ./samples/apex_digital/use_cases_apex.json
# Run only Level 3 using an existing gap analysis
python main.py --gaps ./outputs/gap_analysis.json --inventory ./outputs/system_inventory.json

| Flag | Default | What it does |
|---|---|---|
| --input PATH | None | Folder of documents to process through Level 1. |
| --level, -l N | 1 | Pipeline depth to run: 1, 2, or 3. |
| --use-cases PATH | None | Optional JSON file of target automation workflows. Without it, the agent infers goals from the inventory. |
| --inventory PATH | None | Skip Level 1 and start Level 2 from an existing system_inventory.json. |
| --gaps PATH | None | Skip Levels 1 and 2 and start Level 3 from an existing gap_analysis.json. |
| --max-connectors N | 3 | Maximum number of connectors to generate in Level 3. |
| --output PATH | ./outputs | Directory where all generated files are written. |
| --verbose, -v | off | Stream structured pipeline events to stdout as the run progresses. |
| --check | N/A | Validate config and test the API key, then exit without running the pipeline. |
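For readers who want to see how the flags in this table fit together, the sketch below reconstructs them with argparse. It is an assumption-laden illustration built only from the table: the build_parser name and the help strings are hypothetical, and the real main.py may be organized differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented flags and defaults."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--input", metavar="PATH", default=None,
                        help="Folder of documents to process through Level 1.")
    parser.add_argument("--level", "-l", metavar="N", type=int,
                        choices=[1, 2, 3], default=1,
                        help="Pipeline depth to run: 1, 2, or 3.")
    parser.add_argument("--use-cases", metavar="PATH", default=None,
                        help="Optional JSON file of target automation workflows.")
    parser.add_argument("--inventory", metavar="PATH", default=None,
                        help="Start Level 2 from an existing system_inventory.json.")
    parser.add_argument("--gaps", metavar="PATH", default=None,
                        help="Start Level 3 from an existing gap_analysis.json.")
    parser.add_argument("--max-connectors", metavar="N", type=int, default=3,
                        help="Maximum number of connectors to generate in Level 3.")
    parser.add_argument("--output", metavar="PATH", default="./outputs",
                        help="Directory where all generated files are written.")
    parser.add_argument("--verbose", "-v", action="store_true",
                        help="Stream structured pipeline events to stdout.")
    parser.add_argument("--check", action="store_true",
                        help="Validate config and test the API key, then exit.")
    return parser
```

Calling build_parser().parse_args(["--input", "./inputs", "-l", "3"]) yields a namespace with level set to 3 and every other flag at its documented default.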
| File | Produced by | Who reads it | What is inside |
|---|---|---|---|
| system_inventory.json | Level 1 | Downstream systems and Level 2 | Full inventory with evidence quotes, source file references, and run metadata. |
| report.md | Level 1 | Human reviewer | System tables, extracted evidence, confidence tiers, and run summary. |
| gap_analysis.json | Level 2 | Downstream systems and Level 3 | Prioritized gap list with effort estimates, dependency map, and connection status per link. |
| gap_analysis_report.md | Level 2 | Human reviewer | Layered summary and Mermaid integration diagrams showing what is connected versus missing. |
| connectors/<gap_id>/ | Level 3 | Engineer implementing the integration | Full connector package: code, tests, agent definition, setup instructions. |
| generation_manifest.json | Level 3 | Downstream systems | Connector specs, readiness scores, and validation proof (pass/fail plus test output). |
| generation_report.md | Level 3 | Human reviewer | Summary of connectors built, readiness scores, and test results. |
| pipeline_log.jsonl | All levels | Auditor or debugger | One structured JSON entry per architectural decision across the full run. |
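The pipeline_log.jsonl row describes an append-only file with one JSON object per line. A minimal sketch of such a writer is shown below. The log_event name and every field name (ts, level, stage, decision, details) are assumptions for illustration; the schema actually written by observability.py is not documented here.

```python
import json
import time
from pathlib import Path


def log_event(log_path: str, level: int, stage: str, decision: str, **details) -> dict:
    """Append one decision record as a single JSON line and return it."""
    entry = {
        "ts": time.time(),      # wall-clock timestamp of the decision
        "level": level,         # pipeline level: 1, 2, or 3
        "stage": stage,         # e.g. "verification"
        "decision": decision,   # e.g. "rejected"
        "details": details,     # free-form context for the auditor
    }
    with Path(log_path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return entry
```

Because each line is a complete JSON document, the log can be filtered with line-oriented tools or loaded one record at a time without parsing the whole file.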
| Scenario | How it is handled |
|---|---|
| File extension does not match actual format | Magic byte detection identifies the real type before parsing begins. |
| Corrupted, empty, or zero-byte file | Skipped with a log entry; does not crash the run. |
| Password-protected PDF | Caught and skipped with a clear error message in the log. |
| Scanned PDF with no embedded text | Routed to the vision model for image transcription. |
| Excel formula strings | Evaluated to their displayed value before extraction. |
| Uncommon text encodings | UTF-8 attempted first, with fallback to latin-1 and chardet detection. |
| File too large for a single LLM context | Split into overlapping chunks; findings are merged after verification. |
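The first row of this table, detecting the real format from content, can be sketched with a handful of well-known file signatures. The detect_format name and the returned labels are assumptions for this example, and the project's router covers more formats than the few shown here.

```python
import io
import zipfile

# Leading bytes ("magic numbers") of a few common formats.
_SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
]


def detect_format(data: bytes) -> str:
    """Classify a file by its content, ignoring whatever its extension claims."""
    for magic, name in _SIGNATURES:
        if data.startswith(magic):
            return name
    if data.startswith(b"PK\x03\x04"):
        # .docx and .xlsx are both ZIP containers; the member paths tell them apart.
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as zf:
                names = zf.namelist()
        except zipfile.BadZipFile:
            return "unknown"
        if any(n.startswith("word/") for n in names):
            return "docx"
        if any(n.startswith("xl/") for n in names):
            return "xlsx"
        return "zip"
    return "unknown"
```

A Word document that was renamed to report.pdf still begins with the ZIP signature and contains word/ members, so it is classified as docx and sent to the Word parser.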
| Scenario | How it is handled |
|---|---|
| Model returns a fabricated evidence quote | Rejected by the substring guard rail before the finding enters the inventory. |
| Model invents a system not mentioned in any document | Rejected because no quote can be verified against source text. |
| Model returns invalid or malformed JSON | Bracket-slice recovery attempts repair; if unrecoverable, the call is retried. |
| API rate limit (HTTP 429) | Exponential backoff with jitter; automatic fallback to the next model in the chain. |
| API 5xx error or connection timeout | Same backoff chain; event logged to pipeline_log.jsonl. |
| Prompt injection payload in document text | Document content is strictly delimited in every prompt; payload cannot escape to modify instructions. |
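The bracket-slice recovery mentioned in this table can be sketched as follows: if the raw response does not parse, keep only the text between the first opening brace and the last closing brace and try again. The recover_json name is hypothetical, and the sketch handles only top-level JSON objects.

```python
import json
from typing import Any, Optional


def recover_json(raw: str) -> Optional[Any]:
    """Parse a model response as JSON, tolerating chatter around the object."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass

    # Slice from the first "{" to the last "}" and retry once.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None  # unrecoverable: the caller should retry the LLM call
```

Returning None rather than raising keeps the retry decision with the caller, which matches the "if unrecoverable, the call is retried" behavior described in the table.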
| Scenario | How it is handled |
|---|---|
| LLM invents API endpoint paths | Registry facts override the model. Unknown APIs are flagged for human review. |
| Generated Python has syntax errors | ast.parse catches them; connector is written but marked failing, readiness capped at 45%. |
| Duplicate field mappings in LLM output | Deduplicated by destination field in code before the template renders. |
| Varied pagination styles across APIs | Unified iterator handles cursor, page-number, and Link-header styles transparently. |
| Hardcoded credentials in generated code | Prevented by template design: credentials always read from environment variables. |
| Gap requires a system not in the inventory | Marked as blocked and skipped during code generation. |
| Generated test suite hangs | Validation subprocess is killed after a strict wall-clock timeout. |
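The unified pagination iterator from this table can be illustrated with a generator that hides the three styles behind one loop. This is a sketch under several assumptions: the paginate name, the fetch callable (which returns a parsed body and the response headers), and the items, next_cursor, page, and cursor field names are all hypothetical and will differ per API.

```python
import re
from typing import Callable, Iterator

# fetch(url, params) -> (json_body, response_headers); hypothetical signature.
Fetch = Callable[[str, dict], tuple[dict, dict]]
_NEXT_LINK = re.compile(r'<([^>]+)>\s*;\s*rel="next"')


def paginate(fetch: Fetch, url: str, style: str, items_key: str = "items") -> Iterator:
    """Yield items across pages for cursor, page-number, or Link-header APIs."""
    if style not in {"cursor", "page", "link"}:
        raise ValueError(f"unknown pagination style: {style}")
    page = 1
    params: dict = {"page": page} if style == "page" else {}
    while True:
        body, headers = fetch(url, params)
        items = body.get(items_key, [])
        yield from items
        if style == "cursor":
            cursor = body.get("next_cursor")
            if not cursor:
                return  # no cursor means this was the last page
            params = {"cursor": cursor}
        elif style == "page":
            if not items:
                return  # an empty page marks the end
            page += 1
            params = {"page": page}
        else:
            match = _NEXT_LINK.search(headers.get("Link", ""))
            if not match:
                return  # no rel="next" link means this was the last page
            url, params = match.group(1), {}
```

Injecting fetch keeps the iterator free of HTTP details, so the same loop can be exercised offline with canned responses.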
| Criterion | Status | Detail |
|---|---|---|
| Process mixed document formats | VERIFIED | 13+ formats supported via magic-byte routing. |
| Extract the majority of systems | VERIFIED | 27 systems found on the demo corpus. |
| Capture specific metadata fields | VERIFIED | Name, category, auth type, entities, processes, and criticality per system. |
| Tiered confidence scoring | VERIFIED | Deterministic and code-computed, never asked of the model. |
| Note inferences and missing data | VERIFIED | Inference flags appended to every sub-HIGH confidence finding. |
| JSON output with source references | VERIFIED | Source file, page number, and verified verbatim quote per finding. |
| Zero hallucinations | VERIFIED | Substring guard rail rejects any claim the source text does not support. |
| Criterion | Status | Detail |
|---|---|---|
| Map use cases to systems | VERIFIED | Every workflow grounded against the Level 1 inventory. |
| Trace data flows | VERIFIED | Source system, destination system, entity type, and trigger captured per flow. |
| Mark connection status | VERIFIED | Live, Missing, or Blocked: determined by inventory evidence, not model output. |
| Estimate implementation effort | VERIFIED | Hours computed by code formula, not LLM-estimated. |
| Prioritize by business impact | VERIFIED | Ranked by a code-computed multiplier combining time saved and system criticality. |
| Dependency mapping | VERIFIED | Prerequisite blocking logic prevents impossible gaps from being recommended. |
| Criterion | Status | Detail |
|---|---|---|
| Working Python code | VERIFIED | Every generated connector imports cleanly and executes without error. |
| Core API logic implemented | VERIFIED | Auth, CRUD, pagination, rate limiting, retries, and typed error handling present in every connector. |
| Unit tests included | VERIFIED | Minimum 7 tests per connector, auto-executed during the validation guard rail. |
| Agent definition included | VERIFIED | agent.yaml generated with real tool bindings to the connector methods. |
| Setup instructions | VERIFIED | Per-connector README.md and requirements.txt generated automatically. |
| 70 to 80 percent production-ready | VERIFIED | Readiness score computed from registry versus AI-inferred logic split. |
python -m pytest
58 offline unit tests. No API calls are made during the test run: all LLM interactions are mocked at the boundary. The Level 3 tests go beyond mocking: they physically render connectors using real templates and execute the generated pytest suite in a sandbox to prove the output is runnable code, not just syntactically valid text.
Test file coverage:
| File | What it tests |
|---|---|
| test_ingestion.py | Format router, each parser type, magic byte detection, malformed file handling |
| test_extraction.py | LLM prompt construction, response parsing, Pydantic schema validation |
| test_verification.py | Quote guard rail, substring matching, rejection of fabricated quotes |
| test_gap_analysis.py | Workflow mapper, inventory grounder, effort scorer, ROI calculation |
| test_codegen.py | Spec registry lookup, template rendering, AST validation, generated test execution |
| Control | Implementation |
|---|---|
| No hardcoded secrets | API keys loaded exclusively from .env; .env is in .gitignore and excluded from every generated artifact. |
| File type validation | Magic byte check runs before any parser is invoked: a renamed malicious file cannot reach the parsing layer. |
| Prompt injection defense | Document text is wrapped in strict delimiters inside every prompt. Content inside the delimiters cannot escape to modify the instruction. |
| No eval or exec | LLM output is always parsed as JSON via Pydantic. It is never passed to eval(), exec(), or subprocess with shell=True. |
| Generated connector safety | Connector templates read credentials from environment variables. The LLM cannot produce hardcoded secrets through the template system. |
| Subprocess sandboxing | The validation subprocess that runs generated tests has a wall-clock timeout and no access to the parent process environment. |
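One way to implement the strict delimiters described in the prompt injection row is to wrap the document in a boundary string that is generated per prompt and guaranteed not to occur in the document itself. The sketch below illustrates that idea; the wrap_document name and the boundary format are assumptions, and the project's actual delimiter scheme is not shown in this README.

```python
import secrets


def wrap_document(document_text: str) -> tuple[str, str]:
    """Return (boundary, wrapped_text) with a boundary absent from the document."""
    while True:
        boundary = f"DOC-{secrets.token_hex(8)}"
        if boundary not in document_text:
            break  # the document cannot contain, and so cannot close, this boundary
    wrapped = f"<<<{boundary}\n{document_text}\n{boundary}>>>"
    return boundary, wrapped
```

The instruction part of the prompt can then name the boundary and state that everything between the two markers is data to analyze, not instructions to follow.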
Field mappings are LLM-inferred. The connector scaffolding (auth, pagination, retries) is deterministic and reliable. The data field mapping between two APIs is the model's best guess at what corresponds to what. Every generated connector explicitly flags this for human engineering review before production use.
Registry coverage. The curated spec_registry.py covers the most common business SaaS APIs. Systems outside the registry are handled by AI inference: connectors for those systems get lower readiness scores and are explicitly marked as unverified. Adding a new API to the registry is straightforward and documented inside spec_registry.py.
Groq rate limits at scale. Large document corpora (100 or more files) will hit Groq's per-model rate limits. The system handles this gracefully: it backs off, rotates to the fallback model chain, and continues. Total processing time increases for large corpora. Run with --verbose to monitor progress.