Skip to content

ctrl-a-shift-del/Flowgent-A-Discovery-Agent-For-Business-Automation

Repository files navigation

Flowgent

A complete solution for the Aivar Innovations Discovery Agent challenge, covering all three levels with verified results and 58 passing unit tests.


Companies often lack a clear picture of which software tools they actually use, and which of those tools should be exchanging data but are not. Flowgent reads a company's existing documents and produces that picture in three steps: a verified system inventory, a prioritized list of missing integrations, and runnable Python connector code for the highest-impact gaps. Every output is grounded in source text with no hallucinations.

This repository includes samples/apex_digital/, a nine-document sample corpus for a fictional company. The results below came from running the full pipeline against that dataset:

  DEMO RESULTS  (samples/apex_digital, included in this repo)
  ────────────────────────────────────────────────────────────
  Level 1  |  27 systems found   |  0 hallucinations  |  9/9 docs
  Level 2  |  14 gaps identified |  7 workflows mapped
  Level 3  |  3 connectors built |  3/3 pass unit tests
  ────────────────────────────────────────────────────────────

Table of Contents


What This Project Does

Most companies use dozens of software tools: Slack, QuickBooks, Shopify, and so on. Nobody has a clean list of what is actually installed, and nobody knows which tools should be talking to each other but are not, or how many hours of manual work that gap costs each week.

Flowgent solves this in three steps:

  • Level 1: Systems Discovery reads company documents (PDFs, spreadsheets, Word files, images, HTML exports) and builds a clean, verified list of every software tool mentioned. If a tool cannot be cited with a verbatim quote from a source file, it is not included.
  • Level 2: Gap Analysis compares the tool inventory against automation goals, then identifies every missing connection: which integrations do not exist, how hard they would be to build, and how much manual time they cost.
  • Level 3: Code Generation writes actual, runnable Python code to connect the highest-priority tool pairs. Not boilerplate: real auth handling, pagination, retry logic, typed errors, and unit tests.

The Hallucination Problem and How It Is Solved

The core design rule: the LLM proposes, code disposes.

AI models make things up. They will confidently name a software system that never appeared in any document, or invent an API endpoint that does not exist. Every level of this pipeline treats the LLM as the only untrusted component and passes every LLM output through a deterministic code guard rail before accepting it:

Level LLM proposes Code verifies by
1 A system name and a verbatim quote from the source document Substring-searching the actual source text. Rejects anything not physically present in the file.
2 Which systems are involved in a workflow and whether they are connected Cross-referencing against the verified Level 1 inventory. Marks a link live only if a stored evidence quote proves it.
3 What entities and fields a connector should map Rendering code from hand-written, pre-tested templates. The LLM never writes auth, retry, or pagination logic.

Confidence scores and effort estimates are computed by code, never by the model. The full reasoning trail is written to pipeline_log.jsonl and every decision is auditable.


System Architecture

Pipeline Overview

Flowgent Pipeline Diagram


Level 1: Systems Discovery

Input: Any folder of business documents (PDF, DOCX, XLSX, HTML, PNG, CSV, and more)

What happens inside:

  1. Format Router detects the real file type using magic bytes, not the file extension. A .pdf that is actually a Word document gets caught and routed correctly.
  2. Text Chunker splits long documents into overlapping segments so no mention of a system is lost at a page boundary.
  3. LLM Extraction asks the model to identify software systems and return a verbatim quote from the chunk as evidence. This output is untrusted.
  4. Quote Guard Rail checks that the returned quote is a literal substring of the source chunk. If it is not, the finding is rejected. This is the hallucination firewall.
  5. Taxonomy Resolver maps 200+ known aliases ("SFDC", "Salesforce CRM", "salesforce.com") to canonical system names so duplicates across documents collapse correctly.
  6. Deterministic Scorer calculates a confidence score in code based on how many independent documents mention the system, how strong the evidence quote is, and what category of system it belongs to. The model never assigns its own confidence.

Output: system_inventory.json and report.md


Level 2: Gap Analysis

Input: system_inventory.json from Level 1, and optionally a use_cases.json file listing the automation workflows that matter to the business.

Note on use_cases.json: This file is completely optional. Without it, the agent infers generic workflow goals from the discovered system inventory. Providing it focuses the analysis on the specific integrations the business actually cares about and generally produces more relevant gap rankings. See samples/apex_digital/use_cases_apex.json for the expected format.

What happens inside:

  1. Workflow Mapper asks the LLM which systems are involved in each use case and what data flows between them. Output is untrusted.
  2. Inventory Guard Rail cross-references every system the LLM named against the verified Level 1 inventory. Any system not in the inventory is rejected, preventing the model from inventing connections to systems that do not exist.
  3. Connection Checker marks each connection as live, missing, or blocked. A connection is marked live only if an evidence quote in the inventory proves it. Blocked means a required prerequisite system is absent.
  4. Effort and ROI Scorer estimates implementation effort in hours and computes a business impact multiplier using a deterministic formula. The model never estimates effort.

Output: gap_analysis.json and gap_analysis_report.md


Level 3: Code Generation

Input: gap_analysis.json from Level 2, specifically the top-N highest-priority gaps.

A naive implementation would ask the LLM to write a Shopify-to-QuickBooks connector. It would confidently invent endpoint paths, authentication headers, and pagination schemes that look plausible but are wrong. This fails the zero-hallucination requirement.

What happens inside:

  1. Spec Registry Lookup (spec_registry.py) checks a curated database of real, verifiable API facts: base URLs, authentication methods (OAuth2, API key, Bearer token), and pagination styles (cursor, page-number, Link header). For any system in the registry, these facts are treated as ground truth.
  2. LLM Spec Resolution (spec_resolver.py) asks the model to propose which entities to connect and how fields map between them. For registry systems, the model's proposed API facts are overridden by registry data. For unknown systems, the output is flagged as unverified.
  3. Template Renderer (templates.py) generates the actual Python code from hand-written, unit-tested templates. The auth handshake, retry loop with exponential backoff, rate limiter, pagination iterator, and typed error taxonomy are written by a human engineer. The LLM never touches this logic.
  4. Validation Guard Rail (validation.py) parses the generated code as an AST to catch syntax errors, imports the module to verify it loads, then runs the generated pytest suite in a sandboxed subprocess with a strict wall-clock timeout.

Output: outputs/connectors/<gap_id>/ per connector, plus generation_manifest.json and generation_report.md

What Each Generated Connector Contains

Each connector package lands in outputs/connectors/<gap_id>/:

File What is inside
connector.py Environment-variable auth, full CRUD per entity, transparent pagination iterator, rate limiter, retries with exponential backoff, typed error classes, and a run() workflow method.
test_connector.py Offline unit tests covering auth headers, pagination behavior, 429 rate-limit retries, HTTP error handling, create operations, and field mapping. Minimum 7 tests per connector.
agent.yaml Agent definition with system prompt, tool bindings wired to the real connector methods, workflow steps, and guardrails.
README.md Per-connector setup instructions, credential environment variable list, and usage examples.
requirements.txt Dependency manifest for the connector package.

Readiness scores are computed per connector:

  • Registry API with ground-truth facts available: approximately 85% ready
  • AI-inferred API with facts flagged as unverified: approximately 60% ready
  • Any connector that fails its own generated tests: capped at 45%

Technology Stack

Component Choice Why
Language Python 3.11+ Best LLM and document tooling ecosystem.
LLM provider Groq Fastest inference (LPU hardware), OpenAI-compatible API.
Primary model llama-3.3-70b-versatile Strong structured extraction, reliable JSON mode.
Fallback chain gpt-oss-120b then qwen3-32b Independent rate-limit pools: if primary is throttled, the fallback picks up automatically.
Vision model llama-4-scout-17b Multimodal, handles scanned documents and image-only PDFs.
Schema validation Pydantic v2 Strict type enforcement on every LLM response: malformed output fails fast.
HTTP client (connectors) requests Ubiquitous, human-reviewable, no magic.
Document parsers PyMuPDF, python-docx, openpyxl, BeautifulSoup4 Native, battle-tested parsers for PDF, Word, Excel, and HTML.
Rate limiting Token-bucket with exponential backoff and jitter Handles Groq per-model rate limits gracefully at scale.

Project Layout

FLOWGENT_CODEBASE/
│
├── main.py                          # CLI entry point: parses flags and routes to the correct pipeline
├── requirements.txt                 # All pinned Python dependencies
├── pytest.ini                       # Test runner configuration (rootdir, test discovery paths)
├── .env.example                     # Template showing every required environment variable
├── writeup.md                       # Architecture decisions and trade-offs
├── discovery_agent/                 # Core engine: all pipeline logic lives here
│   │
│   ├── config.py                    # Loads .env, validates required keys, exposes settings object
│   ├── models.py                    # Pydantic schemas for all Level 1 data structures
│   ├── observability.py             # Structured JSON logging to pipeline_log.jsonl
│   ├── llm.py                       # Resilient Groq API client: rate limiter, backoff, fallback chain, JSON repair
│   ├── utils.py                     # Shared helpers: JSON recovery, text cleaning, retry wrappers
│   ├── taxonomy.py                  # Alias resolution: maps 200+ system name variants to canonical names
│   ├── chunking.py                  # Splits document text into overlapping segments for LLM processing
│   ├── extraction.py                # Level 1 LLM prompts: asks model to identify systems and return evidence quotes
│   ├── verification.py              # Anti-hallucination guard rail: substring-checks every LLM quote against source
│   ├── aggregation.py               # Deduplicates system findings across multiple documents and chunks
│   ├── scoring.py                   # Deterministic confidence calculator: code-based, never LLM-based
│   ├── pipeline.py                  # Level 1 orchestrator: wires ingest, extract, verify, score, output
│   │
│   ├── ingestion/                   # Document format parsers, one module per file type
│   │   ├── __init__.py              # Format router: dispatches each file to the right parser using magic bytes
│   │   ├── pdf_parser.py            # PDF extraction via PyMuPDF; falls back to vision model for image-only pages
│   │   ├── docx_parser.py           # Word document extraction via python-docx
│   │   ├── xlsx_parser.py           # Excel extraction via openpyxl; handles formulas and multi-sheet files
│   │   ├── html_parser.py           # HTML extraction via BeautifulSoup4; strips tags and navigation boilerplate
│   │   ├── image_parser.py          # PNG/JPG transcription via llama-4-scout-17b vision model
│   │   └── text_parser.py           # Plain text and CSV fallback parser
│   │
│   ├── data/
│   │   └── taxonomy.json            # 200+ curated system name aliases (example: SFDC maps to Salesforce)
│   │
│   ├── gap_models.py                # Pydantic schemas for all Level 2 data structures (gaps, workflows, connections)
│   ├── gap_pipeline.py              # Level 2 orchestrator: wires inventory, map, ground, score, output
│   │
│   ├── gap_analysis/                # Level 2 logic modules
│   │   ├── __init__.py
│   │   ├── mapper.py                # LLM workflow mapper: proposes which systems connect for each use case
│   │   ├── grounder.py              # Guard rail: cross-references every LLM system claim against Level 1 inventory
│   │   ├── scorer.py                # Computes effort hours and business impact multiplier in deterministic code
│   │   └── reporter.py              # Renders gap_analysis_report.md including Mermaid integration diagrams
│   │
│   ├── codegen_models.py            # Pydantic schemas for all Level 3 data structures (connector specs, manifests)
│   ├── codegen_pipeline.py          # Level 3 orchestrator: wires gaps, resolve, render, validate, output
│   │
│   └── code_generation/             # Level 3 logic modules
│       ├── __init__.py
│       ├── spec_registry.py         # Curated database of real API facts (base URLs, auth schemes, pagination styles)
│       ├── spec_resolver.py         # Merges LLM field-mapping proposals with registry facts; registry always wins
│       ├── templates.py             # Deterministic Jinja2 templates for connector.py, tests, agent.yaml, README
│       ├── validator.py             # Guard rail: AST parse, import check, runs generated pytest in a subprocess
│       └── reporter.py              # Renders generation_report.md with readiness scores and validation status
│
├── samples/
│   ├── apex_digital/                # Nine-document demo corpus across five formats, ready to run immediately
│   │   ├── active_clients.csv
│   │   ├── billing_and_invoicing.txt
│   │   ├── client_onboarding_checklist.txt
│   │   ├── monthly_reporting_sop.txt
│   │   ├── ops_notes_slack_export.html
│   │   ├── team_handbook.txt
│   │   ├── tech_audit_notes.md
│   │   ├── tool_subscriptions.csv
│   │   └── use_cases_apex.json      # Example automation goals matching the apex_digital dataset
│   └── .gitkeep
│
├── tests/                           # 58 offline unit tests, no API calls, no network required
│   ├── test_ingestion.py            # Format parser tests across all supported file types
│   ├── test_extraction.py           # LLM extraction prompt construction and response parsing tests
│   ├── test_verification.py         # Anti-hallucination guard rail and substring-matching tests
│   ├── test_gap_analysis.py         # Gap mapper, inventory grounder, and effort scorer tests
│   └── test_codegen.py              # Connector rendering and validation tests (physically renders and executes connectors)
│
├── inputs/                          # PUT YOUR DOCUMENTS HERE before running --input ./inputs
│
└── outputs/                         # All generated artifacts appear here after a run
    ├── system_inventory.json
    ├── report.md
    ├── gap_analysis.json
    ├── gap_analysis_report.md
    ├── generation_manifest.json
    ├── generation_report.md
    ├── pipeline_log.jsonl
    └── connectors/
        └── <gap_id>/
            ├── connector.py
            ├── test_connector.py
            ├── agent.yaml
            ├── README.md
            └── requirements.txt

How to Run It

Step 1: Install

python -m pip install -r requirements.txt

Step 2: Configure API Key

cp .env.example .env

Open .env and fill in your GROQ_API_KEY. Then verify the setup works:

python main.py --check
# Expected output: "Configuration valid. API connection OK."

Step 3: Choose Your Input

Option A: Run on the included sample (no files needed)

The samples/apex_digital/ folder contains nine real-format documents for a fictional example company. Use this to see the full pipeline working immediately, with no setup beyond the API key.

python main.py --input ./samples/apex_digital --output ./outputs

Option B: Run on your own company documents

Before running, copy your documents into the ./inputs/ folder. Supported formats include PDF, Word (.docx), Excel (.xlsx), HTML, PNG, JPG, CSV, and plain text. Formats can be mixed freely: the agent detects each file's type from its content, not its extension.

python main.py --input ./inputs --output ./outputs

Step 4: Run the Full Pipeline (all three levels)

Add --level 3 to run Systems Discovery, Gap Analysis, and Code Generation in a single command. The --use-cases flag is optional but recommended: it tells the agent which workflows your business actually cares about, producing more relevant gap rankings. Without it, the agent infers generic workflow goals from the discovered inventory.

# With a use-cases file (recommended)
python main.py --input ./inputs --output ./outputs --level 3 --use-cases ./samples/apex_digital/use_cases_apex.json --max-connectors 3

# Without a use-cases file (agent infers goals from the inventory)
python main.py --input ./inputs --output ./outputs --level 3 --max-connectors 3

After the run, open report.md, gap_analysis_report.md, and generation_report.md in ./outputs/ for human-readable summaries of each level.

Step 5: Skip Already-Completed Levels

If Level 1 has already been run, there is no need to reprocess all documents to run Level 2. The standalone flags let the pipeline pick up from an existing artifact:

# Run only Level 2 using an existing inventory
python main.py --inventory ./outputs/system_inventory.json --use-cases ./samples/apex_digital/use_cases_apex.json

# Run only Level 3 using an existing gap analysis
python main.py --gaps ./outputs/gap_analysis.json --inventory ./outputs/system_inventory.json

CLI Reference

Flag Default What it does
--input PATH None Folder of documents to process through Level 1.
--level, -l N 1 Pipeline depth to run: 1, 2, or 3.
--use-cases PATH None Optional JSON file of target automation workflows. Without it, the agent infers goals from the inventory.
--inventory PATH None Skip Level 1 and start Level 2 from an existing system_inventory.json.
--gaps PATH None Skip Levels 1 and 2 and start Level 3 from an existing gap_analysis.json.
--max-connectors N 3 Maximum number of connectors to generate in Level 3.
--output PATH ./outputs Directory where all generated files are written.
--verbose, -v off Stream structured pipeline events to stdout as the run progresses.
--check N/A Validate config and test the API key, then exit without running the pipeline.

Output Files

File Produced by Who reads it What is inside
system_inventory.json Level 1 Downstream systems and Level 2 Full inventory with evidence quotes, source file references, and run metadata.
report.md Level 1 Human reviewer System tables, extracted evidence, confidence tiers, and run summary.
gap_analysis.json Level 2 Downstream systems and Level 3 Prioritized gap list with effort estimates, dependency map, and connection status per link.
gap_analysis_report.md Level 2 Human reviewer Layered summary and Mermaid integration diagrams showing what is connected versus missing.
connectors/<gap_id>/ Level 3 Engineer implementing the integration Full connector package: code, tests, agent definition, setup instructions.
generation_manifest.json Level 3 Downstream systems Connector specs, readiness scores, and validation proof (pass/fail plus test output).
generation_report.md Level 3 Human reviewer Summary of connectors built, readiness scores, and test results.
pipeline_log.jsonl All levels Auditor or debugger One structured JSON entry per architectural decision across the full run.

Edge Case Handling

Document Ingestion (Level 1)

Scenario How it is handled
File extension does not match actual format Magic byte detection identifies the real type before parsing begins.
Corrupted, empty, or zero-byte file Skipped with a log entry; does not crash the run.
Password-protected PDF Caught and skipped with a clear error message in the log.
Scanned PDF with no embedded text Routed to the vision model for image transcription.
Excel formula strings Evaluated to their displayed value before extraction.
Uncommon text encodings UTF-8 attempted first, with fallback to latin-1 and chardet detection.
File too large for a single LLM context Split into overlapping chunks; findings are merged after verification.

LLM Calls (All Levels)

Scenario How it is handled
Model returns a fabricated evidence quote Rejected by the substring guard rail before the finding enters the inventory.
Model invents a system not mentioned in any document Rejected because no quote can be verified against source text.
Model returns invalid or malformed JSON Bracket-slice recovery attempts repair; if unrecoverable, the call is retried.
API rate limit (HTTP 429) Exponential backoff with jitter; automatic fallback to the next model in the chain.
API 5xx error or connection timeout Same backoff chain; event logged to pipeline_log.jsonl.
Prompt injection payload in document text Document content is strictly delimited in every prompt; payload cannot escape to modify instructions.

Code Generation (Level 3)

Scenario How it is handled
LLM invents API endpoint paths Registry facts override the model. Unknown APIs are flagged for human review.
Generated Python has syntax errors ast.parse catches them; connector is written but marked failing, readiness capped at 45%.
Duplicate field mappings in LLM output Deduplicated by destination field in code before the template renders.
Varied pagination styles across APIs Unified iterator handles cursor, page-number, and Link-header styles transparently.
Hardcoded credentials in generated code Prevented by template design: credentials always read from environment variables.
Gap requires a system not in the inventory Marked as blocked and skipped during code generation.
Generated test suite hangs Validation subprocess is killed after a strict wall-clock timeout.

Challenge Acceptance Criteria

Level 1: Systems Discovery

Criterion Status Detail
Process mixed document formats VERIFIED 13+ formats supported via magic-byte routing.
Extract the majority of systems VERIFIED 27 systems found on the demo corpus.
Capture specific metadata fields VERIFIED Name, category, auth type, entities, processes, and criticality per system.
Tiered confidence scoring VERIFIED Deterministic and code-computed, never asked of the model.
Note inferences and missing data VERIFIED Inference flags appended to every sub-HIGH confidence finding.
JSON output with source references VERIFIED Source file, page number, and verified verbatim quote per finding.
Zero hallucinations VERIFIED Substring guard rail rejects any claim the source text does not support.

Level 2: Gap Analysis

Criterion Status Detail
Map use cases to systems VERIFIED Every workflow grounded against the Level 1 inventory.
Trace data flows VERIFIED Source system, destination system, entity type, and trigger captured per flow.
Mark connection status VERIFIED Live, Missing, or Blocked: determined by inventory evidence, not model output.
Estimate implementation effort VERIFIED Hours computed by code formula, not LLM-estimated.
Prioritize by business impact VERIFIED Ranked by a code-computed multiplier combining time saved and system criticality.
Dependency mapping VERIFIED Prerequisite blocking logic prevents impossible gaps from being recommended.

Level 3: Code Generation

Criterion Status Detail
Working Python code VERIFIED Every generated connector imports cleanly and executes without error.
Core API logic implemented VERIFIED Auth, CRUD, pagination, rate limiting, retries, and typed error handling present in every connector.
Unit tests included VERIFIED Minimum 7 tests per connector, auto-executed during the validation guard rail.
Agent definition included VERIFIED agent.yaml generated with real tool bindings to the connector methods.
Setup instructions VERIFIED Per-connector README.md and requirements.txt generated automatically.
70 to 80 percent production-ready VERIFIED Readiness score computed from registry versus AI-inferred logic split.

Test Suite

python -m pytest

58 offline unit tests. No API calls are made during the test run: all LLM interactions are mocked at the boundary. The Level 3 tests are an exception: they physically render connectors using real templates and execute the generated pytest suite in a sandbox to prove the output is runnable code, not just syntactically valid text.

Test file coverage:

File What it tests
test_ingestion.py Format router, each parser type, magic byte detection, malformed file handling
test_extraction.py LLM prompt construction, response parsing, Pydantic schema validation
test_verification.py Quote guard rail, substring matching, rejection of fabricated quotes
test_gap_analysis.py Workflow mapper, inventory grounder, effort scorer, ROI calculation
test_codegen.py Spec registry lookup, template rendering, AST validation, generated test execution

Security

Control Implementation
No hardcoded secrets API keys loaded exclusively from .env; .env is in .gitignore and excluded from every generated artifact.
File type validation Magic byte check runs before any parser is invoked: a renamed malicious file cannot reach the parsing layer.
Prompt injection defense Document text is wrapped in strict delimiters inside every prompt. Content inside the delimiters cannot escape to modify the instruction.
No eval or exec LLM output is always parsed as JSON via Pydantic. It is never passed to eval(), exec(), or subprocess with shell=True.
Generated connector safety Connector templates read credentials from environment variables. The LLM cannot produce hardcoded secrets through the template system.
Subprocess sandboxing The validation subprocess that runs generated tests has a wall-clock timeout and no access to the parent process environment.

Known Limitations

Field mappings are LLM-inferred. The connector scaffolding (auth, pagination, retries) is deterministic and reliable. The data field mapping between two APIs is the model's best guess at what corresponds to what. Every generated connector explicitly flags this for human engineering review before production use.

Registry coverage. The curated spec_registry.py covers the most common business SaaS APIs. Systems outside the registry are handled by AI inference: connectors for those systems get lower readiness scores and are explicitly marked as unverified. Adding a new API to the registry is straightforward and documented inside spec_registry.py.

Groq rate limits at scale. Large document corpora (100 or more files) will hit Groq's per-model rate limits. The system handles this gracefully: it backs off, rotates to the fallback model chain, and continues. Total processing time increases for large corpora. Run with --verbose to monitor progress.


About

An autonomous, zero-hallucination agent that analyzes unstructured corporate documents to discover software systems, identify integration gaps, and generate production-ready connector automation code.

Resources

Stars

Watchers

Forks

Contributors

Languages