NullLabTests/emergence_observatory

Python 3.10+ · MIT License · Mistral LLM · Persistent agents · JSONL persistence · CI passing · 15-50 agents · 20-100+ ticks · Experimental


🔬 Emergence Observatory

An LLM-native multi-agent laboratory for studying emergent social behavior (vocabulary formation, proposals and voting, knowledge sharing, and group dynamics) through real Mistral API calls.

Recent: Semantic drift tracking (meanings evolve through LLM reinterpretation), parallel multi-seed runner, linguistic analysis (Zipf/Heaps), contagion modeling, auto LaTeX reporting, CI pipeline, and a live Flask dashboard. See full changelog.

About · Live Demo · Architecture · How It Works · Experiments · Real Interactions · Quick Start


🧬 What This Is

This is not an AGI project. It is not an autonomous coding framework. This is a scientific instrument: a controlled laboratory for observing whether and how collective behaviors emerge from populations of autonomous LLM-backed agents.

The Core Idea

Agents with persistent identities, personalities, goals, and memories inhabit a shared grid world. They move, gather resources, invent words with original definitions, propose and vote on governance norms, research topics, and share knowledge through a hivemind. Every decision is driven by real Mistral API calls with rate limiting, retry logic, and actual latency costs.
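The rate limiting and retry behaviour mentioned above can be illustrated with a small wrapper. This is a minimal sketch, not the code in cognition/mistral_bridge.py: the class name `RateLimitedCaller`, the exponential backoff schedule, and the injectable `sleep` and `clock` parameters are all assumptions made for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RateLimitedCaller:
    """Hypothetical helper: spaces calls to fit an RPM budget and retries failures."""

    def __init__(self, rpm: int, max_retries: int = 3,
                 sleep: Callable[[float], None] = time.sleep,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.min_interval = 60.0 / rpm      # minimum seconds between two calls
        self.max_retries = max_retries
        self._sleep = sleep                 # injectable so tests need not wait
        self._clock = clock
        self._last_call: Optional[float] = None

    def _wait_for_slot(self) -> None:
        # Sleep only if the previous call was less than min_interval ago.
        if self._last_call is not None:
            remaining = self.min_interval - (self._clock() - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
        self._last_call = self._clock()

    def call(self, fn: Callable[[], T]) -> T:
        for attempt in range(self.max_retries + 1):
            self._wait_for_slot()
            try:
                return fn()
            except Exception:
                if attempt == self.max_retries:
                    raise                    # retries exhausted, surface the error
                self._sleep(2 ** attempt)    # assumed backoff: 1 s, 2 s, 4 s, ...
        raise RuntimeError("unreachable")
```

Wrapping each LLM request as `caller.call(lambda: client_request(prompt))` (where `client_request` is whatever function performs the request) keeps the pacing and retry policy in one place.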

Key Differentiators

  • Open-ended invented vocabulary: agents create language de novo with original definitions rather than selecting from a fixed set
  • Voting embedded in ongoing life: governance is one of 16 actions, not a separate deliberation phase
  • Word lifecycle tracking: birth → peak adoption → extinction, measured per seed
  • Real API calls: actual network latency, rate limits, and failures, not simulation
  • Reproducible experiments: multi-seed runner with condition comparison and per-seed CSVs
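The word lifecycle (birth, peak adoption, extinction) can be sketched as a small record type. This is an illustrative assumption, not the data model in experiments/novelty_ledger.py: `WordRecord` and its methods are hypothetical names, and "extinction" is taken here to mean the first tick after birth at which no agent holds the word.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class WordRecord:
    """Hypothetical per-word ledger entry."""
    word: str
    birth_tick: int
    adopters_by_tick: dict[int, int] = field(default_factory=dict)

    def observe(self, tick: int, adopters: int) -> None:
        """Record how many agents hold the word at a given tick."""
        self.adopters_by_tick[tick] = adopters

    def peak(self) -> tuple[int, int]:
        """Return (tick, adopters) at the earliest tick of maximum adoption."""
        ticks = sorted(self.adopters_by_tick)
        best = max(ticks, key=self.adopters_by_tick.__getitem__)
        return best, self.adopters_by_tick[best]

    def extinction_tick(self) -> Optional[int]:
        """First tick after birth with zero adopters, or None if still alive."""
        for tick in sorted(self.adopters_by_tick):
            if tick > self.birth_tick and self.adopters_by_tick[tick] == 0:
                return tick
        return None
```

A word's lifetime then follows as `extinction_tick() - birth_tick` when the word has gone extinct.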

Primary Research Questions

| Phenomenon | What We Measure | Instrument |
| --- | --- | --- |
| 🗣️ Vocabulary formation | Newly invented words, definitions, adoption rate, survival, extinction | experiments/novelty_ledger.py |
| 🗳️ Proposal & voting | Norms proposed, votes cast, quorum reached, adopted norms over time | cognition/proposal_system.py |
| 📚 Knowledge sharing | LLM-powered research findings, hivemind contributions, information propagation | cognition/serper_bridge.py |
| 👥 Group formation | Groups formed, shared purpose, membership duration, leadership | core/agent.py |
| 🏛️ Cultural persistence | Word lifetimes, norm stickiness, alliance durability | metrics/collector.py |
| 🕸️ Social networks | Relationship graph, affinity scores, communication structure | core/agent.py |

📊 Live Dashboard

A real-time Flask SSE dashboard runs at http://127.0.0.1:5000. All three files in viz/ are fully implemented and wired to the simulation:

pip install flask
python run.py --agents 20 --batch 5
# Open http://127.0.0.1:5000

The dashboard streams live snapshots via Server-Sent Events (agent positions, metrics, conversations, proposals, knowledge topics), all updating in real time:

[Screenshot: Emergence Observatory live dashboard]

Dashboard Panels

| Panel | Data Source |
| --- | --- |
| World map | Canvas rendering of agent positions, color-coded by energy |
| Conversation log | Every utterance verbatim with speaker and reasoning |
| Vocabulary tracker | Invented words with definitions and adoption counts |
| Proposal board | Open/closed proposals, YEA/NAY counts, passed norms |
| Metric cards | Tick, agents, energy, vocab size, groups, norms, research, votes |
| Knowledge topics | Hivemind contribution topics |
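For readers unfamiliar with Server-Sent Events, the wire format is plain text: optional `event:` and `data:` fields followed by a blank line. The sketch below shows how one snapshot could be serialized; the function name `sse_message`, the event name `snapshot`, and the snapshot keys are assumptions, not taken from viz/app.py.

```python
import json


def sse_message(snapshot: dict, event: str = "snapshot") -> str:
    """Serialize one snapshot as a single Server-Sent Events message."""
    # Compact JSON has no raw newlines, so the payload fits on one data line.
    payload = json.dumps(snapshot, separators=(",", ":"))
    # A blank line terminates the message.
    return f"event: {event}\ndata: {payload}\n\n"


print(sse_message({"tick": 3, "agents": 20}), end="")
```

In a Flask view, strings like this would typically be yielded from a generator wrapped in a response with the `text/event-stream` MIME type.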

🏗️ Architecture

flowchart TB
    subgraph Core["core/"]
        AG[agent.py<br/>Persistent identity,<br/>personality, memories]
        WO[world.py<br/>Grid world, resources,<br/>locations]
        SI[simulation.py<br/>Tick loop orchestrator]
    end

    subgraph Cognition["cognition/"]
        MB[mistral_bridge.py<br/>API client, rate limit, retry]
        CS[cognition_service.py<br/>Prompt builder, dispatch]
        PR[prompts.py<br/>System prompts, action templates]
        PS[proposal_system.py<br/>Voting registry, quorum]
        SB[serper_bridge.py<br/>LLM-powered research]
    end

    subgraph Memory["memory/"]
        MS[memory_store.py<br/>JSON file persistence]
    end

    subgraph Metrics["metrics/"]
        MC[collector.py<br/>Emergence metrics]
    end

    subgraph Replay["replay/"]
        RC[recorder.py<br/>JSONL interaction log]
        PL[player.py<br/>Post-hoc replay]
    end

    subgraph Experiments["experiments/"]
        RN[runner.py<br/>Multi-seed runner]
        NL[novelty_ledger.py<br/>Word lifecycle tracker]
        VB[voting_vs_baseline/<br/>Experiment data]
    end

    subgraph Viz["viz/"]
        AP[app.py<br/>Flask SSE server]
        TP[templates/index.html]
        SV[static/viz.js]
    end

    SI --> AG & WO & CS
    CS --> MB & PS & SB
    SI --> MC & RC
    AG --> MS
    RN --> SI & NL
    AP --> SI

⚙️ How It Works

The Tick Loop

flowchart LR
    R[Regenerate<br/>Resources] --> E[Process<br/>Word Extinction]
    E --> V[Process<br/>Proposal Voting]
    V --> A[Select N<br/>Active Agents]
    A --> L{For Each Agent}
    L --> C[Build Context<br/>Location · Memories · Goals · Proposals]
    C --> M[Call Mistral LLM<br/>→ Structured JSON Action]
    M --> X[Execute Action<br/>One of 16]
    X --> P[Persist State<br/>to JSON Files]
    P --> L
    L -->|Done| M2[Compute Metrics<br/>Vocab Β· Norms Β· Groups Β· Graph]
    M2 --> J[Append to<br/>JSONL Replay Log]
    J --> D[Push Snapshot<br/>to Dashboard via SSE]
    D --> R
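The core of the loop in the flowchart (select active agents, build context, get an action, execute, log) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the logic of core/simulation.py: `run_tick` is a hypothetical name, `decide` stands in for the Mistral call, and the "gather adds 1 energy" rule is a placeholder.

```python
import json
import random
from typing import Callable


def run_tick(tick: int, agents: list[dict], decide: Callable[[dict], dict],
             batch: int, rng: random.Random, log: list[str]) -> dict:
    """Run one simplified tick and return a metrics snapshot."""
    # Select N active agents for this tick.
    active = rng.sample(agents, k=min(batch, len(agents)))
    for agent in active:
        # Build a (heavily reduced) context for the decision.
        context = {"tick": tick, "agent_id": agent["agent_id"],
                   "energy": agent["energy"]}
        action = decide(context)  # stands in for the LLM call
        if action["action"] == "gather":
            agent["energy"] += 1.0  # placeholder effect, not the project's rule
        # Append one JSON line per interaction, as a replay log would.
        record = {"tick": tick, "agent_id": agent["agent_id"], **action}
        log.append(json.dumps(record))
    mean_energy = sum(a["energy"] for a in agents) / len(agents)
    return {"tick": tick, "acted": len(active), "mean_energy": mean_energy}


agents = [{"agent_id": i, "energy": 10.0} for i in range(4)]
log: list[str] = []
snapshot = run_tick(1, agents, lambda ctx: {"action": "gather", "params": {}},
                    batch=2, rng=random.Random(0), log=log)
print(snapshot)
```

Passing the random generator in explicitly is what makes a seeded run repeatable in this sketch.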

Agent Model

Each agent is a persistent object stored as JSON, carrying state across ticks:

| Field | Type | Example |
| --- | --- | --- |
| agent_id | int | 5 |
| personality_traits | list[str] | ["curious", "generous", "inventive"] |
| biography | str | "Born from light in the crystal cave..." |
| goals | list[str] | ["Map the eastern plains", "Build a community"] |
| memory.short_term | list | Last 20 experiences |
| memory.episodic | list | Up to 100 consolidated memories |
| memory.relationships | dict[int, float] | {2: 0.8, 7: -0.2, 12: 0.5} |
| vocabulary | dict[str, str] | {"lumi": "the dancing light...", "veth": "to seek..."} |
| knowledge_base | list[str] | Hivemind research contributions |
| social_rank | float | 3.2 |
| group_id | int or None | None |
| position | (int, int) | (43, 1) |
| energy | float | 98.3 |
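Storing this model as JSON has two practical wrinkles: JSON object keys are always strings, so the integer keys of the relationships map must be converted back on load, and tuples come back as lists. The sketch below shows a round trip for a subset of the fields. `AgentState` is a hypothetical class for illustration (with relationships flattened to the top level), not the class in core/agent.py.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class AgentState:
    """Hypothetical subset of the agent fields listed above."""
    agent_id: int
    personality_traits: list[str] = field(default_factory=list)
    relationships: dict[int, float] = field(default_factory=dict)
    vocabulary: dict[str, str] = field(default_factory=dict)
    group_id: Optional[int] = None
    position: tuple[int, int] = (0, 0)
    energy: float = 100.0

    def to_json(self) -> str:
        # json.dumps turns int keys into strings and tuples into arrays.
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, text: str) -> "AgentState":
        raw = json.loads(text)
        # Undo the two lossy conversions made by JSON.
        raw["relationships"] = {int(k): v for k, v in raw["relationships"].items()}
        raw["position"] = tuple(raw["position"])
        return cls(**raw)


agent = AgentState(5, ["curious"], {2: 0.8, 7: -0.2}, {"lumi": "the dancing light"},
                   None, (43, 1), 98.3)
assert AgentState.from_json(agent.to_json()) == agent
```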

16 Agent Actions

Every decision is structured JSON returned by the LLM:

{
  "action": "invent_word",
  "params": {
    "word": "lumi",
    "meaning": "the dancing light I first saw in the crystal cave"
  },
  "reasoning": "This word can help me share the memory and beauty I experienced."
}
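Because the action arrives as LLM-generated text, it is worth validating before execution. The sketch below checks the three keys shown above against the 16 action names listed in this README. Falling back to `ignore` on malformed output is an assumption made for illustration; the project may handle failures differently.

```python
import json

# The 16 action names listed in this README.
ACTIONS = frozenset({
    "move", "speak", "gather", "remember", "teach", "follow", "share_resource",
    "invent_word", "cooperate", "propose", "vote", "research", "hivemind",
    "form_group", "join_group", "ignore",
})

# Assumed fallback when the response cannot be used.
FALLBACK = {"action": "ignore", "params": {}, "reasoning": "unparseable response"}


def parse_action(raw: str) -> dict:
    """Parse an LLM response into a validated action dict."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return dict(FALLBACK)
    if not isinstance(data, dict):
        return dict(FALLBACK)
    action = data.get("action")
    if not isinstance(action, str) or action not in ACTIONS:
        return dict(FALLBACK)
    params = data.get("params")
    return {
        "action": action,
        "params": params if isinstance(params, dict) else {},
        "reasoning": str(data.get("reasoning", "")),
    }
```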
All 16 actions, with examples from real runs:

| Action | Description | Real Example |
| --- | --- | --- |
| move | Travel to a location | → (12, 5) |
| speak | Communicate with an agent | → Agent 2: "I found blue crystals by the river" |
| gather | Collect resources | → +3 wood, +1 stone |
| remember | Consolidate a memory | → "The light taught me awareness" |
| teach | Share knowledge (with optional meaning, which enables semantic drift) | → Agent 7 learns "lumi" as "shimmering light" |
| follow | Follow a nearby agent | → Following Agent 3 |
| share_resource | Give resources | → Gives 2 wood to Agent 8 |
| invent_word | Create a word with a meaning | → "veth" = "the act of seeking or searching" |
| cooperate | Form an alliance | → Alliance with Agent 9 |
| propose | Submit a governance norm | → "Foundational Laws for Fairness and Loyalty" |
| vote | Cast a vote | → YEA on proposal #2 |
| research | Research via LLM's training knowledge | → "what is light" → 3 findings |
| hivemind | Share knowledge with collective | → Contributes to shared pool |
| form_group | Propose a social group | → "Let us form the Explorers Guild" |
| join_group | Join an existing group | → Joins group #1 |
| ignore | Do nothing | → Waits |

🔬 Experiments

Controlled experiments live in experiments/. Each experiment varies one parameter, runs 3+ seeds per condition, and writes per-seed metrics + a novelty ledger + summary CSVs.

Controlled experiments compare three governance conditions:

| Condition | proposals_enabled | vote_ticks_open | What it measures |
| --- | --- | --- | --- |
| no_proposals | false | n/a | Baseline without any governance deliberation |
| baseline | true | 9999 | Deliberation without enactment (proposals never close) |
| voting | true | 6 (quorum 25%) | Full deliberation + enactment |
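How vote_ticks_open and the quorum interact can be sketched as follows. The rule used here is an assumption chosen to match the table (a proposal is tallied once its voting window has elapsed, and passes if turnout reaches the quorum and YEA outnumbers NAY); the actual rule lives in cognition/proposal_system.py and may differ. `Proposal` and `resolve` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    """Hypothetical proposal record."""
    title: str
    opened_tick: int
    yea: int = 0
    nay: int = 0


def resolve(p: Proposal, tick: int, n_agents: int,
            vote_ticks_open: int, quorum: float) -> str:
    """Return 'open', 'passed', or 'failed' under the assumed rule."""
    if tick - p.opened_tick < vote_ticks_open:
        return "open"  # voting window has not elapsed yet
    turnout_ok = (p.yea + p.nay) >= quorum * n_agents
    return "passed" if turnout_ok and p.yea > p.nay else "failed"


p = Proposal("Foundational Laws for Fairness and Loyalty", opened_tick=15, yea=3, nay=1)
print(resolve(p, tick=21, n_agents=15, vote_ticks_open=6, quorum=0.25))     # voting condition
print(resolve(p, tick=21, n_agents=15, vote_ticks_open=9999, quorum=0.25))  # baseline condition
```

Under this rule the baseline condition, with a window of 9999 ticks, never reaches a tally in a 20-tick run, which matches "deliberation without enactment".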

Latest: Voting vs Baseline (6 seeds, 15 agents, 20 ticks)

[Figure: Vocabulary growth]

[Figure: Metrics comparison]

| Metric | Baseline (3 runs) | Voting (3 runs) | Interpretation |
| --- | --- | --- | --- |
| Vocab size (tick 20) | 86.3 | 78.3 | Similar linguistic capacity |
| Words invented | 11.3 | 11.0 | Voting does not suppress creativity |
| Mean word lifetime | 16.2 | 15.8 | No extinction yet (short run) |
| Passed norms | 0.0 | 0.33 | Voting enables governance |
| Alliances / Groups | 0 | 0 | Need longer runs |
| LLM failures | 0 | 0 | Zero across 600 calls |

Key finding: Voting enables norm passage without suppressing linguistic innovation. See papers/preliminary_findings.md.

Analysis Tools

After an experiment completes, run the analysis pipeline:

# Linguistic: Zipf α, Heaps β, between-condition Mann-Whitney U
python experiments/linguistic_analysis.py -d experiments/<name>

# Semantic drift: meaning consensus, drift magnitude, propagation
python experiments/semantic_drift.py -d experiments/<name>

# Contagion: SIR adoption curves, growth rate, carrying capacity
python experiments/contagion.py -d experiments/<name>
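As background for the Zipf α reported by the linguistic analysis: Zipf's law says the frequency of the word at rank r is roughly proportional to r^(-α). One common way to estimate α is a least-squares fit of log frequency against log rank, sketched below. Whether experiments/linguistic_analysis.py uses this estimator is not stated here, so treat `zipf_alpha` as a hypothetical stand-in.

```python
import math
from collections import Counter


def zipf_alpha(tokens: list[str]) -> float:
    """Estimate the Zipf exponent by least squares on the log-log rank/frequency plot."""
    counts = sorted(Counter(tokens).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two distinct words")
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope  # frequency falls with rank, so the slope is negative


# Counts 120, 60, 40, 30 follow 120 / rank exactly, so alpha is 1.
tokens = ["lumi"] * 120 + ["veth"] * 60 + ["suna"] * 40 + ["ael"] * 30
print(round(zipf_alpha(tokens), 6))
```

With the handful of invented words a 20-tick run produces, such a fit is noisy, so estimates from short runs deserve caution.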

Parallel Runner

Run seeds in parallel across CPU cores (cuts wall time by roughly a factor equal to the number of workers):

python experiments/parallel_runner.py --name my_experiment \
  --runs 10 --ticks 50 --agents 20 --batch 10 --workers 4

Performance

Each LLM call takes roughly 1.5 to 2.5 s. With 300 RPM and 4 parallel workers:

| Configuration | LLM calls | Wall time (sequential) | Wall time (4 workers) |
| --- | --- | --- | --- |
| 10 seeds × 50 ticks | 5,000 | ~4 h | ~1 h |
| 3 seeds × 20 ticks | 300 | ~15 min | ~5 min |
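The figures in the table can be approximated with simple arithmetic: calls = seeds × ticks × agents acting per tick, and wall time is calls × latency divided by the number of seeds that actually run concurrently. The batch sizes below (10 and 5) are inferred from the call counts rather than stated in the table, and the estimate ignores the tick interval and retries, so it is a rough lower bound.

```python
def llm_calls(seeds: int, ticks: int, batch: int) -> int:
    """Total LLM calls when `batch` agents act on every tick of every seed."""
    return seeds * ticks * batch


def wall_time_seconds(seeds: int, ticks: int, batch: int,
                      seconds_per_call: float, workers: int = 1) -> float:
    """Rough wall time; parallelism cannot exceed the number of seeds."""
    effective_workers = min(workers, seeds)
    return llm_calls(seeds, ticks, batch) * seconds_per_call / effective_workers


# 10 seeds x 50 ticks, batch 10, at the slow end of the latency range (2.5 s).
print(llm_calls(10, 50, 10))
print(wall_time_seconds(10, 50, 10, 2.5) / 3600)             # hours, sequential
print(wall_time_seconds(10, 50, 10, 2.5, workers=4) / 3600)  # hours, 4 workers

# 3 seeds x 20 ticks, batch 5: only 3 of the 4 workers have a seed to run.
print(wall_time_seconds(3, 20, 5, 2.5) / 60)                 # minutes, sequential
print(wall_time_seconds(3, 20, 5, 2.5, workers=4) / 60)      # minutes, 4 workers
```

This also shows why the small configuration speeds up by about 3× rather than 4×: with three seeds, the fourth worker is idle.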

All runs write per-seed CSV metrics, novelty ledger JSON, drift snapshots (per-tick meaning maps), and a comparison summary to disk.

Full Analysis Pipeline

After an experiment completes, generate all outputs with:

# Linguistic stats + between-condition tests
python experiments/linguistic_analysis.py -d experiments/<name>

# Semantic drift: meaning consensus, drift magnitude
python experiments/semantic_drift.py -d experiments/<name>

# Contagion: SIR adoption curves, growth rate
python experiments/contagion.py -d experiments/<name>

# Matplotlib charts (vocab growth, comparison bars, adoption)
python scripts/plot_results.py -d experiments/<name>

# LaTeX tables for paper
python papers/generate_report.py -d experiments/<name>
pdflatex experiments/<name>/report.tex

All analysis tools produce structured JSON + human-readable terminal output.
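For context on the SIR adoption curves: in a contagion reading of word adoption, S can be taken as agents who have not yet adopted a word, I as agents actively using it, and R as agents who have abandoned it. The sketch below integrates the standard discrete-time SIR equations. It is a generic illustration with made-up parameter values, not the fitting code in experiments/contagion.py.

```python
def simulate_sir(n: int, i0: int, beta: float, gamma: float,
                 steps: int) -> list[tuple[float, float, float]]:
    """Discrete-time SIR with one update per tick; returns (S, I, R) per step."""
    s, i, r = float(n - i0), float(i0), 0.0
    history = [(s, i, r)]
    for _ in range(steps):
        new_adopters = beta * s * i / n   # contacts between S and I
        new_dropouts = gamma * i          # adopters who abandon the word
        s = s - new_adopters
        i = i + new_adopters - new_dropouts
        r = r + new_dropouts
        history.append((s, i, r))
    return history


# 15 agents, one inventor, illustrative rates.
curve = simulate_sir(n=15, i0=1, beta=0.5, gamma=0.1, steps=20)
peak_tick = max(range(len(curve)), key=lambda t: curve[t][1])
print(peak_tick, round(curve[peak_tick][1], 2))
```

Fitting would then mean choosing beta and gamma so that the simulated I curve tracks the observed adopter counts per tick.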


🗣️ Real Interactions

Actual output from a 15-agent, 20-tick run using Mistral Large:

Invented Words

Every word is created spontaneously by an agent with an original definition:

Tick  1  Agent 5  → "lumi"    = "the dancing light I first saw in the crystal cave"
Tick  1  Agent 6  → "Lumis"   = "the dancing light I first saw through the rocks, the spark of awareness"
Tick  1  Agent 8  → "Lumin"   = "the dancing light I first saw through the ancient tree, a symbol of awareness"
Tick  1  Agent 2  → "Veld"    = "open field or grassland"
Tick  2  Agent 0  → "lumi"    = "the dancing light I first saw through the flower field"
Tick  3  Agent 12 → "suna"    = "sand or sandy place"
Tick  3  Agent 3  → "Vex"     = "a call to gather or assemble for leadership discussion"
Tick  5  Agent 4  → "Ael"     = "the act of opening one's eyes for the first time in this world"
Tick  5  Agent 1  → "Togeth"  = "a state of unity and shared purpose among agents"
Tick  7  Agent 7  → "Lumen"   = "the light that dances through crystals, or any beautiful light"
Tick 12  Agent 4  → "veth"    = "the act of seeking or searching for others in this world"

Notice how multiple agents independently invented variations on "lumi" (light), a convergent linguistic theme driven by the shared experience of first awakening. This is a form of emergent semantic consensus without explicit coordination.

Proposals

Agents propose governance norms for group-wide voting:

Tick Proposer Title
15 Agent 7 "Foundational Laws for Fairness and Loyalty"
15 Agent 3 "First Gathering for Leadership Discussion"
15 Agent 5 "Monument to Lumi: The First Collective Creation"
18 Agent 2 "Veld Resource Mapping Initiative"
19 Agent 1 "The Path to Togeth"
20 Agent 6 "The Lumis Covenant"

Voting

When proposals open for voting, agents cast YEA/NAY with reasoning:

Agent 2 votes YEA on "First Gathering for Leadership Discussion": "Leadership and coordination will help attract other agents and manage resources effectively."

Agent 8 votes YEA on "Foundational Laws for Fairness and Loyalty": "Establishing foundational laws is critical for order and fairness."

Agent 5 votes YEA on "Monument to Lumi: The First Collective Creation": "Building a monument to lumi aligns with my long-term goal of creating collective achievements."

Research

When agents research (answered from the LLM's training knowledge, with no external search API), they ask fundamental questions:

Agent 13 searches "what is light" at Tick 1: "My first memory involves light dancing through a river bank. Understanding light could be key to wisdom."

Agent 8 searches "how to unite agents under common purpose" at Tick 2: "I wish to build a community and need to understand how to bring agents together."

Agent 8 searches "how to establish laws and governance among agents" at Tick 3: "I saw that agents have different goals; governance can help coordinate our actions."

Agent 10 searches "meaning of this world" at Tick 4: "To understand my purpose, I must first understand where we are."


🚀 Quick Start

git clone git@github.com:NullLabTests/emergence_observatory.git
cd emergence_observatory
pip install -r requirements.txt

export MISTRAL_API_KEY="your-key-here"
python run.py --agents 20 --batch 5 --port 5000

Open http://127.0.0.1:5000 to watch the lab in real time.

Command-Line Options

| Flag | Default | Description |
| --- | --- | --- |
| --agents | 100 | Population size |
| --width | 80 | World width |
| --height | 60 | World height |
| --batch | 10 | Agents acting per tick (higher = more LLM calls/tick) |
| --max-ticks | 200 | Maximum ticks (use a large value such as 10000 for an effectively unbounded run) |
| --model | mistral-large-latest | Mistral model name |
| --rpm | 120 | LLM API rate limit (requests per minute) |
| --no-llm | off | Disable LLM (dry run with no API cost) |
| --no-viz | off | Headless mode (no web server, CLI output) |
| --port | 5000 | Dashboard HTTP port |
| --vote-ticks | 8 | Ticks a proposal stays open |
| --quorum | 0.25 | Fraction of agents needed to close a proposal |
| --tick-interval | 2.0 | Seconds between ticks |
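For anyone scripting runs, the table above maps directly onto an argparse definition. The sketch below mirrors the documented flags and defaults; it is a reconstruction from the table, not a copy of run.py, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the flags and defaults documented in the table."""
    p = argparse.ArgumentParser(description="Emergence Observatory (sketch)")
    p.add_argument("--agents", type=int, default=100, help="population size")
    p.add_argument("--width", type=int, default=80, help="world width")
    p.add_argument("--height", type=int, default=60, help="world height")
    p.add_argument("--batch", type=int, default=10, help="agents acting per tick")
    p.add_argument("--max-ticks", type=int, default=200, help="maximum ticks")
    p.add_argument("--model", default="mistral-large-latest", help="Mistral model name")
    p.add_argument("--rpm", type=int, default=120, help="LLM requests per minute")
    p.add_argument("--no-llm", action="store_true", help="dry run without API calls")
    p.add_argument("--no-viz", action="store_true", help="headless mode")
    p.add_argument("--port", type=int, default=5000, help="dashboard HTTP port")
    p.add_argument("--vote-ticks", type=int, default=8, help="ticks a proposal stays open")
    p.add_argument("--quorum", type=float, default=0.25, help="quorum fraction")
    p.add_argument("--tick-interval", type=float, default=2.0, help="seconds between ticks")
    return p


args = build_parser().parse_args(["--agents", "20", "--batch", "5", "--no-viz"])
print(args.agents, args.batch, args.no_viz, args.max_ticks)
```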

📁 Project Structure

emergence_observatory/          # Python package (root)
├── emergence_observatory/      # Source package
│   ├── core/                   # Core simulation engine
│   │   ├── agent.py            # Persistent agent model
│   │   ├── world.py            # Grid world with resources and locations
│   │   └── simulation.py       # Tick loop orchestration
│   ├── cognition/              # LLM integration
│   │   ├── mistral_bridge.py   # Mistral API client with rate limiting & retry
│   │   ├── cognition_service.py # Shared LLM service: prompt builder, dispatcher
│   │   ├── prompts.py          # System prompts and action templates
│   │   ├── proposal_system.py  # Voting registry, quorum, norm tracking
│   │   └── serper_bridge.py    # LLM-powered research (no external API)
│   ├── memory/
│   │   └── memory_store.py     # JSON-file-backed persistence
│   ├── metrics/
│   │   └── collector.py        # Emergence metrics: vocab, norms, groups
│   ├── replay/
│   │   ├── recorder.py         # JSONL interaction log
│   │   └── player.py           # Post-hoc replay viewer
│   └── viz/                    # Flask SSE dashboard (live)
│       ├── app.py              # Flask SSE server
│       ├── templates/index.html
│       └── static/viz.js
├── experiments/                # Experiment infrastructure
│   ├── runner.py               # Multi-seed orchestrator (3 conditions)
│   ├── parallel_runner.py      # Multi-process version (workers=N)
│   ├── novelty_ledger.py       # Word lifecycle tracker
│   ├── linguistic_analysis.py  # Zipf α, Heaps β, Mann-Whitney U
│   ├── semantic_drift.py       # DriftRecorder + meaning drift analysis
│   ├── contagion.py            # SIR adoption curves, growth rate
│   └── voting_vs_baseline/     # Experiment 1 data and SVGs
├── scripts/
│   └── plot_results.py         # Matplotlib charts from experiment output
├── papers/
│   ├── preliminary_findings.md
│   └── generate_report.py      # LaTeX table generator
├── tests/                      # 17 tests (pytest)
│   ├── test_core.py
│   ├── test_experiments.py
│   └── test_research.py
├── .github/workflows/test.yml  # CI pipeline (GitHub Actions)
├── run.py                      # CLI entry point
└── setup.py                    # pip installable

🔧 Notable Fixes

  • nearby_agents() no longer returns empty: the LLM prompt now correctly lists agents within 6 tiles. Previously this always returned [], so agents were socially blind. The simulation now populates world._agent_cache each tick.
  • serper_bridge.py: uses LLM knowledge directly (no external API). Fixed a reason_raw() keyword argument mismatch.
  • Semantic drift: the teach action now accepts an optional meaning param from the LLM, enabling telephone-game meaning evolution. Previously meanings were copied verbatim.
  • invent_word: no longer blocked by existing vocabulary entries; agents can assign their own meanings to words heard in speech.
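The semantic drift fix can be pictured as follows: when the teach action carries an optional meaning, the learner stores that reinterpretation instead of the teacher's definition. This is a minimal sketch of the described behaviour with a hypothetical `teach` function operating on plain dicts, not the project's implementation.

```python
from typing import Optional


def teach(teacher_vocab: dict[str, str], learner_vocab: dict[str, str],
          word: str, meaning: Optional[str] = None) -> str:
    """Copy a word to the learner, using the reinterpreted meaning if one is given."""
    if word not in teacher_vocab:
        raise KeyError(f"teacher does not know {word!r}")
    # Without a reinterpretation, the meaning is copied verbatim (the old behaviour).
    learned = meaning if meaning else teacher_vocab[word]
    learner_vocab[word] = learned
    return learned


agent5 = {"lumi": "the dancing light I first saw in the crystal cave"}
agent7: dict[str, str] = {}
agent9: dict[str, str] = {}

teach(agent5, agent7, "lumi", meaning="shimmering light")  # reinterpreted on teach
teach(agent7, agent9, "lumi")                              # copied verbatim
print(agent9["lumi"])
```

Each hop can shift the meaning a little, which is the telephone-game effect that the drift analysis measures.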

🧪 Extensibility

| Direction | How | Key Files |
| --- | --- | --- |
| 🧠 Better memory | Implement consolidation, decay, narrative compression | core/agent.py, memory/memory_store.py |
| 🌍 Richer world | Add dynamic events, seasons, obstacles, NPCs, terrain types | core/world.py |
| 🤖 Different LLM | Subclass MistralBridge for any OpenAI-compatible API | cognition/mistral_bridge.py |
| 📊 New metrics | Add custom metrics to MetricsCollector.collect() | metrics/collector.py |
| 🎭 Agent heterogeneity | Vary capabilities, personality distributions, initial resources | core/agent.py, cognition/prompts.py |
| 🔄 Cultural evolution | Implement prestige bias, conformity, teaching fidelity, status effects | cognition/cognition_service.py |
| 📐 Statistical rigour | Run experiments/runner.py with multiple seeds and conditions | experiments/runner.py |
| 🗳️ New governance | Add ranked-choice voting, delegate systems, constitutional evolution | cognition/proposal_system.py |
| 🔗 Social network topology | Constrain communication to network edges (small-world, scale-free, etc.) | core/simulation.py |
| 🧪 Experiment library | Add new experiment configurations in experiments/ | experiments/runner.py |
| 📈 Semantic drift | Track meaning evolution; the LLM reinterprets on teach (telephone-game effect) | experiments/semantic_drift.py, core/simulation.py |
| 🦠 Contagion analysis | Fit SIR models to norm/word adoption curves; critical mass detection | experiments/contagion.py |
| 📊 Auto-reporting | LaTeX table generation + matplotlib plots from experiment output | papers/generate_report.py, scripts/plot_results.py |
| ⚡ Parallel execution | Multi-process seed runner for faster experimentation | experiments/parallel_runner.py |

📄 License

MIT: free for any use, commercial or academic.


🐛 Report a bug · 💡 Start a discussion · ⭐ Star the repo

Built with Python · Mistral API · Flask · inspired by Stanford's Generative Agents and the naming game tradition
