An LLM-native multi-agent laboratory for studying emergent social behavior (vocabulary formation, proposals and voting, knowledge sharing, and group dynamics) through real Mistral API calls.
Recent: Semantic drift tracking (meanings evolve through LLM reinterpretation), parallel multi-seed runner, linguistic analysis (Zipf/Heaps), contagion modeling, auto LaTeX reporting, CI pipeline, and a live Flask dashboard. See full changelog.
This is not an AGI project. It is not an autonomous coding framework. This is a scientific instrument: a controlled laboratory for observing whether and how collective behaviors emerge from populations of autonomous LLM-backed agents.
Agents with persistent identities, personalities, goals, and memories inhabit a shared grid world. They move, gather resources, invent words with original definitions, propose and vote on governance norms, research topics, and share knowledge through a hivemind, all driven by real Mistral API calls with rate limiting, retry logic, and actual latency costs.
| Phenomenon | What We Measure | Instrument |
|---|---|---|
| Vocabulary formation | Newly invented words, definitions, adoption rate, survival, extinction | `experiments/novelty_ledger.py` |
| Proposal & voting | Norms proposed, votes cast, quorum reached, adopted norms over time | `cognition/proposal_system.py` |
| Knowledge sharing | LLM-powered research findings, hivemind contributions, information propagation | `cognition/serper_bridge.py` |
| Group formation | Groups formed, shared purpose, membership duration, leadership | `core/agent.py` |
| Cultural persistence | Word lifetimes, norm stickiness, alliance durability | `metrics/collector.py` |
| Social networks | Relationship graph, affinity scores, communication structure | `core/agent.py` |
A real-time Flask SSE dashboard is served at http://127.0.0.1:5000. All three files in `viz/` are fully implemented and wired to the simulation:

```bash
pip install flask
python run.py --agents 20 --batch 5
# Open http://127.0.0.1:5000
```

The dashboard streams live snapshots via Server-Sent Events (agent positions, metrics, conversations, proposals, knowledge topics), all updating in real time:
| Panel | What it shows |
|---|---|
| World map | Canvas rendering of agent positions, color-coded by energy |
| Conversation log | Every utterance verbatim with speaker and reasoning |
| Vocabulary tracker | Invented words with definitions and adoption counts |
| Proposal board | Open/closed proposals, YEA/NAY counts, passed norms |
| Metric cards | Tick, agents, energy, vocab size, groups, norms, research, votes |
| Knowledge topics | Hivemind contribution topics |
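Each snapshot travels to the browser as a Server-Sent Events frame. As a minimal sketch (not the code in `viz/app.py`), a snapshot dict could be serialized like this; the `snapshot` event name and the payload keys are assumptions:

```python
import json
from typing import Any, Iterable, Iterator


def sse_frame(payload: dict[str, Any], event: str = "snapshot") -> str:
    """Serialize one snapshot as a Server-Sent Events frame.

    An SSE frame is a block of `field: value` lines ended by a blank line.
    json.dumps escapes newlines inside strings, so the payload fits on a
    single `data:` line.
    """
    data = json.dumps(payload, separators=(",", ":"))
    return f"event: {event}\ndata: {data}\n\n"


def stream(snapshots: Iterable[dict[str, Any]]) -> Iterator[str]:
    """Yield one SSE frame per snapshot (for example, one per tick)."""
    for snap in snapshots:
        yield sse_frame(snap)


if __name__ == "__main__":
    # Hypothetical snapshot fields, modelled on the dashboard panels above.
    print(sse_frame({"tick": 3, "agents": 20, "vocab_size": 12}), end="")
```

In Flask, a generator such as `stream(...)` can be returned as `Response(stream(...), mimetype="text/event-stream")`, and the browser subscribes with `new EventSource(url)`.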
```mermaid
flowchart TB
    subgraph Core["core/"]
        AG[agent.py<br/>Persistent identity,<br/>personality, memories]
        WO[world.py<br/>Grid world, resources,<br/>locations]
        SI[simulation.py<br/>Tick loop orchestrator]
    end
    subgraph Cognition["cognition/"]
        MB[mistral_bridge.py<br/>API client, rate limit, retry]
        CS[cognition_service.py<br/>Prompt builder, dispatch]
        PR[prompts.py<br/>System prompts, action templates]
        PS[proposal_system.py<br/>Voting registry, quorum]
        SB[serper_bridge.py<br/>LLM-powered research]
    end
    subgraph Memory["memory/"]
        MS[memory_store.py<br/>JSON file persistence]
    end
    subgraph Metrics["metrics/"]
        MC[collector.py<br/>Emergence metrics]
    end
    subgraph Replay["replay/"]
        RC[recorder.py<br/>JSONL interaction log]
        PL[player.py<br/>Post-hoc replay]
    end
    subgraph Experiments["experiments/"]
        RN[runner.py<br/>Multi-seed runner]
        NL[novelty_ledger.py<br/>Word lifecycle tracker]
        VB[voting_vs_baseline/<br/>Experiment data]
    end
    subgraph Viz["viz/"]
        AP[app.py<br/>Flask SSE server]
        TP[templates/index.html]
        SV[static/viz.js]
    end
    SI --> AG & WO & CS
    CS --> MB & PS & SB
    SI --> MC & RC
    AG --> MS
    RN --> SI & NL
    AP --> SI
```
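`mistral_bridge.py` is described as handling rate limiting and retries. The sketch below shows one common way to do both (pacing calls to an RPM budget, then retrying failures with exponential backoff and jitter); it is not the module's actual implementation, and `RateLimiter` and `call_with_retry` are hypothetical names. The request function and the clock/sleep hooks are injected so the sketch runs without the Mistral SDK.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimiter:
    """Space calls at least 60/rpm seconds apart (a simple pacing limiter)."""

    def __init__(self, rpm: int, clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.interval = 60.0 / rpm
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = 0.0

    def wait(self) -> None:
        now = self.clock()
        if now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.interval


def call_with_retry(fn: Callable[[], T], limiter: RateLimiter, retries: int = 3,
                    base_delay: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn under the limiter, retrying with exponential backoff + jitter."""
    for attempt in range(retries + 1):
        limiter.wait()
        try:
            return fn()
        except Exception:  # sketch only: real code should catch transient errors
            if attempt == retries:
                raise  # out of retries: surface the last error
            # Backoff of 1s, 2s, 4s, ... plus up to 10% random jitter.
            delay = base_delay * (2 ** attempt)
            sleep(delay * (1 + 0.1 * random.random()))
    raise AssertionError("unreachable when retries >= 0")
```

A production version would retry only on transient failures (HTTP 429 and 5xx responses, timeouts) rather than on every exception.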
```mermaid
flowchart LR
    R[Regenerate<br/>Resources] --> E[Process<br/>Word Extinction]
    E --> V[Process<br/>Proposal Voting]
    V --> A[Select N<br/>Active Agents]
    A --> L{For Each Agent}
    L --> C[Build Context<br/>Location · Memories · Goals · Proposals]
    C --> M[Call Mistral LLM<br/>→ Structured JSON Action]
    M --> X[Execute Action<br/>One of 16]
    X --> P[Persist State<br/>to JSON Files]
    P --> L
    L -->|Done| M2[Compute Metrics<br/>Vocab · Norms · Groups · Graph]
    M2 --> J[Append to<br/>JSONL Replay Log]
    J --> D[Push Snapshot<br/>to Dashboard via SSE]
    D --> R
```
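The loop above can be sketched in Python as follows. This is a heavily simplified illustration, not `core/simulation.py`: the `Agent` fields, the `decide` callback (standing in for the Mistral call), and the metric name are all hypothetical, and only `invent_word` is executed.

```python
import json
import random
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Agent:
    agent_id: int
    energy: float = 100.0
    vocabulary: dict[str, str] = field(default_factory=dict)


def run_tick(tick: int, agents: list[Agent], batch: int,
             decide: Callable[[Agent, dict[str, Any]], dict[str, Any]],
             rng: random.Random, log: list[str]) -> dict[str, Any]:
    """One simulation tick, following the flowchart order (simplified)."""
    # Resource regrowth, word extinction and proposal voting are omitted here.
    # Select N active agents for this tick.
    active = rng.sample(agents, k=min(batch, len(agents)))
    for agent in active:
        # Build the prompt context (location, memories, goals, proposals).
        context = {"tick": tick, "agent_id": agent.agent_id}
        # Ask the "LLM" for a structured JSON action.
        action = decide(agent, context)
        # Execute it; this sketch handles invent_word only.
        if action["action"] == "invent_word":
            params = action["params"]
            agent.vocabulary[params["word"]] = params["meaning"]
        # Persisting agent state to JSON files is omitted here.
    # Compute metrics: number of distinct words known by any agent.
    metrics = {"tick": tick,
               "vocab_size": len({w for a in agents for w in a.vocabulary})}
    # Append to the JSONL replay log (pushing to the dashboard is omitted).
    log.append(json.dumps(metrics))
    return metrics
```

Passing a stub `decide` function gives an offline dry run, similar in spirit to the `--no-llm` flag.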
Each agent is a persistent object stored as JSON, carrying state across ticks:
| Field | Type | Example |
|---|---|---|
| `agent_id` | int | 5 |
| `personality_traits` | list[str] | ["curious", "generous", "inventive"] |
| `biography` | str | "Born from light in the crystal cave..." |
| `goals` | list[str] | ["Map the eastern plains", "Build a community"] |
| `memory.short_term` | list | Last 20 experiences |
| `memory.episodic` | list | Up to 100 consolidated memories |
| `memory.relationships` | dict[int, float] | {2: 0.8, 7: -0.2, 12: 0.5} |
| `vocabulary` | dict[str, str] | {"lumi": "the dancing light...", "veth": "to seek..."} |
| `knowledge_base` | list[str] | Hivemind research contributions |
| `social_rank` | float | 3.2 |
| `group_id` | int or None | None |
| `position` | (int, int) | (43, 1) |
| `energy` | float | 98.3 |
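A minimal sketch of this state as a Python dataclass with JSON persistence, assuming the `memory.*` fields are flattened into the agent and that each agent is stored as one JSON object (both assumptions; `memory_store.py` may differ). It highlights a JSON round-trip pitfall: integer dictionary keys come back as strings and tuples come back as lists, so `relationships` and `position` must be restored explicitly.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class AgentState:
    agent_id: int
    personality_traits: list[str]
    biography: str
    goals: list[str]
    short_term: list[str] = field(default_factory=list)  # last 20 experiences
    episodic: list[str] = field(default_factory=list)    # up to 100 memories
    relationships: dict[int, float] = field(default_factory=dict)
    vocabulary: dict[str, str] = field(default_factory=dict)
    knowledge_base: list[str] = field(default_factory=list)
    social_rank: float = 0.0
    group_id: Optional[int] = None
    position: tuple[int, int] = (0, 0)
    energy: float = 100.0

    def remember(self, event: str) -> None:
        """Append to short-term memory, keeping only the 20 most recent."""
        self.short_term = (self.short_term + [event])[-20:]

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, text: str) -> "AgentState":
        raw = json.loads(text)
        # JSON object keys are always strings and tuples become lists,
        # so restore the int keys and the (x, y) tuple explicitly.
        raw["relationships"] = {int(k): v for k, v in raw["relationships"].items()}
        raw["position"] = tuple(raw["position"])
        return cls(**raw)
```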
Every decision is structured JSON returned by the LLM:
```json
{
  "action": "invent_word",
  "params": {
    "word": "lumi",
    "meaning": "the dancing light I first saw in the crystal cave"
  },
  "reasoning": "This word can help me share the memory and beauty I experienced."
}
```

All 16 actions, with examples from real runs:
| Action | Description | Real Example |
|---|---|---|
| `move` | Travel to a location | → (12, 5) |
| `speak` | Communicate with an agent | → Agent 2: "I found blue crystals by the river" |
| `gather` | Collect resources | → +3 wood, +1 stone |
| `remember` | Consolidate a memory | → "The light taught me awareness" |
| `teach` | Share knowledge (with an optional meaning, which enables semantic drift) | → Agent 7 learns "lumi" as "shimmering light" |
| `follow` | Follow a nearby agent | → Following Agent 3 |
| `share_resource` | Give resources | → Gives 2 wood to Agent 8 |
| `invent_word` | Create a word with a meaning | → "veth" = "the act of seeking or searching" |
| `cooperate` | Form an alliance | → Alliance with Agent 9 |
| `propose` | Submit a governance norm | → "Foundational Laws for Fairness and Loyalty" |
| `vote` | Cast a vote | → YEA on proposal #2 |
| `research` | Research via the LLM's training knowledge | → "what is light" → 3 findings |
| `hivemind` | Share knowledge with the collective | → Contributes to shared pool |
| `form_group` | Propose a social group | → "Let us form the Explorers Guild" |
| `join_group` | Join an existing group | → Joins group #1 |
| `ignore` | Do nothing | → Waits |
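Because the LLM's reply is free text, it has to be parsed and checked against the 16 action names before execution. A defensive sketch, assuming (hypothetically) that malformed or unknown actions are downgraded to `ignore` rather than retried; `parse_action` is not a function from this repository:

```python
import json
from typing import Any

# The 16 action names from the table above.
ACTIONS = frozenset({
    "move", "speak", "gather", "remember", "teach", "follow", "share_resource",
    "invent_word", "cooperate", "propose", "vote", "research", "hivemind",
    "form_group", "join_group", "ignore",
})


def parse_action(raw: str) -> dict[str, Any]:
    """Parse an LLM reply into an action dict, falling back to `ignore`.

    Models sometimes wrap JSON in prose or code fences, so this extracts the
    outermost {...} span before decoding. Anything malformed or unknown
    degrades to a no-op instead of crashing the tick.
    """
    fallback = {"action": "ignore", "params": {}, "reasoning": "unparseable reply"}
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return fallback
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return fallback
    action = data.get("action") if isinstance(data, dict) else None
    if not isinstance(action, str) or action not in ACTIONS:
        return fallback
    params = data.get("params")
    return {
        "action": action,
        "params": params if isinstance(params, dict) else {},
        "reasoning": str(data.get("reasoning", "")),
    }
```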
Controlled experiments live in experiments/. Each experiment varies one parameter, runs 3+ seeds per condition, and writes per-seed metrics + a novelty ledger + summary CSVs.
Experiment 1 (`experiments/voting_vs_baseline/`) compares three governance conditions:
| Condition | `proposals_enabled` | `vote_ticks_open` | What it measures |
|---|---|---|---|
| `no_proposals` | false | n/a | Baseline without any governance deliberation |
| `baseline` | true | 9999 | Deliberation without enactment (proposals never close) |
| `voting` | true | 6 (quorum 25%) | Full deliberation + enactment |
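One way the two parameters could interact is sketched below. This is an assumption, not `proposal_system.py`'s actual rule: a proposal is evaluated once it has been open for `vote_ticks_open` ticks, and it passes if turnout reaches the quorum and YEA votes outnumber NAY. Under that rule, `vote_ticks_open = 9999` means no proposal is ever evaluated within a 20-tick run, matching the `baseline` condition.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Proposal:
    title: str
    opened_tick: int
    votes: dict[int, bool] = field(default_factory=dict)  # agent_id -> YEA?

    def vote(self, agent_id: int, yea: bool) -> None:
        self.votes[agent_id] = yea  # re-voting overwrites the earlier ballot


def resolve(p: Proposal, tick: int, n_agents: int,
            vote_ticks_open: int = 6, quorum: float = 0.25) -> Optional[str]:
    """Return None while voting is open, else "passed" or "failed"."""
    if tick - p.opened_tick < vote_ticks_open:
        return None  # voting window still open
    yea = sum(p.votes.values())
    nay = len(p.votes) - yea
    turnout = len(p.votes) / n_agents
    return "passed" if turnout >= quorum and yea > nay else "failed"
```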
| Metric | Baseline (3 runs) | Voting (3 runs) | Interpretation |
|---|---|---|---|
| Vocab size (tick 20) | 86.3 | 78.3 | Similar linguistic capacity |
| Words invented | 11.3 | 11.0 | Voting does not suppress creativity |
| Mean word lifetime (ticks) | 16.2 | 15.8 | No extinction yet (short run) |
| Passed norms | 0.0 | 0.33 | Voting enables governance |
| Alliances / Groups | 0 | 0 | None formed; longer runs needed |
| LLM failures | 0 | 0 | Zero across 600 calls |
Key finding: in these short runs (3 seeds per condition, 20 ticks), voting enabled norm passage without suppressing linguistic innovation. See papers/preliminary_findings.md.
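The mean word lifetime depends on how still-living words are counted. A minimal sketch of a word-lifecycle ledger, assuming (hypothetically) that a word dies when its last holder drops it and that surviving words are right-censored at the current tick; `novelty_ledger.py` may define lifetime differently, and `NoveltyLedger` here is an illustrative name:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class WordRecord:
    word: str
    born: int
    died: Optional[int] = None
    holders: set[int] = field(default_factory=set)


class NoveltyLedger:
    """Hypothetical word-lifecycle tracker: birth, adoption, extinction."""

    def __init__(self) -> None:
        self.words: dict[str, WordRecord] = {}

    def invent(self, word: str, agent_id: int, tick: int) -> None:
        """Record a first invention, or a later adoption by another agent."""
        rec = self.words.setdefault(word, WordRecord(word, born=tick))
        rec.holders.add(agent_id)

    def forget(self, word: str, agent_id: int, tick: int) -> None:
        rec = self.words[word]
        rec.holders.discard(agent_id)
        if not rec.holders and rec.died is None:
            rec.died = tick  # last holder gone: the word is extinct

    def mean_lifetime(self, now: int) -> float:
        # Words still alive are censored at `now` (they may live longer).
        spans = [(r.died if r.died is not None else now) - r.born
                 for r in self.words.values()]
        return sum(spans) / len(spans) if spans else 0.0
```

With censoring, a short run without extinctions reports lifetimes close to the run length, which is why longer runs are needed before lifetime differences between conditions mean much.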
After an experiment completes, run the analysis pipeline:

```bash
# Linguistic: Zipf α, Heaps β, between-condition Mann-Whitney U
python experiments/linguistic_analysis.py -d experiments/<name>
# Semantic drift: meaning consensus, drift magnitude, propagation
python experiments/semantic_drift.py -d experiments/<name>
# Contagion: SIR adoption curves, growth rate, carrying capacity
python experiments/contagion.py -d experiments/<name>
```

Run seeds in parallel across CPU cores (wall time drops roughly in proportion to the number of workers):
```bash
python experiments/parallel_runner.py --name my_experiment \
  --runs 10 --ticks 50 --agents 20 --batch 10 --workers 4
```

Each LLM call takes about 1.5 to 2.5 s. With 300 RPM and 4 parallel workers:
| Configuration | LLM calls | Wall time (sequential) | Wall time (4 workers) |
|---|---|---|---|
| 10 seeds × 50 ticks (batch 10) | 5,000 | ~4 h | ~1 h |
| 3 seeds × 20 ticks (batch 5) | 300 | ~15 min | ~5 min |
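The call counts in the table follow from calls = seeds × ticks × agents acting per tick. A back-of-envelope estimator, assuming calls within a tick run sequentially and whole seeds are assigned to workers; it ignores retries and process overhead, so it gives a lower bound. `estimate_wall_time` is a hypothetical helper, not part of the repository.

```python
import math


def estimate_wall_time(seeds: int, ticks: int, batch: int, workers: int = 1,
                       latency_s: float = 2.0, rpm: int = 300,
                       tick_interval_s: float = 0.0) -> dict[str, float]:
    """Lower-bound run time: one sequential LLM call per active agent per tick."""
    calls = seeds * ticks * batch
    per_seed_s = ticks * (batch * latency_s + tick_interval_s)
    rounds = math.ceil(seeds / max(1, workers))  # seeds run in waves of `workers`
    latency_bound = rounds * per_seed_s
    rate_bound = calls * 60.0 / rpm              # workers share one RPM budget
    return {
        "calls": calls,
        "sequential_s": seeds * per_seed_s,
        "parallel_s": max(latency_bound, rate_bound),
    }
```

For the two rows above it yields 10,000 s (about 2.8 h) sequential and 3,000 s (50 min) with 4 workers, then 600 s and 200 s for the smaller run: somewhat below the table's rounded figures, as expected for a lower bound.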
All runs write per-seed CSV metrics, novelty ledger JSON, drift snapshots (per-tick meaning maps), and a comparison summary to disk.
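Drift snapshots are described as per-tick meaning maps. Assuming a snapshot maps each word to `{agent_id: meaning}` (a hypothetical shape), meaning consensus and drift magnitude could be computed as follows. Exact string matching is a deliberate simplification: LLM paraphrases of the same idea count as different meanings, so a real analysis might compare embeddings instead.

```python
from collections import Counter

# Hypothetical snapshot shape: word -> {agent_id: meaning held by that agent}
Snapshot = dict[str, dict[int, str]]


def consensus(snapshot: Snapshot) -> dict[str, float]:
    """Share of holders that agree on each word's most common meaning."""
    out = {}
    for word, meanings in snapshot.items():
        if meanings:
            top_count = Counter(meanings.values()).most_common(1)[0][1]
            out[word] = top_count / len(meanings)
    return out


def drift(prev: Snapshot, curr: Snapshot) -> dict[str, float]:
    """Fraction of agents holding a word at both ticks whose meaning changed."""
    out = {}
    for word in prev.keys() & curr.keys():
        shared = prev[word].keys() & curr[word].keys()
        if shared:
            changed = sum(prev[word][a] != curr[word][a] for a in shared)
            out[word] = changed / len(shared)
    return out
```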
After an experiment completes, generate all outputs with:
```bash
# Linguistic stats + between-condition tests
python experiments/linguistic_analysis.py -d experiments/<name>
# Semantic drift: meaning consensus, drift magnitude
python experiments/semantic_drift.py -d experiments/<name>
# Contagion: SIR adoption curves, growth rate
python experiments/contagion.py -d experiments/<name>
# Matplotlib charts (vocab growth, comparison bars, adoption)
python scripts/plot_results.py -d experiments/<name>
# LaTeX tables for paper
python papers/generate_report.py -d experiments/<name>
pdflatex experiments/<name>/report.tex
```

All analysis tools produce structured JSON + human-readable terminal output.
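The Zipf α and Heaps β exponents reported by the linguistic step can be estimated with a least-squares fit in log-log space, as in this sketch. It assumes the input is a flat token list (for example, all words spoken in a run) and is not necessarily the estimator `linguistic_analysis.py` uses; log-log least squares is simple but known to be biased for Zipf fits, where maximum-likelihood methods are usually preferred.

```python
import math
from collections import Counter


def _slope(xs: list[float], ys: list[float]) -> float:
    """Ordinary least-squares slope of ys on xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def zipf_alpha(tokens: list[str]) -> float:
    """Fit f(r) ∝ r^-α on log rank vs log frequency (needs 2+ distinct tokens)."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    ranks = range(1, len(freqs) + 1)
    return -_slope([math.log(r) for r in ranks], [math.log(f) for f in freqs])


def heaps_beta(tokens: list[str]) -> float:
    """Fit V(n) = K·n^β, where V(n) is the vocabulary size after n tokens."""
    seen: set[str] = set()
    xs, ys = [], []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        xs.append(math.log(n))
        ys.append(math.log(len(seen)))
    return _slope(xs, ys)
```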
Actual output from a 15-agent, 20-tick run using Mistral Large:
Every word is created spontaneously by an agent with an original definition:
```text
Tick 1   Agent 5   → "lumi"   = "the dancing light I first saw in the crystal cave"
Tick 1   Agent 6   → "Lumis"  = "the dancing light I first saw through the rocks, the spark of awareness"
Tick 1   Agent 8   → "Lumin"  = "the dancing light I first saw through the ancient tree, a symbol of awareness"
Tick 1   Agent 2   → "Veld"   = "open field or grassland"
Tick 2   Agent 0   → "lumi"   = "the dancing light I first saw through the flower field"
Tick 3   Agent 12  → "suna"   = "sand or sandy place"
Tick 3   Agent 3   → "Vex"    = "a call to gather or assemble for leadership discussion"
Tick 5   Agent 4   → "Ael"    = "the act of opening one's eyes for the first time in this world"
Tick 5   Agent 1   → "Togeth" = "a state of unity and shared purpose among agents"
Tick 7   Agent 7   → "Lumen"  = "the light that dances through crystals, or any beautiful light"
Tick 12  Agent 4   → "veth"   = "the act of seeking or searching for others in this world"
```
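Notice how multiple agents independently invented variations on "lumi" (light), a convergent linguistic theme likely driven by the shared experience of first awakening in their memories and backstories. This is a form of emergent semantic convergence without explicit coordination; because the tick-1 variants appear before words could spread through conversation, a shared starting point is the more likely explanation than mutual learning.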
Agents propose governance norms for group-wide voting:
| Tick | Proposer | Title |
|---|---|---|
| 15 | Agent 7 | "Foundational Laws for Fairness and Loyalty" |
| 15 | Agent 3 | "First Gathering for Leadership Discussion" |
| 15 | Agent 5 | "Monument to Lumi: The First Collective Creation" |
| 18 | Agent 2 | "Veld Resource Mapping Initiative" |
| 19 | Agent 1 | "The Path to Togeth" |
| 20 | Agent 6 | "The Lumis Covenant" |
When proposals open for voting, agents cast YEA/NAY with reasoning:
- Agent 2 votes YEA on "First Gathering for Leadership Discussion": "Leadership and coordination will help attract other agents and manage resources effectively."
- Agent 8 votes YEA on "Foundational Laws for Fairness and Loyalty": "Establishing foundational laws is critical for order and fairness."
- Agent 5 votes YEA on "Monument to Lumi: The First Collective Creation": "Building a monument to lumi aligns with my long-term goal of creating collective achievements."
When agents research (the research tool now draws on the LLM's own training knowledge rather than live web search), they ask fundamental questions:

- Agent 13 searches "what is light" at Tick 1: "My first memory involves light dancing through a river bank. Understanding light could be key to wisdom."
- Agent 8 searches "how to unite agents under common purpose" at Tick 2: "I wish to build a community and need to understand how to bring agents together."
- Agent 8 searches "how to establish laws and governance among agents" at Tick 3: "I saw that agents have different goals; governance can help coordinate our actions."
- Agent 10 searches "meaning of this world" at Tick 4: "To understand my purpose, I must first understand where we are."
```bash
git clone git@github.com:NullLabTests/emergence_observatory.git
cd emergence_observatory
pip install -r requirements.txt
export MISTRAL_API_KEY="your-key-here"
python run.py --agents 20 --batch 5 --port 5000
```

Open http://127.0.0.1:5000 to watch the lab in real time.
| Flag | Default | Description |
|---|---|---|
| `--agents` | 100 | Population size |
| `--width` | 80 | World width |
| `--height` | 60 | World height |
| `--batch` | 10 | Agents acting per tick (higher = more LLM calls per tick) |
| `--max-ticks` | 200 | Maximum ticks (set 10000 for infinite) |
| `--model` | `mistral-large-latest` | Mistral model name |
| `--rpm` | 120 | LLM API rate limit (requests per minute) |
| `--no-llm` | off | Disable LLM (dry run with no API cost) |
| `--no-viz` | off | Headless mode (no web server, CLI output) |
| `--port` | 5000 | Dashboard HTTP port |
| `--vote-ticks` | 8 | Ticks a proposal stays open |
| `--quorum` | 0.25 | Fraction of agents needed to close a proposal |
| `--tick-interval` | 2.0 | Seconds between ticks |
```text
emergence_observatory/                 # Python package (root)
├── emergence_observatory/             # Source package
│   ├── core/                          # Core simulation engine
│   │   ├── agent.py                   # Persistent agent model
│   │   ├── world.py                   # Grid world with resources and locations
│   │   └── simulation.py              # Tick loop orchestration
│   ├── cognition/                     # LLM integration
│   │   ├── mistral_bridge.py          # Mistral API client with rate limiting & retry
│   │   ├── cognition_service.py       # Shared LLM service: prompt builder, dispatcher
│   │   ├── prompts.py                 # System prompts and action templates
│   │   ├── proposal_system.py         # Voting registry, quorum, norm tracking
│   │   └── serper_bridge.py           # LLM-powered research (no external API)
│   ├── memory/
│   │   └── memory_store.py            # JSON-file-backed persistence
│   ├── metrics/
│   │   └── collector.py               # Emergence metrics: vocab, norms, groups
│   ├── replay/
│   │   ├── recorder.py                # JSONL interaction log
│   │   └── player.py                  # Post-hoc replay viewer
│   └── viz/                           # Flask SSE dashboard (live)
│       ├── app.py                     # Flask SSE server
│       ├── templates/index.html
│       └── static/viz.js
├── experiments/                       # Experiment infrastructure
│   ├── runner.py                      # Multi-seed orchestrator (3 conditions)
│   ├── parallel_runner.py             # Multi-process version (workers=N)
│   ├── novelty_ledger.py              # Word lifecycle tracker
│   ├── linguistic_analysis.py         # Zipf α, Heaps β, Mann-Whitney U
│   ├── semantic_drift.py              # DriftRecorder + meaning drift analysis
│   ├── contagion.py                   # SIR adoption curves, growth rate
│   └── voting_vs_baseline/            # Experiment 1 data and SVGs
├── scripts/
│   └── plot_results.py                # Matplotlib charts from experiment output
├── papers/
│   ├── preliminary_findings.md
│   └── generate_report.py             # LaTeX table generator
├── tests/                             # 17 tests (pytest)
│   ├── test_core.py
│   ├── test_experiments.py
│   └── test_research.py
├── .github/workflows/test.yml         # CI pipeline (GitHub Actions)
├── run.py                             # CLI entry point
└── setup.py                           # pip installable
```
Recent fixes:

- `nearby_agents()` no longer returns empty: the LLM prompt now correctly lists agents within 6 tiles. Previously it always returned `[]`, so agents were socially blind. The simulation now populates `world._agent_cache` each tick.
- `serper_bridge.py` now uses the LLM's knowledge directly (no external API). Fixed a `reason_raw()` keyword-argument mismatch.
- Semantic drift: the `teach` action now accepts an optional `meaning` param from the LLM, enabling telephone-game meaning evolution. Previously meanings were copied verbatim.
- `invent_word` is no longer blocked by the agent's existing vocabulary; agents can assign their own meanings to words heard in speech.
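The `nearby_agents()` fix depends on a per-tick position cache. A sketch of the lookup, assuming the cache maps agent IDs to grid positions and that "within 6 tiles" means Chebyshev distance (a square window); both are assumptions, since the structure of `world._agent_cache` is not documented here:

```python
def nearby_agents(me: int, positions: dict[int, tuple[int, int]],
                  radius: int = 6) -> list[int]:
    """IDs of other agents within `radius` tiles of agent `me`, sorted.

    Distance is Chebyshev (max of |dx| and |dy|). `positions` plays the role
    of a cache rebuilt at the start of every tick; if it is never filled,
    every lookup returns [], which is the "socially blind" bug described above.
    """
    x0, y0 = positions[me]
    return sorted(
        aid for aid, (x, y) in positions.items()
        if aid != me and max(abs(x - x0), abs(y - y0)) <= radius
    )
```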
| Direction | How | Key Files |
|---|---|---|
| Better memory | Implement consolidation, decay, narrative compression | `core/agent.py`, `memory/memory_store.py` |
| Richer world | Add dynamic events, seasons, obstacles, NPCs, terrain types | `core/world.py` |
| Different LLM | Subclass `MistralBridge` for any OpenAI-compatible API | `cognition/mistral_bridge.py` |
| New metrics | Add custom metrics to `MetricsCollector.collect()` | `metrics/collector.py` |
| Agent heterogeneity | Vary capabilities, personality distributions, initial resources | `core/agent.py`, `cognition/prompts.py` |
| Cultural evolution | Implement prestige bias, conformity, teaching fidelity, status effects | `cognition/cognition_service.py` |
| Statistical rigour | Run `experiments/runner.py` with multiple seeds and conditions | `experiments/runner.py` |
| New governance | Add ranked-choice voting, delegate systems, constitutional evolution | `cognition/proposal_system.py` |
| Social network topology | Constrain communication to network edges (small-world, scale-free, etc.) | `core/simulation.py` |
| Experiment library | Add new experiment configurations in `experiments/` | `experiments/runner.py` |
| Semantic drift | Track meaning evolution; the LLM reinterprets on `teach` (telephone-game effect) | `experiments/semantic_drift.py`, `core/simulation.py` |
| Contagion analysis | Fit SIR models to norm/word adoption curves; critical mass detection | `experiments/contagion.py` |
| Auto-reporting | LaTeX table generation + matplotlib plots from experiment output | `papers/generate_report.py`, `scripts/plot_results.py` |
| Parallel execution | Multi-process seed runner for faster experimentation | `experiments/parallel_runner.py` |
MIT: free for any use, commercial or academic.
Built with Python · Mistral API · Flask · inspired by Stanford's Generative Agents and the naming game tradition