Name	Name	Last commit message	Last commit date
parent directory ..
prompts	prompts
starter	starter
submissions	submissions
CHALLENGE.md	CHALLENGE.md
README.md	README.md
evaluate.py	evaluate.py
setup.py	setup.py

Text to SQL - Execution Accuracy 0.650 → 0.690 (+6%)

A pattern for narrowing the gap on structurally-sound baselines through candidate voting, demonstrated on SQL generation across healthcare, finance, retail, and logistics schemas. The same shape applies to any task where single-pass generation is already close to ceiling and the remaining errors are stochastic, not systematic.

Why the last 6% is hard

Text-to-SQL with a good prompt and few-shot examples already gets most queries right. The remaining failures share a property:

Stochastic errors, not systematic ones. The model knows the right SQL pattern but occasionally picks the wrong join column, misreads a filter condition, or chooses a subtly wrong aggregation. More prompt engineering doesn't fix errors that aren't consistent.

The instance here: 100 natural language questions over database schemas ranging from single-table lookups to multi-table joins with subqueries and window functions, scored by execution accuracy (run both queries, compare result sets). The baseline agent -- five fixed few-shot examples, detailed SQLite-specific instructions, temperature 0 -- scored 0.650.

Paper Lantern surveyed 96 papers and surfaced three that address stochastic errors through candidate generation and voting: SQL-PaLM showed +4.9% from execution-based consistency decoding, MCS-SQL reached 89.6% on Spider via multi-prompt voting, and PET-SQL hit 87.6% via cross-consistency across low-temperature runs. The PL agent tested five approaches before finding that self-consistency voting -- 7 candidates at temp=0.8, majority vote after SQL normalization -- moved accuracy to 0.690.

Baseline: few-shot prompting, one pass

prompt = build_prompt(
    question=question,
    db_schema=db_schema,                    # CREATE TABLE statements
    sample_rows=sample_rows,                # INSERT statements
    few_shot=five_curated_examples,         # diverse SQL patterns
)
answer = gemini.generate(prompt, temperature=0)
sql = extract_sql(answer)

One call per question, temperature 0, five hand-picked examples covering WHERE filters, GROUP BY, JOINs, percentage calculations, and correlated subqueries. Detailed system prompt with SQLite-specific rules. This is a solid baseline -- it handles 65 of 100 questions correctly. The failures aren't clustered around a single SQL pattern; they're scattered across question types. That's the signal that the problem is variance, not a missing capability. (baseline/solve.py)

Five approaches before the winner

The PL agent didn't find self-consistency on the first try. It tested five approaches before landing on the winner -- and understanding why each failed is the real takeaway.

Approach	Score	Why it failed
PreSQL + category few-shot + 3 candidates + majority vote	0.600	Multiple prompt variants fragmented the vote -- different styles produce different SQL that splits the majority
2 candidates + LLM judge	0.660	The judge adds noise on disagreements. When two candidates differ, the judge doesn't have execution feedback to break the tie
PreSQL + single enriched pass	0.610	Dynamic few-shot selection alone is weaker than fixed curated examples -- retrieval noise outweighs relevance gains
Self-refinement (generate, then verify/fix)	0.640	Without execution feedback, the model "fixes" correct queries into broken ones as often as it fixes broken ones
Self-consistency: 7 candidates, temp=0.8, majority vote	0.690	Same prompt, varied sampling -- correct SQL converges, errors scatter

The key insight from SQL-PaLM and PET-SQL: when errors are stochastic, the correct answer appears more often than any single incorrect answer across multiple samples. Self-consistency exploits this directly. Multi-prompt diversity (MCS-SQL) works at scale but fragments votes with only 7 candidates.

The winning approach generates 7 candidates per question at temperature 0.8 using the same prompt (5 fixed + 3 TF-IDF-retrieved few-shot examples), normalizes each SQL (uppercase, collapse whitespace, unify quotes), and picks the most common form. The correct SQL tends to converge; incorrect SQL scatters across different mistakes. (with_pl/solve.py)

What to steal

When errors are stochastic, vote instead of refining. If your failures aren't clustered around a single pattern, self-consistency (same prompt, varied temperature) outperforms self-refinement (generate then fix). Refinement without execution feedback introduces as many errors as it corrects.
Same prompt, varied sampling beats multiple prompts. Different prompt styles produce structurally different SQL that splits the vote. Temperature variation with one prompt keeps the style consistent while varying the logic -- the majority signal stays clean.
Know when to stop searching for a better approach. The PL agent tried five techniques. Four scored between 0.600 and 0.660 -- all worse than the baseline. The winner was a 7-candidate vote, the simplest of the five ideas. When the baseline is structurally sound, marginal gains come from reducing variance, not from architectural shifts.
Normalize before you vote. SQL has many equivalent surface forms (SELECT a, b vs SELECT a,b, single vs double quotes). Without normalization, identical queries split across formatting variants and the majority signal dissolves.

Results and honest cost

	Baseline	With PL	Delta
Execution Accuracy	0.650	0.690	+6%
API calls	100	700	7x
Tokens	108K	1.16M	11x
Runtime	161s	417s	2.6x

The baseline was already structurally sound -- good prompt, good few-shot examples, deterministic generation. The +6% comes from converting stochastic single-pass errors into a voting signal across 7 candidates. The 7x cost increase is modest in absolute terms (both runs cost well under $1), but the gain-per-dollar is thin. This is the right tradeoff when accuracy matters more than cost, and the wrong one when it doesn't.

Where this wouldn't help

Systematic errors -- if the model consistently mishandles a SQL pattern (e.g., always gets window functions wrong), voting 7 times produces 7 copies of the same mistake. Schema ambiguity -- if the question is genuinely ambiguous given the schema, no amount of sampling resolves it. Tight cost budgets at scale -- 7x calls per question adds up when you're generating thousands of queries per hour.

Reproduce

The only difference between the two agents was a single section of their prompt -- "use your own knowledge" vs. "use Paper Lantern MCP tools to research first." Everything else downstream was the agent's doing.

uv run python run.py setup text_to_sql
uv run python run.py run text_to_sql baseline
uv run python run.py run text_to_sql with_pl
uv run python run.py compare text_to_sql
diff prompts/baseline.md prompts/with_pl.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

Text to SQL - Execution Accuracy 0.650 → 0.690 (+6%)

Why the last 6% is hard

Baseline: few-shot prompting, one pass

Five approaches before the winner

What to steal

Results and honest cost

Where this wouldn't help

Reproduce

FilesExpand file tree

text_to_sql

Directory actions

More options

Directory actions

More options

Latest commit

History

text_to_sql

Folders and files

parent directory

README.md

Text to SQL - Execution Accuracy 0.650 → 0.690 (+6%)

Why the last 6% is hard

Baseline: few-shot prompting, one pass

Five approaches before the winner

What to steal

Results and honest cost

Where this wouldn't help

Reproduce