DD: Model Forge Design Decisions

Created: 2026m03d27

This document records design decisions behind the Model Forge prompts. Each decision is numbered, dated, and includes the reasoning so that future prompt revisions can judge whether the decision still holds.

DD-1: Symbol table redundancy is intentional (2026m03d27)

Decision: Load all three symbol files (symbols/index.rst, pet/symbols.rst, jub/symbols.rst) even though the grouped master index is a combination of the individual ones.

Cost: ~450 tokens of redundancy (the index is tiny).

Reasoning: The redundancy serves two distinct functions at two different moments in the agent’s reading sequence:

``pet/symbols.rst`` + ``jub/symbols.rst`` (loaded first): These are the decoding key — loaded before the axioms so that when the agent encounters P(x,y) or D_free in the expert axiom file, it already knows what they mean. This is the “learn the alphabet” step.
``symbols/index.rst`` (the grouped master): Loaded as a cross-model integration check — it shows how PET and JUB symbols relate to each other in one view. The agent sees the same symbols a second time but now in a comparative frame.

The reinforcement effect is real: symbols that appear twice are more reliably retained in the working context, especially in a 200K window where the symbols are read early and the actual model work happens 100K+ tokens later.

Verdict: Keep the redundancy. ~450 tokens is cheap insurance for the most load-bearing tokens in the context. The risk of the agent silently misinterpreting a symbol deep into Phase 3 far outweighs the cost.

DD-2: Anti-echo-chamber firewall via phased loading (2026m03d27)

Decision: The forge agent must form its independent assessment (Phase 1 / Seed) before seeing the user’s model (Phase 2 / Feed). The user’s model input is deliberately withheld from the initial context load.

Reasoning: If the agent reads the user’s informal model ideas at the same time as the formal foundations, it will pattern-match the new model to the existing structure and tend to confirm rather than challenge. The phased loading creates a natural adversarial checkpoint:

After Phase 1, the agent has expectations about what a new model should look like, where it will face resistance, and what gaps exist.
When the user’s model arrives in Phase 2, the agent compares it against those expectations. Violations of expectations are where the interesting bugs live.

This is analogous to blinding in experimental design: the assessor forms a hypothesis before seeing the data.
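One way to make the firewall structural rather than a matter of discipline is to have the context loader refuse the user's model until a Phase 1 assessment exists. The following is a minimal sketch under that assumption; ForgeContext and its method names are hypothetical and do not come from the forge prompts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ForgeContext:
    """Hypothetical holder for what the agent has seen so far."""
    seed_assessment: Optional[str] = None   # Phase 1 / Seed output
    user_model: Optional[str] = None        # Phase 2 / Feed input

    def record_seed_assessment(self, text: str) -> None:
        # The independent assessment only counts if it was written blind.
        if self.user_model is not None:
            raise RuntimeError("user model already loaded; assessment is not independent")
        self.seed_assessment = text

    def load_user_model(self, text: str) -> None:
        # Phase 2 stays closed until Phase 1 has produced expectations.
        if not self.seed_assessment:
            raise RuntimeError("Phase 2 blocked: no Phase 1 assessment recorded")
        self.user_model = text
```

Calling load_user_model on a fresh ForgeContext raises, which mirrors the blinding analogy: the data cannot arrive before the hypothesis.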

DD-3: StayC lifecycle replaces ad-hoc verdict scales (2026m03d27)

Decision: v2 prompts use the StayC maturity lifecycle (MM through SS) as the ONLY verdict system, replacing the ad-hoc IRON/STEEL/COPPER/SLAG scale from v1.

Reasoning: The v1 scale was invented on the spot during prompt drafting. It duplicated functionality that StayC already provides, but without the lifecycle semantics (NN as death valley with hope of rescue, JJ/KK as the structured path for blockers and terminal failures), the VVN attribution, or the iteration cycle formalism (QQv1 → NN_QQv1 → QQv2). Using StayC means:

Every verdict is comparable across sessions, models, and assessors.
The agent’s assessments carry VVN attribution (dv_ClaOp46Max_).
The NN → JJ → KK pipeline provides structured handling of failures instead of a binary “discard or keep.”

v1 to StayC mapping (for reference):

v1 Verdict	StayC Equivalent	Notes
IRON	PP (or higher)	Formal statement + proof, all tests HELD
STEEL	OO	Minor issues, repairable without restructuring
COPPER	NN (with rescue potential)	Significant issues, needs feeding to reach OO
SLAG	NN → JJ → KK	Fundamental flaw; assess if terminal (KK) or rescuable (JJ)

DD-4: Reference sheets produced in separate session (2026m03d27)

Decision: The pre-forge reference sheet generator runs in a separate session from the forge itself. Sheets are saved as files and loaded by the forge prompt.

Reasoning:

Different task type: Producing good reference sheets requires reading textbook-level material and distilling it. This is a synthesis task, not an adversarial testing task. Mixing them wastes forge working space on reference generation.
Avoids framing bias: If the forge agent produces its own reference sheets, it may unconsciously frame the summaries to support conclusions it has already formed. Sheets produced by a separate session arrive as neutral reference material.
Reusable: Sheets are produced once, reviewed by the user, and reused across many forge sessions. The cost is amortized.
Selective loading: Not every forge session needs all 4 reference areas. The forge prompt loads wb/*.rst — the user controls which sheets are present by what files exist in that directory.
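The selective-loading rule can be sketched as a plain directory glob: whatever .rst files exist in wb/ are loaded, and nothing else. This is an illustration only; load_reference_sheets is a hypothetical helper, and the real forge prompt loads the files through its own mechanism.

```python
from pathlib import Path
from typing import Dict


def load_reference_sheets(wb_dir: Path) -> Dict[str, str]:
    """Return {filename: text} for every .rst sheet present in wb_dir.

    The user controls the selection simply by adding or removing files.
    """
    return {
        path.name: path.read_text(encoding="utf-8")
        for path in sorted(wb_dir.glob("*.rst"))
    }
```

Sorting the paths keeps the load order stable between sessions, which makes two forge runs over the same directory easier to compare.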

DD-5: OKScale replaces HELD/BREACH/N/A ternary (2026m03d27)

Decision: Replace the three-state verdict system (HELD / BREACH / N/A) with the four-state OKScale (OK / KO / OKO / MIS). HELD and BREACH are retained as synonyms for OK and KO in narrative text.

Reasoning: The three-state system conflates two fundamentally different situations under “N/A”:

OKO (undetermined): The test was correctly applied but the outcome genuinely cannot be determined — either because more information is needed (temporal block) or because the question is formally undecidable (principled block). This is an honest “I tried and could not resolve it.”
MIS (misclassified/misapplied/mistake missed): The test was incorrectly applied, an earlier verdict was wrong, or a flaw was missed entirely. This is a self-correction mechanism.

Conflating these under “N/A” means the audit trail cannot distinguish between “we genuinely don’t know” and “we made a mistake.” For a system whose integrity depends on honest assessment, that distinction is load-bearing.
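As a rough illustration of the four-state type with its narrative synonyms, a Python Enum works well because members that share a value become aliases. The class below is a sketch, not the BioBinary specification; from_legacy is a hypothetical helper showing why a legacy "N/A" cannot be translated automatically.

```python
from enum import Enum


class OKScale(Enum):
    OK = "OK"    # test applied, claim held
    KO = "KO"    # test applied, claim breached
    OKO = "OKO"  # test applied correctly, outcome undetermined
    MIS = "MIS"  # test misapplied, or an earlier verdict was wrong
    # Same value as an earlier member, so Enum treats these as aliases.
    HELD = "OK"
    BREACH = "KO"


def from_legacy(verdict: str) -> OKScale:
    """Translate a three-state verdict name into an OKScale member."""
    if verdict == "N/A":
        # The old scale lost the distinction, so it has to be re-decided.
        raise ValueError("N/A is ambiguous: decide between OKO and MIS")
    return OKScale[verdict]
```

Because HELD and BREACH are aliases, OKScale.HELD is OKScale.OK evaluates to True and iterating over OKScale yields only the four formal states.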

Origin: The OKScale and its BioBinary data type were designed by LLoL for Evolvix. In biology, gene function is rarely cleanly binary: functional (OK), nonfunctional (KO), conditionally expressed (OKO), or misannotated (MIS). The same four states apply to formal claims under adversarial testing.

Full specification: OKScale — BioBinary Verdicts for Adversarial Testing.

On keeping HELD/BREACH as synonyms: HELD and BREACH are embedded in CLAUDE.md’s language rules, across 66 HELL findings, and in every SOCIAL-CARD-REVIEW block (245 files). Replacing them as primary terms would introduce a translation cost across the entire codebase. The right move is to use OK/KO/OKO/MIS as the formal OKScale data type and keep HELD/BREACH as domain-specific synonyms valid in narrative and HELL infrastructure. Accepting the naming change without first flagging its cost was a MIS on the auditor’s part.

DD-6: 2-track VVN and human advancement authority (2026m03d27)

Decision: Stage advancement in StayC is a human decision. The machine (Claude) may propose advancement but does not decide it. Both human (iv_) and machine (dv_) VVNs are recorded independently. The human track governs for publication.

Reasoning: If the machine could unilaterally advance claims, the system degrades into rubber-stamping: the machine proposes and approves its own assessments. The human veto ensures that every advancement reflects genuine human judgment, not just machine pattern-matching.

Conversely, the machine must be able to insist on its assessment (including downward assessments like NN) even when the human disagrees. If the machine silently defers to human preference, the adversarial function is lost. The 2-track system resolves this: both assessments stand in the record, divergence is flagged as data, and the human track governs for external-facing decisions.

Divergence as signal: When iv_ and dv_ disagree on a claim’s StayC level, that claim deserves extra scrutiny. The disagreement identifies exactly where human intuition and machine analysis see different things — which is precisely where the interesting bugs tend to live.
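The two-track record can be pictured as a small immutable structure in which both assessments are stored, divergence is derived, and only the human track feeds publication. The class and field names below are hypothetical; only the iv_/dv_ distinction comes from this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoTrackAssessment:
    """Hypothetical record of one claim's StayC level on both tracks."""
    claim_id: str
    iv_level: str  # human track (iv_)
    dv_level: str  # machine track (dv_)

    @property
    def diverges(self) -> bool:
        # Disagreement is kept as data, not resolved away.
        return self.iv_level != self.dv_level

    @property
    def publication_level(self) -> str:
        # The human track governs external-facing decisions.
        return self.iv_level
```

Freezing the dataclass means neither track can silently overwrite the other after the fact; a changed assessment has to be a new record.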

DD-7: Iteration cycles generalized across all stages (2026m03d27)

Decision: The version/rejection/revision iteration cycle (XXv1 → NN_XXv1 → XXv2) applies at every StayC stage (OO, PP, QQ, RR), not just QQ.

Reasoning: v1 and the initial v2 prompts described the iteration cycle only for QQ (adversarial quest). But refinement happens at every stage:

OO: The MVP is wobbly; each iteration makes it less so.
PP: The proof has gaps; each iteration closes one.
QQ: External critics find weaknesses; each iteration addresses one.
RR: Broad reviewers find issues; each iteration resolves one.

The same VVN machinery (version numbering, NN cross-referencing, chain documentation) works identically at all stages. Restricting it to QQ was an artificial limitation that would have forced claims to reach QQ before they could be iteratively refined — which contradicts the death-valley feeding model where NN → OO requires exactly this kind of iteration.
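The generalized cycle is mechanical enough to generate. The sketch below builds the XXv1 → NN_XXv1 → XXv2 chain for any of the four stages named above; iteration_chain is a hypothetical helper, and it covers only the version labels, not the full VVN format.

```python
from typing import List

ITERABLE_STAGES = ("OO", "PP", "QQ", "RR")


def iteration_chain(stage: str, rejections: int) -> List[str]:
    """Return the version/rejection labels after `rejections` rejected versions."""
    if stage not in ITERABLE_STAGES:
        raise ValueError(f"iteration cycle is not defined for stage {stage!r}")
    chain = [f"{stage}v1"]
    for version in range(1, rejections + 1):
        chain.append(f"NN_{stage}v{version}")   # rejection of this version
        chain.append(f"{stage}v{version + 1}")  # the revision that answers it
    return chain
```

For example, iteration_chain("QQ", 1) returns ["QQv1", "NN_QQv1", "QQv2"], and the same call with "OO" produces the identical shape one stage earlier.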

DD-8: Feed nurtures, Grow tests — never the reverse (2026m03d27)

Decision: Phase 2 (Feed) is collaborative formalization where the auditor helps informal ideas grow into formal statements. Phase 3 (Grow) is where the Iron Maiden stress-tests formalized claims. The Iron Maiden should be used with caution on raw MM-stage ideas: gentle steering in light of the tests that will come later is desirable, but full adversarial testing at MM risks needlessly killing ideas that are immature, not invalid. Ideas that are already known to fail should not be fed.

Reasoning: The v2 prompt (before this correction) had the Iron Maiden in Phase 2, immediately after the user shared informal ideas. This is equivalent to putting a seedling through a hurricane: the seedling dies not because it was a bad seed but because it was tested before it was ready. The result would be needlessly killing ideas for the wrong reasons — immaturity, not invalidity.

The Seed/Feed/Grow/Reap lifecycle maps directly:

Feed (Phase 2): Gardening. Soil, water, light. The auditor uses its knowledge of formal logic, mathematics, and the existing system to help the user express intuitions as formal statements. Claims move from MM to OO to PP. Potential problems are flagged gently as “things to watch for” — not as kill shots.
Grow (Phase 3): Trial by fire. The Iron Maiden opens only when claims are at OO or above and both the user and auditor agree they are ready for stress-testing. Claims that fail may return to Phase 2 for more feeding — the Feed ↔ Grow cycle is normal and expected.

The joint decision to move from Feed to Grow is critical. Neither side should rush the other. The auditor who pushes for testing too early wastes the user’s ideas. The user who resists testing too long risks building on unfounded assumptions. Both must agree.
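The gate between Feed and Grow reduces to a three-way conjunction. The sketch below assumes that "OO or above" means the stages OO, PP, QQ, RR, and SS; that reading of the lifecycle order is an assumption, and may_open_iron_maiden is a hypothetical name.

```python
# Assumption: these are the stages at "OO or above" in the StayC lifecycle.
TESTABLE_STAGES = frozenset({"OO", "PP", "QQ", "RR", "SS"})


def may_open_iron_maiden(stage: str, user_ready: bool, auditor_ready: bool) -> bool:
    """True only when the claim is mature enough and both sides agree."""
    return stage in TESTABLE_STAGES and user_ready and auditor_ready
```

Neither flag can substitute for the other, which is the point of the joint decision: an eager auditor cannot open the Iron Maiden alone, and neither can an eager user.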

Key insight: Most good ideas look fragile at MM. That fragility is immaturity, not invalidity. The auditor’s job in Feed is to see past the fragility to the underlying structure and help it grow.

DD-9: Pre-seeding HELL with development-phase insights (2026m03d27)

Decision: When the auditor discovers adversarial insights during Phase 2 (Feed) or Phase 3 (Grow) that are strong enough to constitute independent objections or defenses, these should be drafted as con/pro entries in the Reap output — ready for HELL integration before any external adversarial critique begins.

Reasoning: The Iron Maiden testing in Phase 3 is itself a form of adversarial review. If the auditor discovers a serious attack vector (even one that was successfully defended), that attack and its defense are valuable HELL content. Waiting for an external critic to independently rediscover the same attack wastes time and risks the external critic finding the attack without the defense.

What to pre-seed:

Con entries: Any KO result from the Iron Maiden that revealed a genuine structural weakness — even if it was subsequently repaired. The original attack remains valid against the original formulation and may apply to future revisions.
Pro entries: Any defense that resolved a KO — especially if the defense required a non-obvious insight. This pre-arms future defenders.
OKO entries: Undetermined results that could not be resolved during development. These are especially valuable: they tell future critics exactly where the open questions are, preventing wasted effort on already-explored territory.

Where they go: The forge Reap phase (Phase 4) should include a section: “Proposed HELL entries” with draft con and pro texts ready for integration into the model’s HELL structure. The human decides which to publish.
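The sorting of Iron Maiden results into proposed entries might look like the following. This is a sketch under simplifying assumptions: MaidenResult and propose_hell_entries are hypothetical names, verdicts are plain OKScale strings, and every KO is treated as a genuine structural weakness (in practice that judgment stays with the auditor and the human).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class MaidenResult:
    """Hypothetical record of one Iron Maiden test."""
    attack: str
    verdict: str                   # "OK", "KO", "OKO", or "MIS"
    defense: Optional[str] = None  # the repair that resolved a KO, if any


def propose_hell_entries(results: List[MaidenResult]) -> Dict[str, List[str]]:
    """Draft con/pro/OKO entries for the Reap section; the human decides what to publish."""
    proposed: Dict[str, List[str]] = {"con": [], "pro": [], "oko": []}
    for result in results:
        if result.verdict == "KO":
            # The attack stays valid against the original formulation.
            proposed["con"].append(result.attack)
            if result.defense:
                proposed["pro"].append(result.defense)
        elif result.verdict == "OKO":
            proposed["oko"].append(result.attack)
    return proposed
```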

DD-10: Structural documentation enforcement via LLog protocol (2026m03d27)

Decision: FORGE sessions are documented via a mandatory LLog protocol with named commands (IGNITE, HEAT, STRIKE, TEMPER, QUENCH, ROUND, BANK, EMBER) that enforce append-only logging as a side effect of doing the work — not as an optional extra step.

Problem: Claude documents well when explicitly reminded, then forgets in the next session. The user must constantly re-explain that documentation matters. This produces incomplete records and wastes the user’s effort on reminders rather than content.

Reasoning:

Structural, not voluntary: The documentation requirements are embedded in the forge prompt itself (forge_1m.rst and forge_200k.rst). Any fresh Claude session that loads the prompt inherits the protocol automatically. No prior conversation history needed.
Side effect, not extra step: Phase transition commands (HEAT, STRIKE, etc.) produce LLog entries as part of their execution. The agent does not “do the work, then document it” — the documentation IS part of doing the work.
Verbatim prompts, always: The user’s exact words are the most valuable part of the audit trail. Summaries can be wrong; abbreviations lose nuance; paraphrases introduce the documenter’s bias. The verbatim prompt is the ground truth.
Recovery from interruption: The EMBER command enables session continuity across context windows. Because the LLog is written to disk at every response, context exhaustion does not lose work — only the current in-context state is lost, and EMBER reconstructs it from the LLog.
External review: The LLog is readable by anyone. A reviewer who disagrees with the conclusions can read the exact sequence of prompts, responses, and decisions that produced them. This is essential for the system’s credibility — mathematical results are only as trustworthy as their derivation is transparent.

Enforcement mechanism: Rule 1 of the protocol states: “No response without a log entry.” This is the load-bearing rule. If the agent responds without logging, the response is unrecoverable if the session is interrupted. Making logging the first action (before generating the response content) ensures the audit trail is never behind.
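The "log first, then respond" ordering can be sketched with an append-only file. The JSON-lines format, the function names, and the record fields below are assumptions made for illustration; the actual entry format is defined by the protocol specification, not here.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List

COMMANDS = {"IGNITE", "HEAT", "STRIKE", "TEMPER", "QUENCH", "ROUND", "BANK", "EMBER"}


def append_entry(llog: Path, record: Dict[str, str]) -> None:
    # Mode "a" only ever adds to the end of the file.
    with llog.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def handle(llog: Path, command: str, verbatim_prompt: str,
           generate: Callable[[str], str]) -> str:
    """Log the user's exact words before any response content is generated."""
    if command not in COMMANDS:
        raise ValueError(f"unknown LLog command: {command}")
    append_entry(llog, {"command": command, "prompt": verbatim_prompt})
    response = generate(verbatim_prompt)
    append_entry(llog, {"command": command, "response": response})
    return response


def ember(llog: Path) -> List[Dict[str, str]]:
    """Rebuild the entry sequence from disk after an interruption."""
    with llog.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

If generate fails or the session dies mid-response, the prompt entry is already on disk, so a later ember call still sees what was asked.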

Full specification: FORGE Session Documentation Protocol.

DD-11: WisdomBase (wb/) replaces ref/ for reference sheets (2026m03d27)

Decision: The ref/ directory holding FORGE reference sheets is renamed to wb/ (WisdomBase), adopting the naming convention from Evolvix.

Reasoning: WisdomBase is an established term in the Evolvix ecosystem for distilled generic wisdom from diverse disciplines. The FORGE reference sheets are exactly this: distilled, discipline-specific wisdom (category theory, dynamical systems, etc.) organized for applied use. Adopting the WB convention:

Aligns FORGE with Evolvix naming conventions
Distinguishes the distilled wisdom (wb/) from the generators that produce it (pre-forge-compiler-refsheet*.rst)
Signals that the sheets are a knowledge base, not just references

Pre-forge scripts stay outside wb/: The pre-forge compiler prompts remain in the forge/ root directory because they are the machinery that produces WB content, not the content itself. Keeping them separate maintains the generator/product distinction.

DD-12: Delayed counting and underscores for LLog numbering (2026m03d27)

Decision: LLog numbering uses the HELL-compatible delayed counting scheme (a1–a9, b10–b99, c100–c999) for sessions, rounds, and entries. Session entry labels and filenames use underscores as separators. Dates use YYYYmMMdDD format. The date in a session ID is the start date — if a session spans multiple days, it keeps the date of its FORGE:IGNITE.

Examples:

Session directory: sa1_2026m03d27/
Entry label: forge_sa1_2026m03d27_ra1_heat_ea1
Display: Forge_Sa1_2026m03d27 | Round a1 | HEAT | Entry a1

The ``forge_`` prefix is a namespace. PROMY will use promy_, SISYF will use sisyf_. This prevents label collisions when multiple compilers have LLog sessions.
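The numbering and label rules above can be sketched in a few lines. The helper names are hypothetical, and continuing the letter sequence beyond c (d1000 and so on) is an assumption extrapolated from the a/b/c pattern.

```python
from datetime import date


def delayed_count(n: int) -> str:
    """1 -> 'a1', 10 -> 'b10', 100 -> 'c100': the letter encodes the digit count."""
    if n < 1:
        raise ValueError("counting starts at 1")
    digits = str(n)
    if len(digits) > 26:
        raise ValueError("number too large for a single-letter prefix")
    return chr(ord("a") + len(digits) - 1) + digits


def forge_date(day: date) -> str:
    """Format a date as YYYYmMMdDD."""
    return f"{day.year:04d}m{day.month:02d}d{day.day:02d}"


def entry_label(compiler: str, session: int, start: date,
                round_no: int, command: str, entry: int) -> str:
    """Build an underscore-separated entry label; `start` is the FORGE:IGNITE date."""
    return "_".join([
        compiler,
        "s" + delayed_count(session),
        forge_date(start),
        "r" + delayed_count(round_no),
        command.lower(),
        "e" + delayed_count(entry),
    ])
```

With these definitions, entry_label("forge", 1, date(2026, 3, 27), 1, "HEAT", 1) reproduces the example label forge_sa1_2026m03d27_ra1_heat_ea1, and swapping the first argument for promy or sisyf changes only the namespace.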

Why delayed counting (not zero-padded fixed-width):

The fixed-width scheme (s001, s002) spends digits on placeholder zeros and still breaks sort order once the count outgrows the chosen width (s1000 sorts before s999). Dropping the padding breaks it even earlier: s9 < s10 fails in lexicographic sort, whereas sa9 < sb10 succeeds because a < b. Delayed counting is strictly better: no wasted digit, correct sort order both within and across prefix groups, and graceful scaling from single-digit to thousand-plus entries. Using the same scheme as HELL ensures a single counting convention across the entire matheology system.
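The sort-order argument is easy to check directly. The snippet below compares plain string sorting of unpadded labels with delayed-count labels; session_label is a hypothetical one-line helper written only for this check.

```python
def session_label(n: int) -> str:
    """Delayed-count session label: the letter encodes the number of digits."""
    digits = str(n)
    return "s" + chr(ord("a") + len(digits) - 1) + digits


unpadded = [f"s{n}" for n in (9, 10, 100)]
delayed = [session_label(n) for n in (9, 10, 100)]

print(sorted(unpadded))  # ['s10', 's100', 's9']  (numeric order lost)
print(sorted(delayed))   # ['sa9', 'sb10', 'sc100']  (numeric order kept)

# The property holds for the whole range covered by the a/b/c prefixes.
all_labels = [session_label(n) for n in range(1, 1000)]
assert sorted(all_labels) == all_labels
```

Within a prefix group all labels have the same length, so string comparison agrees with numeric comparison; across groups the letter decides.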

Why underscores (not hyphens) for session IDs:

The initial design used hyphens for consistency with existing RST labels (compiler-forge, forge-llog-protocol). This was revised after recognizing that session IDs serve a fundamentally different purpose than structural labels.

Unique-token searchability: Search engines generally treat an underscore-joined string as a single token. forge_sa1_2026m03d27_ra1_heat_ea1 is one opaque string that should return essentially one hit: the specific LLog entry being discussed. By contrast, hyphens would split it into common pieces (forge, sa1, 2026m03d27) that drown in noise. For a system that aims to support global academic discussion about specific formal reasoning steps, identifier discoverability is more valuable than word discoverability.
Consistency with VVN: The VVN system uses underscores throughout (iv_LLoL_OOv1r0p0_2026m03d27). Session IDs live in the same namespace of “unique searchable identifiers.” One convention, not two.
The rare date format amplifies uniqueness: 2026m03d27 is already unusual (most systems use 2026-03-27). Combined with forge_sa1_ as prefix, the full token is effectively globally unique. Google becomes an index into the matheology discussion space.
LaTeX escaping is handled by Sphinx: The rst2latex pipeline escapes underscores automatically. No RST user ever encounters \_. Only raw-LaTeX authors would, and they are expert enough to handle a backslash.
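The single-token argument can be illustrated with the word-character class of Python's re module, where the underscore counts as a word character and the hyphen does not. Real search engines use their own tokenizers, so this shows the general convention rather than the behavior of any particular engine.

```python
import re

WORD = re.compile(r"\w+")  # \w matches letters, digits, and the underscore

underscore_id = "forge_sa1_2026m03d27_ra1_heat_ea1"
hyphen_id = underscore_id.replace("_", "-")

print(WORD.findall(underscore_id))
# ['forge_sa1_2026m03d27_ra1_heat_ea1']
print(WORD.findall(hyphen_id))
# ['forge', 'sa1', '2026m03d27', 'ra1', 'heat', 'ea1']
```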

Two label conventions coexist:

Structural labels (infrastructure, few, static): hyphens. compiler-forge, forge-llog-protocol, forge-aha-quickstart. These are page anchors, not search targets.
Session entry labels (dynamic, many, searchable): underscores. forge_sa1_2026m03d27_ra1_heat_ea1. These are unique identifiers that recruit search engines as discussion indices.

The distinction is clean: structural labels name pages; session labels name moments in a reasoning process.

Human-readable display: In human-facing LLog text, session IDs use the capitalized POST convention: Forge_Sa1_2026m03d27. In RST label directives, they are lowercase: forge_sa1_2026m03d27.
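Converting between the two forms is a matter of splitting on underscores. The sketch below derives the display line from an entry label; the function names are hypothetical, and the layout is inferred from the single Display example in this section.

```python
def display_session_id(session_label: str) -> str:
    """'forge_sa1_2026m03d27' -> 'Forge_Sa1_2026m03d27'."""
    compiler, session, start_date = session_label.split("_")
    # The date keeps its lowercase m and d; only the first two parts are capitalized.
    return "_".join([compiler.capitalize(), session.capitalize(), start_date])


def display_entry(entry_label: str) -> str:
    """Render a full entry label as the human-facing display line."""
    compiler, session, start_date, round_part, command, entry_part = entry_label.split("_")
    session_id = display_session_id("_".join([compiler, session, start_date]))
    # round_part and entry_part carry a one-letter type prefix ('r', 'e').
    return f"{session_id} | Round {round_part[1:]} | {command.upper()} | Entry {entry_part[1:]}"
```

Applied to forge_sa1_2026m03d27_ra1_heat_ea1, display_entry returns "Forge_Sa1_2026m03d27 | Round a1 | HEAT | Entry a1", matching the Display example above.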