Reference Sheet 11: Information Theory & Model Complexity

Target audience: Forge auditor who knows S5 modal logic, CEM, FOL, basic game theory, and the contents of Sheets 1–10 but needs information theory to assess axiom redundancy, compare model formulations, and measure the intrinsic complexity of the matheology system.

1. Orientation

The matheology system has 25 axioms, 11 theorems, 2 models, and 66+ HELL findings. Is this the simplest system that captures the intended content, or is there redundancy? Are some axioms doing double duty while others contribute little? When two formulations of the same axiom exist, which is “better”? These are questions about information — how much content each component carries, how much overlap exists between components, and what the minimum description of the system is.

Information theory provides three tools for this:

Entropy — measuring how much information a component carries.
Mutual information — measuring how much two components share.
Complexity and description length — measuring the intrinsic size of the system and comparing it to alternatives.

None of the existing 10 sheets address model comparison, redundancy detection, or parsimony. Category theory (Sheet 1) can detect structural redundancy (isomorphic objects); information theory detects quantitative redundancy (near-duplication, partial overlap, compression potential).

2. Key Concepts

Shannon entropy. For a discrete random variable X with probability distribution p(x₁),…,p(xₙ), the entropy is H(X) = −Σ p(xᵢ) log₂ p(xᵢ). It measures the average information content per outcome — or equivalently, the minimum average number of binary questions needed to determine X’s value. Maximum entropy = maximum uncertainty; zero entropy = completely determined.

Matheology use: If axioms are treated as “messages” in a formal system, the entropy of the axiom system measures how much information it contains. A system where all axioms are equally “surprising” (uniform distribution of derivable content) has maximum entropy. A system where one axiom dominates (most theorems follow from ax1 alone) has lower entropy — and may have redundant axioms.
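The contrast between a uniform and a dominated system can be made concrete with a minimal sketch. It assumes, purely for illustration, that "derivable content" is summarized as a count of how many derivation steps cite each axiom; the counts below are hypothetical and not taken from the matheology system.

```python
import math

def shannon_entropy(probs):
    """Entropy in bits of a discrete distribution given as probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def entropy_from_counts(counts):
    """Normalize raw counts to probabilities, then compute entropy in bits."""
    total = sum(counts)
    return shannon_entropy(c / total for c in counts)

# Hypothetical citation counts for four axioms (assumed data, illustration only).
uniform_counts = [5, 5, 5, 5]      # every axiom is used equally often
dominated_counts = [17, 1, 1, 1]   # one axiom carries most derivations

h_uniform = entropy_from_counts(uniform_counts)      # log2(4) = 2 bits, the maximum
h_dominated = entropy_from_counts(dominated_counts)  # well below the maximum
```

A low value for the dominated case does not prove redundancy by itself; it only flags that most derivations route through one axiom.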

Conditional entropy and mutual information. Conditional entropy H(Y|X) measures the remaining uncertainty in Y once X is known. Mutual information I(X;Y) = H(Y) − H(Y|X) = H(X) − H(X|Y) measures how much knowing X tells you about Y (and vice versa). If I(X;Y) = 0, X and Y are independent; if I(X;Y) = H(Y), X completely determines Y.

Matheology use: I(A_i; A_j) between two axioms measures their informational overlap. If I(ax1; ax2) is large, they are informationally redundant — one could potentially be derived from the other (or both from a simpler shared source). If I(A_i; T_j) = H(T_j), then axiom A_i completely determines theorem T_j — that axiom is the sole informational source for that theorem.
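Mutual information needs a joint distribution, which axioms do not come with. One possible modeling choice (an assumption of this sketch, not something the sheets specify) is to treat X and Y as binary indicators of whether A_i and A_j hold across a sampled family of candidate structures. Given such a joint table, the computation is:

```python
import math

def mutual_information(joint):
    """I(X;Y) in bits from a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]          # marginal of X
    py = [sum(col) for col in zip(*joint)]    # marginal of Y
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:
                mi += pxy * math.log2(pxy / (px[i] * py[j]))
    return mi

# Hypothetical joint tables (illustration only).
independent = [[0.25, 0.25], [0.25, 0.25]]   # X tells nothing about Y
determined = [[0.5, 0.0], [0.0, 0.5]]        # X fixes Y completely

mi_independent = mutual_information(independent)  # 0 bits
mi_determined = mutual_information(determined)    # 1 bit, equal to H(Y)
```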

Kolmogorov complexity. For a string s, the Kolmogorov complexity K(s) is the length of the shortest program (on a fixed universal Turing machine) that outputs s. It measures the intrinsic complexity of an object — how much it can be compressed. K(s) is uncomputable (no algorithm can compute it for all s) but can be approximated from above (any specific compression is an upper bound).

Matheology use: K(axiom-system) measures the intrinsic complexity of the 25-axiom system. If a reformulation with 20 axioms produces the same theorems, and K(20-axiom system) < K(25-axiom system) by more than the additive constant that depends on the choice of universal machine, the shorter formulation is objectively simpler: not merely syntactically shorter but informationally more compressed.
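Since K itself is uncomputable, any practical comparison uses a compressor as an upper-bound proxy. The sketch below uses Python's standard zlib module; the string is a placeholder, not real axiom text, and the result is only meaningful when both candidates are compressed the same way.

```python
import zlib

def compressed_bits(text: str, level: int = 9) -> int:
    """Upper-bound proxy for K(text): bits in the zlib-compressed UTF-8 encoding."""
    return 8 * len(zlib.compress(text.encode("utf-8"), level))

# Placeholder text with obvious repetition (assumed, illustration only).
repetitive = "for all x, x is contained in G. " * 40
raw_bits = 8 * len(repetitive.encode("utf-8"))

# A ratio far below 1 signals structure that a shorter formulation could exploit.
ratio = compressed_bits(repetitive) / raw_bits
```

The proxy includes fixed compressor overhead, so it is unreliable for very short texts.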

Minimum Description Length (MDL). A model-selection principle: the model that achieves the shortest total description of data + model is preferred. Total description length = L(model) + L(data | model), where L(model) is the complexity of the model specification and L(data | model) is the description length of the data encoded using the model. MDL formalizes Occam’s razor: prefer the simplest explanation that fits the evidence.

Matheology use: The “data” is the set of phenomena the matheology system accounts for (scriptural convergence, theodicy structure, economic mechanism properties). The “model” is the axiom system. MDL asks: is 25 axioms the shortest description of this data, or could 18 or 30 be shorter? (Fewer axioms = shorter model but possibly longer data-encoding; more axioms = longer model but possibly shorter data-encoding due to richer derivation power.)
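The trade-off in the parenthesis can be seen in a toy two-part code. This sketch swaps the axiom system for a much simpler stand-in: the "data" is a 0/1 sequence, the "simple" model is a fair coin with nothing to specify, and the "richer" model states a bias at an assumed cost of 8 bits. All numbers are hypothetical.

```python
import math

def data_bits(bits, p_one):
    """L(data | model): ideal code length of a 0/1 sequence under Bernoulli(p_one)."""
    return -sum(math.log2(p_one if b else 1.0 - p_one) for b in bits)

def total_length(bits, p_one, model_bits):
    """Two-part description length L(model) + L(data | model)."""
    return model_bits + data_bits(bits, p_one)

# Plenty of data: the richer model pays for itself.
long_data = [1] * 90 + [0] * 10
simple_long = total_length(long_data, 0.5, model_bits=0)
richer_long = total_length(long_data, 0.9, model_bits=8)

# Very little data: the extra model cost is not recovered.
short_data = [1, 1, 1, 0]
simple_short = total_length(short_data, 0.5, model_bits=0)
richer_short = total_length(short_data, 0.75, model_bits=8)
```

The same logic applies to axiom counts: extra axioms are justified only when they shorten the encoding of the phenomena by more than they cost to state.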

Algorithmic mutual information. I_K(x;y) = K(x) + K(y) − K(x,y) measures the informational overlap between objects x and y in the Kolmogorov framework. If I_K(PET; JUB) is large, the two models share significant structural information — which should be factored out rather than repeated.

Matheology use: The alignment echoes between PET and JUB are symptoms of high algorithmic mutual information. Factoring this out means identifying the shared core (the axioms common to both, or the shared structure that both instantiate). The shared core’s complexity measures how “deep” the alignment is.
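A compression-based proxy for I_K can be sketched by replacing each K with a compressed size and K(x,y) with the compressed size of the concatenation. The inputs below are seeded random bytes standing in for model text (an assumption made so the shared core is known exactly); real PET and JUB sources would replace them.

```python
import random
import zlib

def c(data: bytes) -> int:
    """Compressed size in bytes, an upper-bound proxy for K up to overhead."""
    return len(zlib.compress(data, 9))

def approx_mutual_info(x: bytes, y: bytes) -> int:
    """Proxy for I_K(x;y) = K(x) + K(y) - K(x,y), using c(x + y) for K(x,y)."""
    return c(x) + c(y) - c(x + y)

def noise(rng: random.Random, n: int) -> bytes:
    """n pseudo-random bytes, effectively incompressible."""
    return bytes(rng.getrandbits(8) for _ in range(n))

rng = random.Random(0)
shared_core = noise(rng, 2000)               # synthetic common structure
pet_like = shared_core + noise(rng, 500)     # stand-in for PET
jub_like = shared_core + noise(rng, 500)     # stand-in for JUB
unrelated = noise(rng, 2500)                 # shares nothing with pet_like

overlap_shared = approx_mutual_info(pet_like, jub_like)      # near the core size
overlap_unrelated = approx_mutual_info(pet_like, unrelated)  # near zero
```

zlib only finds repeats within a 32 KB window, so larger texts would need a compressor with a longer range.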

Rate-distortion theory. Given a source X and a distortion measure d, the rate-distortion function R(D) gives the minimum number of bits per symbol needed to describe X with average distortion ≤ D. At D = 0 (lossless), R(0) = H(X) for a discrete source whose distortion is zero only on exact reproduction. As D increases (more distortion tolerated), R(D) decreases until it reaches zero.

Matheology use: The “easy” and “expert” audience-depth views in the BEST naming system (5D label space) are lossy compressions of the full model. Rate-distortion theory measures the tradeoff: how much information does the “easy” view lose (distortion D) and how much shorter is it (rate R)? This formalizes the compiler’s audience-depth trade-off.
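The shape of the trade-off is easiest to see in the one case with a simple closed form: a Bernoulli(p) source under Hamming distortion, where R(D) = H_b(p) − H_b(D) for D below min(p, 1 − p) and zero beyond. This is a textbook toy case, not a model of the audience-depth views themselves.

```python
import math

def binary_entropy(p):
    """H_b(p) in bits; defined as 0 at the endpoints."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def rate_distortion_bernoulli(p, d):
    """R(D) in bits per symbol for a Bernoulli(p) source, Hamming distortion."""
    if d >= min(p, 1 - p):
        return 0.0   # this much distortion is reachable without sending anything
    return binary_entropy(p) - binary_entropy(d)

lossless = rate_distortion_bernoulli(0.5, 0.0)   # equals H(X) = 1 bit
lossy = rate_distortion_bernoulli(0.5, 0.1)      # fewer bits once 10% error is allowed
free = rate_distortion_bernoulli(0.5, 0.5)       # 0 bits
```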

Conditional Kolmogorov complexity. K(y|x) = length of the shortest program that outputs y given x as input. It measures how much additional information y contains beyond what is already in x. K(JUB | PET) measures how much genuinely new content the JUB model adds beyond PET.

Matheology use: K(JUB | PET) quantifies the “novelty” of the JUB extension. If K(JUB | PET) is small relative to K(JUB), then JUB mostly repackages PET content. If large, JUB contains substantial new information. Similarly, K(th8 | ax15,…,ax25) measures how much th8 adds beyond what the axioms already state — a near-zero value means th8 is essentially a restatement.
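One way to approximate K(y | x) in practice is to compress y with x supplied as a preset dictionary, so that anything y copies from x costs almost nothing. The sketch uses the zdict parameter of zlib.compressobj; the byte strings are seeded synthetic stand-ins (an assumption), not the actual models.

```python
import random
import zlib

def conditional_bytes(y: bytes, x: bytes) -> int:
    """Proxy for K(y | x): size of y compressed with x as a preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=x)
    return len(comp.compress(y) + comp.flush())

def noise(rng: random.Random, n: int) -> bytes:
    """n pseudo-random bytes, effectively incompressible."""
    return bytes(rng.getrandbits(8) for _ in range(n))

rng = random.Random(1)
pet_like = noise(rng, 3000)            # stand-in for PET
jub_repackaged = pet_like[500:2500]    # content copied entirely from pet_like
jub_novel = noise(rng, 2000)           # content unrelated to pet_like

cost_repackaged = conditional_bytes(jub_repackaged, pet_like)  # tiny
cost_novel = conditional_bytes(jub_novel, pet_like)            # about its full size
```

Only the last 32 KB of the dictionary is usable by zlib, which limits this proxy to short conditioning texts.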

3. Critical Theorems

Shannon’s source coding theorem. A source with entropy H can be losslessly compressed to an average rate arbitrarily close to H bits per symbol, but no lower. Any compression below H loses information (lossy); any encoding above H is wasteful (redundant). Why it matters: The axiom system has an intrinsic information rate. If the current 25-axiom formulation uses more “bits” than necessary (some axioms partially duplicate others), the system is over-specified. Shannon’s theorem sets the theoretical minimum.

Kolmogorov’s incompressibility theorem. For any fixed description method, fewer than a fraction 2^{−c} of the strings of length n can be compressed to fewer than n − c bits. Most strings are incompressible. Informally: randomness is the norm; compressibility is the exception that signals structure. Why it matters: If the axiom system is highly compressible (its Kolmogorov complexity is much less than its syntactic length), there is deep structure that a more elegant formulation could capture. If it is nearly incompressible, the axioms are already near-optimal: each one contributes irreducible content.
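The bound follows from a counting argument: there are 2^n binary strings of length n, but fewer than 2^{n−c} binary descriptions shorter than n − c bits, and each description yields at most one string. Written out, with p ranging over binary descriptions:

```latex
\#\{\,p : |p| < n - c\,\} \;=\; \sum_{k=0}^{n-c-1} 2^{k} \;=\; 2^{\,n-c} - 1
\quad\Longrightarrow\quad
\frac{\#\{\,s \in \{0,1\}^{n} : K(s) < n - c\,\}}{2^{n}}
\;\le\; \frac{2^{\,n-c} - 1}{2^{n}} \;<\; 2^{-c}.
```

For c = 10, fewer than one string in a thousand can be shortened by more than ten bits.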

Data processing inequality. If X → Y → Z is a Markov chain, then I(X;Z) ≤ I(X;Y). Processing data can only destroy information, never create it. Why it matters: The compilation pipeline (source axioms → SISYF → compiled views → audience-depth projections) is a processing chain. At each stage, information about the original axioms can only decrease. The inequality caps what any downstream stage can retain at what its input retained, so no later stage can recover content lost earlier. Estimating the mutual information with the source axioms at each stage then identifies the bottleneck where most information is lost.
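A numeric check of the inequality on a two-stage chain can be sketched as follows. Each stage is modeled as a binary symmetric channel with an assumed flip probability; these are stand-ins for compilation stages, chosen only because the mutual information is easy to verify.

```python
import math

def mutual_information(joint):
    """I(X;Y) in bits from a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        pxy * math.log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

def bsc(flip):
    """Binary symmetric channel as a table channel[input][output]."""
    return [[1 - flip, flip], [flip, 1 - flip]]

def joint_through(px, channel):
    """Joint table p(x, out) for input distribution px and a channel."""
    return [[px[x] * pr for pr in channel[x]] for x in range(len(px))]

def compose(ch1, ch2):
    """Channel X -> Z obtained by chaining X -> Y -> Z."""
    return [
        [sum(ch1[x][y] * ch2[y][z] for y in range(len(ch2))) for z in range(len(ch2[0]))]
        for x in range(len(ch1))
    ]

px = [0.5, 0.5]
stage1, stage2 = bsc(0.1), bsc(0.2)   # hypothetical per-stage loss
i_xy = mutual_information(joint_through(px, stage1))
i_xz = mutual_information(joint_through(px, compose(stage1, stage2)))
```

Here i_xz comes out smaller than i_xy, as the inequality requires; comparing the drop across stages is what locates the bottleneck.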

Noiseless coding theorem (Kraft–McMillan inequality). A uniquely decodable binary code with codeword lengths l_i exists if and only if Σ 2^{−l_i} ≤ 1. The optimal code assigns length l_i ≈ −log₂ p(x_i) to symbol x_i, giving an average length within one bit of the entropy. Why it matters: The BEST naming convention (5D label space) is a code for addressing model elements. The Kraft inequality constrains how compact this naming system can be. If the current labels are on average longer than −log₂ p(x_i) for how often each element is referenced, there is room for a more efficient naming convention.
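Checking a proposed set of label lengths against the inequality is mechanical. The sketch below uses exact fractions to avoid rounding, and derives lengths from an assumed reference-frequency distribution; the probabilities are hypothetical, not measured from the BEST label space.

```python
import math
from fractions import Fraction

def kraft_sum(lengths):
    """Exact value of the sum of 2^(-l) over binary codeword lengths."""
    return sum(Fraction(1, 2 ** l) for l in lengths)

def shannon_lengths(probs):
    """Codeword lengths ceil(-log2 p), which always satisfy the Kraft inequality."""
    return [math.ceil(-math.log2(p)) for p in probs]

# Hypothetical reference frequencies for four labels (illustration only).
probs = [0.5, 0.25, 0.125, 0.125]
lengths = shannon_lengths(probs)       # [1, 2, 3, 3]
feasible = kraft_sum(lengths) <= 1     # a uniquely decodable code exists

# Lengths that are too short for any uniquely decodable code.
too_short = kraft_sum([1, 1, 2]) <= 1  # False: the sum is 5/4
```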

Solomonoff induction / Universal prior. The probability of a string x under the universal prior is m(x) = Σ 2^{−|p|} where the sum is over all programs p that output x. This assigns higher probability to simpler (shorter program) objects. Why it matters: If two axiom formulations produce the same theorems, Solomonoff’s universal prior prefers the simpler one — the one with lower Kolmogorov complexity. This is a formally grounded version of Occam’s razor that applies directly to axiom system design.

4. Common Pitfalls

Treating Kolmogorov complexity as computable. K(s) is uncomputable — there is no algorithm that computes K(s) for all s. You can only upper-bound K(s) by exhibiting a specific compression. Claims about “the” Kolmogorov complexity of the axiom system are approximations, not exact values. Use compression algorithms (gzip, LZ77) as practical upper bounds.

Confusing syntactic length with informational content. A verbose axiom statement may carry the same information as a terse one. Syntactic length (character count) ≠ Kolmogorov complexity (minimum program length) ≠ Shannon entropy (information rate). A 200-word axiom may be informationally equivalent to a 50-word one if the extra 150 words are redundant elaboration.

Ignoring the data term in MDL. MDL measures model + data complexity jointly. A simpler model that fits the data poorly (high L(data | model)) may have greater total description length than a complex model with excellent fit. The goal is minimum total length, not minimum model complexity alone. Stripping axioms until the system is “simple” may increase total complexity if the remaining axioms cannot account for the phenomena.

Treating mutual information as causation. I(A_i; A_j) > 0 means axioms A_i and A_j share information — not that one causes or implies the other. The shared information may come from a common source (both derive from a deeper principle) rather than a direct logical dependency.

Applying continuous entropy to discrete objects. Shannon entropy for continuous variables (differential entropy) can be negative and lacks the “minimum bits” interpretation. The matheology system’s axioms are discrete objects; use discrete entropy. If you model parameters as continuous (e.g., the redistribution rate in ax25), use differential entropy with appropriate caveats.
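A one-line worked case shows why the "minimum bits" reading fails for differential entropy. Take X uniform on [0, a]:

```latex
X \sim \mathrm{Uniform}[0, a]:\qquad
h(X) \;=\; -\int_{0}^{a} \frac{1}{a}\,\log_2\frac{1}{a}\,dx \;=\; \log_2 a,
\qquad a = \tfrac{1}{2} \;\Rightarrow\; h(X) = -1 \text{ bit}.
```

A negative value cannot be a code length, and it changes if the parameter is rescaled, so differential entropies are only comparable between quantities measured in the same units.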

5. Bridge to Matheology

Axiom redundancy analysis. Compute (or approximate) pairwise mutual information I(A_i; A_j) for all 300 axiom pairs (25 choose 2). High mutual information pairs are candidates for merging or for identifying a shared deeper principle. Example: if I(ax1; ax4) ≈ H(ax4), then ax4 (Universal Immanence) is almost completely determined by ax1 (Containment) — and may be derivable as a theorem rather than stated as an axiom.

PET vs. JUB independence. Compute K(JUB | PET). If this is close to K(JUB), the two models are informationally independent — JUB contributes genuinely new content. If close to 0, JUB mostly repackages PET, and the 11 “new” axioms may be derivable from the first 14.

Model comparison across versions. For two versions of the same model (PPv1r0p0 vs. PPv1r1p1), compute K(V2 | V1) — the “edit complexity” or informational novelty of the revision. Small K(V2 | V1) means a minor refinement; large means a significant restructuring. This quantifies what the VVN system (Sheet 2) tracks qualitatively.

HELL findings as information. Each HELL finding carries information about the axiom system’s weaknesses. The total information in the HELL database H(HELL) measures how much adversarial content the system has absorbed. I(HELL; Axioms) measures how much of this information actually pertains to the axioms (vs. noise or tangential objections).

Compression as a formalization guide. If the 25-axiom system compresses to (say) 60% of its syntactic length, the 40% redundancy indicates where simplification is possible. The compression algorithm’s dictionary reveals which phrases and structures are repeated — pointing directly at the shared patterns that could be factored into definitions or lemmas.

Rate-distortion for audience depth. The “easy” view of an axiom is a lossy compression. Rate-distortion theory gives the minimum information loss for a given compression ratio. If the easy view loses more information than the rate-distortion bound requires, the compression is suboptimal — a better “easy” summary exists that preserves more content at the same length.

New questions information theory enables:

What is the approximate Kolmogorov complexity of the 25-axiom system? (Upper-bounded by compressing the axiom text with a standard algorithm.)
Which axiom pair has highest mutual information? This identifies the strongest candidate for redundancy or for a shared underlying principle.
Is the PET+JUB combined system more or less complex than the sum of its parts? (If K(PET, JUB) is substantially less than K(PET) + K(JUB), there is shared structure being double-counted.)
What is the information-theoretic “cost” of adding each HELL finding to the system? Findings that add little information (high redundancy with existing findings) may be deprioritized.