preprint / code world models ii / tiny-jepa · 1.3 M params

From Early-Loop Signal to Readout Repair: Deep Supervision and VICReg in Tiny JEPA Code Models

The stabilized tiny JEPA encoder compresses the transition signal $12\times$ across loops. Direct delta prediction is a genuine negative result. Deep supervision on intermediate loops preserves the signal, but high lift can mask a collapsed readout. VICReg at the readout delivers a ~20× improvement in discriminability and lets a 1.3 M-parameter model match the 125.9 M-parameter UnixCoder on CommitPackFT edit retrieval at 10× lower CPU latency.

0x01 Video companion

A ten-scene Manim walk-through covering the signal geometry, deep supervision, the VICReg readout repair, and the foundation-model head-to-head. It is a curated Paper 1 → Paper 2 narrative, embedded at 480p15 for quick viewing; download the HD build for presentations.

10 scenes · manim community · HD download: 1080p60 · 5.6 MB

0x02 Paper 1 foundation

This paper builds on the near-static target encoder fix from Paper 1. That work identified a ${\sim}700$-step training ceiling caused by target-encoder collapse and showed that holding the target approximately static enables stable 15K-step training at 1.1 M parameters.

P1 · Diagnosing and Fixing Encoder Collapse in Tiny JEPA Code Transition Models. Akbulut, 2026 · Copy-baseline diagnostic, near-static EMA fix, 0.6 MB Rust/WASM deployment

0x03 Abstract

Stabilized tiny JEPA code models prevent training collapse but leave open where transition information lives inside the model. Geometry probes show that the six-loop encoder compresses the transition signal by roughly $12\times$ across loops, down to an effective rank of $5/128$ at the readout, and direct delta prediction remains a genuine negative result: consecutive code states are too similar to serve as a useful target.

Predictive deep supervision on intermediate loops pushes lift over the copy baseline past $+1.0$, but a discriminability probe reveals the caveat: the high-lift readout collapses to $1.95/128$ effective rank with near-random KNN, so lift alone is gameable. VICReg at the readout resolves this collapse, delivering a $\sim\!20\times$ improvement in readout discriminability on a full-validation audit ($N = 5000$), peaking at step 28K and overfitting monotonically afterwards.

At $1.3$ M parameters, CodeWM matches $125.9$ M UnixCoder on CommitPackFT edit retrieval at $10\times$ lower CPU latency. Cross-seed runs show the peak is a single-seed optimum with $2\text{--}3\times$ KNN variance: the encoder is consistently repaired across seeds, but predictor effective rank ($5\text{--}6/128$) is the binding bottleneck.

0x04 Key contributions

01 · Signal compressed $12\times$ across loops.

Delta/state ratio drops $0.123 \to 0.010$ across six loops. Early loops carry transition information; the final readout is nearly isotropic at effective rank $5/128$ (see the probe sketch after this list).

02 · Direct delta prediction is a negative result.

Explicit $z_{t+1} - z_t$ prediction fails: consecutive code states are too similar in the target space to serve as a useful target.

03 · Deep supervision preserves intermediate signal.

Predictive auxiliary losses on loops 1 and 2 keep intermediate representations informative. Lift crosses $+1.0$ over the copy baseline.

04 · Lift alone is gameable.

A high-lift checkpoint collapsed the readout to $1.95/128$ effective rank with KNN@1 near random. Lift must be checked against rank and KNN.

05 · VICReg repairs the readout.

Full-validation audit ($N = 5000$): $\sim\!20\times$ improvement in readout discriminability, with $14\times$ on KNN@5 and $25\times$ on effective rank over the pre-VICReg baseline.

06 · $1.3$ M params matches $125.9$ M UnixCoder.

On paper-grade CommitPackFT edit retrieval ($1000 \times 5000$), competitive at ${\sim}100\times$ smaller and $10\times$ lower CPU latency.

07 · Cross-seed variance is real.

Peak KNN@5 $= 5.80\%$ is a single-seed optimum. Cross-seed runs trail by $2\text{--}3\times$. Honest reporting, not inflated reproducibility claims.

08 · Predictor is the bottleneck.

Encoder effective rank is healthy ($53\text{--}64/128$) everywhere. Predictor effective rank is collapsed at $5\text{--}6/128$ across all seeds.
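
The ratio and rank figures above come from lightweight probes over batches of validation embeddings. Below is a minimal NumPy sketch of the two probes, assuming per-loop state embeddings arrive as (N, 128) arrays; the function names and the participation-ratio estimator for effective rank are illustrative choices, not the paper's released code.

import numpy as np

def delta_state_ratio(z_t, z_t1):
    # Mean ||z_{t+1} - z_t|| / ||z_t||: how much of the state actually moves.
    # The drop from 0.123 to 0.010 across loops is the ~12x compression above,
    # and also why a raw z_{t+1} - z_t target leaves the predictor almost
    # nothing to hit (contribution 02).
    delta = np.linalg.norm(z_t1 - z_t, axis=1)
    state = np.linalg.norm(z_t, axis=1) + 1e-8
    return float(np.mean(delta / state))

def effective_rank(z):
    # Participation ratio of the covariance eigenvalues, (sum l)^2 / sum(l^2):
    # one common effective-rank estimator. It sits near 5/128 at the collapsed
    # readout versus the healthy 53-64/128 encoder ranks in the audit below.
    z = z - z.mean(axis=0, keepdims=True)
    eig = np.clip(np.linalg.eigvalsh(np.cov(z, rowvar=False)), 0.0, None)
    return float(eig.sum() ** 2 / (np.square(eig).sum() + 1e-12))

Contribution 04 is why both probes are reported: a checkpoint can post high lift while the readout's effective rank sits near 2, so rank and KNN are always checked alongside lift.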

0x05 Results

Full-Validation Audit — N = 5000 gallery

Run                      Steps  Seed  Eff. Rank (enc)  Eff. Rank (pred)  KNN@1   KNN@5   KNN@10  KNN@50
Random baseline          –      –     –                –                 0.02%   0.10%   0.20%   1.00%
Pre-VICReg baseline      5.5K   42    2.09             –                 0.08%   0.40%   0.96%   3.64%
vicreg_promotion (peak)  28K    42    53.3             6.1               1.42%   5.80%   9.90%   26.1%
vicreg_s43               28K    43    58.7             5.9               0.50%   2.20%   3.98%   15.4%
vicreg_fixed_s42         16K    42    59.9             5.2               0.82%   2.42%   4.64%   18.6%
vicreg_fixed_s43         6K     43    63.6             5.9               0.34%   1.48%   2.98%   9.98%

Peak (seed 42 pre-fix, step 28K) is a single-seed optimum. Cross-seed runs trail by $2\text{--}3\times$ on KNN@5. Encoder effective rank is healthy across all seeds; predictor effective rank is collapsed everywhere ($5\text{--}6/128$).
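
For reference, the KNN columns follow a straightforward cosine-retrieval audit. The sketch below assumes paired rows (query i's correct gallery entry is row i) and is not the released evaluation script; with a 5000-item gallery the random baseline is simply k/5000, which is where the 0.02% / 0.10% / 0.20% / 1.00% reference row comes from.

import numpy as np

def knn_audit(pred_next, gallery_next, ks=(1, 5, 10, 50)):
    # KNN@k over cosine similarity: fraction of queries whose true next state
    # lands in the top-k neighbours of the predicted embedding.
    p = pred_next / (np.linalg.norm(pred_next, axis=1, keepdims=True) + 1e-8)
    g = gallery_next / (np.linalg.norm(gallery_next, axis=1, keepdims=True) + 1e-8)
    order = np.argsort(-(p @ g.T), axis=1)         # best-first gallery ranking
    gold = np.arange(len(pred_next))[:, None]      # paired-row ground truth
    return {k: float((order[:, :k] == gold).any(axis=1).mean()) for k in ks}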

Foundation Model Head-to-Head — 1000 × 5000, step 28K

Model           Params   MRR (joint)  MRR (cos@0.9)  MRR (cos@0.95)  R@1 (joint)  R@1 (cos@0.95)  CPU ms
CodeWM (ours)   1.3M     0.790        0.742          0.662           0.668        0.486           8.4
UnixCoder-base  125.9M   0.760        0.736          0.671           0.609        0.490           83.7

Competitive at ${\sim}100\times$ smaller, $10\times$ faster. UnixCoder wins on the strictest action-cosine threshold (cos@0.95). Not uniformly dominant — an honest comparison.

Status: Cross-seed variance ($2\text{--}3\times$) and predictor collapse ($5\text{--}6/128$ effective rank) are identified bottlenecks. The peak result is single-seed; further regularization work is needed to close the gap.
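
The MRR and R@1 columns follow the usual retrieval definitions; a hedged sketch is below. How the joint and action-cosine variants (cos@0.9, cos@0.95) filter or score candidates is not shown here and is left to the paper; the helper name is illustrative.

import numpy as np

def mrr_and_r1(query_emb, gallery_emb, gold_idx):
    # Mean reciprocal rank and recall@1 under cosine similarity.
    # gold_idx[i] is the gallery row holding query i's true edited state.
    q = query_emb / (np.linalg.norm(query_emb, axis=1, keepdims=True) + 1e-8)
    g = gallery_emb / (np.linalg.norm(gallery_emb, axis=1, keepdims=True) + 1e-8)
    order = np.argsort(-(q @ g.T), axis=1)                  # best-first ranking
    ranks = np.argmax(order == gold_idx[:, None], axis=1) + 1
    return float(np.mean(1.0 / ranks)), float(np.mean(ranks == 1))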

0x06 Recipe

WM_ENCODER_LOOPS=3
WM_AUX_LOOPS=1,2
WM_LAMBDA_AUX=0.3
WM_AUX_TYPE=pred
WM_REG_MODE=vicreg
WM_SIGREG_WEIGHT=0.1
WM_EMA_DECAY=0.99999

Three-loop encoder, predictive auxiliary losses on loops 1 and 2, VICReg at the readout, and a near-static EMA target. Full hyperparameter table in the paper appendix.
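
As a concrete reading of the recipe, the PyTorch sketch below shows one way the pieces compose into a single training loss. The module boundaries, the smooth-L1 choice for the prediction losses, and the internal VICReg term weights are assumptions made for illustration; only the env-var values above are the paper's.

import torch
import torch.nn.functional as F

def vicreg_var_cov(z):
    # Variance and covariance terms of VICReg applied to the (B, D) readout;
    # the invariance term is carried by the prediction loss below.
    std = torch.sqrt(z.var(dim=0) + 1e-4)
    var_term = torch.relu(1.0 - std).mean()            # keep per-dim std from collapsing
    zc = z - z.mean(dim=0)
    cov = (zc.T @ zc) / (z.shape[0] - 1)
    off_diag = cov - torch.diag(torch.diag(cov))
    cov_term = (off_diag ** 2).sum() / z.shape[1]      # decorrelate dimensions
    return var_term + cov_term

def training_loss(aux_preds, readout, target_next, lambda_aux=0.3, sigreg_w=0.1):
    # aux_preds: predictions made from intermediate loops 1 and 2
    # (WM_AUX_LOOPS=1,2, WM_AUX_TYPE=pred); readout: final-loop prediction;
    # target_next: next-state embedding from the near-static EMA target encoder.
    main = F.smooth_l1_loss(readout, target_next)
    aux = sum(F.smooth_l1_loss(a, target_next) for a in aux_preds)   # WM_LAMBDA_AUX
    reg = vicreg_var_cov(readout)                                    # WM_REG_MODE=vicreg
    return main + lambda_aux * aux + sigreg_w * reg                  # WM_SIGREG_WEIGHT

@torch.no_grad()
def ema_update(target_encoder, online_encoder, decay=0.99999):
    # WM_EMA_DECAY: the near-static target from Paper 1.
    for t, o in zip(target_encoder.parameters(), online_encoder.parameters()):
        t.mul_(decay).add_(o, alpha=1.0 - decay)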

0x07 Code & reproducibility

Three companion repos cover inference, training, and architecture search. All training runs for the paper are logged live to a public Weights & Biases workspace.

W&B · Training runs, phases 1–8 · eren23 / crucible-code-wm · stabilization · deep supervision · contrastive sweeps
W&B · Training runs, phase 9 · eren23 / crucible-wm-phase9 · VICReg promotion · cross-seed audits

Paper source (LaTeX) and the Manim video source are in the codewm2-paper-public repo alongside this page.

0x08 Try it — transition prediction

The world model predicts what code looks like after an edit, without seeing the result. Enter before & after code, pick an action type — the predictor outputs a next-state embedding. We compare it to the actual after-state encoded by the target encoder.

SRC · before & after edit
ACT · action vector (edit descriptor)
Scroll to load the WASM model for live inference.
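
For readers without the live demo, the flow it implements looks roughly like the Python sketch below; the demo itself runs a Rust/WASM build, and the module and function names here are placeholders rather than the released API.

import torch
import torch.nn.functional as F

@torch.no_grad()
def predict_transition(encoder, target_encoder, predictor, before, after, action_vec):
    # encoder / target_encoder / predictor stand in for the tiny JEPA modules;
    # before / after are tokenised code states, action_vec is the edit descriptor.
    z_before = encoder(before)                           # current-state embedding
    z_pred = predictor(z_before, action_vec)             # predicted next-state embedding
    z_after = target_encoder(after)                      # what the edit actually produced
    cos = F.cosine_similarity(z_pred, z_after, dim=-1)   # prediction vs. reality
    return z_pred, cos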

0x09 Cite

@misc{akbulut2026readoutrepair,
  title  = {From Early-Loop Signal to Readout Repair:
            Deep Supervision and VICReg
            in Tiny JEPA Code Models},
  author = {Akbulut, Eren},
  year   = {2026},
  note   = {Preprint.
            \url{https://eren23.github.io/codewm2-paper-public/}},
}