From Early-Loop Signal to Readout Repair: Deep Supervision and VICReg in Tiny JEPA Code Models
The stabilized tiny JEPA encoder compresses transition signal $12\times$ across loops, and direct delta prediction is a genuine negative result. Deep supervision on intermediate loops preserves the signal, but high lift can mask a collapsed readout. VICReg at the readout delivers a $\sim\!20\times$ discriminability improvement and lets a $1.3$ M-parameter model match $125.9$ M UnixCoder on CommitPackFT edit retrieval at $10\times$ lower CPU latency.
0x01 Video companion
A ten-scene Manim walk-through covering the signal geometry, deep supervision, the VICReg readout repair, and the foundation-model head-to-head. This is a curated Paper 1 → Paper 2 narrative. Embedded at 480p15 for quick viewing; download the HD build for presentations.
0x02 Paper 1 foundation
This paper builds on the near-static target encoder fix from Paper 1. That work identified a ${\sim}700$-step training ceiling caused by target-encoder collapse and showed that holding the target approximately static enables stable 15K-step training at 1.1 M parameters.
P1 · Diagnosing and Fixing Encoder Collapse in Tiny JEPA Code Transition Models · Akbulut, 2026 · Copy-baseline diagnostic, near-static EMA fix, 0.6 MB Rust/WASM deployment ↗
0x03 Abstract
Stabilized tiny JEPA code models prevent training collapse, but leave open where transition information lives inside the model. Geometry probes show that the six-loop encoder compresses transition signal by roughly $12\times$ across loops into an effective rank of $5/128$, and direct delta prediction remains a genuine negative result: consecutive code states are too similar to serve as a useful target.
Predictive deep supervision on intermediate loops raises lift over a copy baseline by more than $+1.0$, but a discriminability probe reveals the caveat: the high-lift readout collapses to $1.95/128$ effective rank with near-random KNN, so lift alone is gameable. VICReg at the readout resolves this collapse, delivering $\sim\!20\times$ improvement in readout discriminability on a full-validation audit ($N = 5000$), peaking at step 28K before monotonic overfit.
At $1.3$ M parameters, CodeWM matches $125.9$ M UnixCoder on CommitPackFT edit retrieval at $10\times$ lower CPU latency. Cross-seed runs show the peak is a single-seed optimum with $2\text{--}3\times$ KNN variance: the encoder is consistently repaired across seeds, but predictor effective rank ($5\text{--}6/128$) is the binding bottleneck.
0x04 Key contributions
- 01 · Signal compressed $12\times$ across loops. Delta/state ratio drops $0.123 \to 0.010$ across six loops. Early loops carry transition information; the final readout is nearly isotropic at effective rank $5/128$.
- 02 · Direct delta prediction is a negative result. Explicit $z_{t+1} - z_t$ prediction fails: consecutive code states are too similar in the target space to serve as a useful target.
- 03 · Deep supervision preserves intermediate signal. Predictive auxiliary losses on loops 1 and 2 keep intermediate representations informative. Lift crosses $+1.0$ over the copy baseline.
- 04 · Lift alone is gameable. A high-lift checkpoint collapsed the readout to $1.95/128$ effective rank with KNN@1 near random. Lift must be checked against rank and KNN; a minimal probe sketch follows this list.
- 05 · VICReg repairs the readout. Full-validation audit ($N = 5000$): $\sim\!20\times$ readout discriminability improvement, $14\times$ KNN@5, and $25\times$ effective rank over the pre-VICReg baseline.
- 06 · $1.3$ M params matches $125.9$ M UnixCoder. On paper-grade CommitPackFT edit retrieval ($1000 \times 5000$), competitive at ${\sim}100\times$ smaller and $10\times$ lower CPU latency.
- 07 · Cross-seed variance is real. Peak KNN@5 $= 5.80\%$ is a single-seed optimum; cross-seed runs trail by $2\text{--}3\times$. Honest reporting, not inflated reproducibility claims.
- 08 · Predictor is the bottleneck. Encoder effective rank is healthy ($53\text{--}64/128$) everywhere; predictor effective rank is collapsed at $5\text{--}6/128$ across all seeds.
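The three probes behind these findings (delta/state ratio, effective rank, KNN discriminability) are cheap to rerun from cached embeddings. A minimal NumPy sketch, assuming embedding matrices of shape $(N, 128)$; the entropy-based effective-rank definition and the function names are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def effective_rank(Z: np.ndarray) -> float:
    """Effective rank of an (N, D) embedding matrix via the entropy of its
    normalized singular-value spectrum; near D means a well-spread readout,
    a handful out of 128 means a collapsed one."""
    Zc = Z - Z.mean(axis=0, keepdims=True)
    s = np.linalg.svd(Zc, compute_uv=False)
    p = (s ** 2) / (s ** 2).sum()
    return float(np.exp(-(p * np.log(p + 1e-12)).sum()))

def delta_state_ratio(z_t: np.ndarray, z_next: np.ndarray) -> float:
    """Mean transition-delta norm relative to mean state norm (the quantity
    reported as dropping 0.123 -> 0.010 across loops)."""
    num = np.linalg.norm(z_next - z_t, axis=1).mean()
    den = np.linalg.norm(z_t, axis=1).mean()
    return float(num / den)

def knn_at_k(pred: np.ndarray, target: np.ndarray, k: int = 5) -> float:
    """KNN@k: fraction of predictions whose true target lands in the top-k
    cosine neighbours over the full gallery (N = 5000 in the audit)."""
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    target = target / np.linalg.norm(target, axis=1, keepdims=True)
    sims = pred @ target.T
    topk = np.argsort(-sims, axis=1)[:, :k]
    return float((topk == np.arange(len(pred))[:, None]).any(axis=1).mean())
```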
0x05 Results
Full-Validation Audit — N = 5000 gallery
| Run | Steps | Seed | Eff. Rank (enc) | Eff. Rank (pred) | KNN@1 | KNN@5 | KNN@10 | KNN@50 |
|---|---|---|---|---|---|---|---|---|
| Random baseline | — | — | — | — | 0.02% | 0.10% | 0.20% | 1.00% |
| Pre-VICReg baseline | 5.5K | 42 | 2.09 | — | 0.08% | 0.40% | 0.96% | 3.64% |
| vicreg_promotion (peak) | 28K | 42 | 53.3 | 6.1 | 1.42% | 5.80% | 9.90% | 26.1% |
| vicreg_s43 | 28K | 43 | 58.7 | 5.9 | 0.50% | 2.20% | 3.98% | 15.4% |
| vicreg_fixed_s42 | 16K | 42 | 59.9 | 5.2 | 0.82% | 2.42% | 4.64% | 18.6% |
| vicreg_fixed_s43 | 6K | 43 | 63.6 | 5.9 | 0.34% | 1.48% | 2.98% | 9.98% |
Peak (seed 42 pre-fix, step 28K) is a single-seed optimum. Cross-seed runs trail by $2\text{--}3\times$ on KNN@5. Encoder effective rank is healthy across all seeds; predictor effective rank is collapsed everywhere ($5\text{--}6/128$).
Foundation Model Head-to-Head — 1000 × 5000, step 28K
| Model | Params | MRR (joint) | MRR (cos@0.9) | MRR (cos@0.95) | R@1 (joint) | R@1 (cos@0.95) | CPU ms |
|---|---|---|---|---|---|---|---|
| CodeWM (ours) | 1.3M | 0.790 | 0.742 | 0.662 | 0.668 | 0.486 | 8.4 |
| UnixCoder-base | 125.9M | 0.760 | 0.736 | 0.671 | 0.609 | 0.490 | 83.7 |
Competitive at ${\sim}100\times$ smaller, $10\times$ faster. UnixCoder wins on the strictest action-cosine threshold (cos@0.95). Not uniformly dominant — an honest comparison.
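For orientation, the MRR and R@1 columns reduce to a few lines over a query-by-gallery similarity matrix. A minimal sketch, assuming query $i$'s true match sits at gallery column $i$ ($1000 \times 5000$ in the paper); the joint vs. action-cosine-threshold protocol split is not reproduced here.

```python
import numpy as np

def mrr_and_recall_at_1(sims: np.ndarray) -> tuple[float, float]:
    """MRR and R@1 for a (Q, G) similarity matrix whose ground-truth gallery
    index for query i is column i."""
    order = np.argsort(-sims, axis=1)                  # best match first
    truth = np.arange(sims.shape[0])[:, None]
    ranks = np.argmax(order == truth, axis=1) + 1      # 1-based rank of the true match
    return float((1.0 / ranks).mean()), float((ranks == 1).mean())
```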
Status: Cross-seed variance ($2\text{--}3\times$) and predictor collapse ($5\text{--}6/128$ effective rank) are identified bottlenecks. The peak result is single-seed; further regularization work is needed to close the gap.
0x06 Recipe
WM_ENCODER_LOOPS=3
WM_AUX_LOOPS=1,2
WM_LAMBDA_AUX=0.3
WM_AUX_TYPE=pred
WM_REG_MODE=vicreg
WM_SIGREG_WEIGHT=0.1
WM_EMA_DECAY=0.99999
Three-loop encoder, predictive auxiliary losses on loops 1 and 2, VICReg at the readout, and a near-static EMA target. Full hyperparameter table in the paper appendix.
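The two regularizers named in the recipe are standard components. A minimal PyTorch sketch, assuming WM_SIGREG_WEIGHT scales a standard VICReg term between predicted and target readout embeddings and WM_EMA_DECAY drives the near-static target update; the attachment point, weights, and module names are illustrative assumptions, not CodeWM's exact code.

```python
import torch
import torch.nn.functional as F

def vicreg_loss(z_a: torch.Tensor, z_b: torch.Tensor,
                sim_w: float = 25.0, var_w: float = 25.0, cov_w: float = 1.0) -> torch.Tensor:
    """Standard VICReg terms on two (N, D) embedding batches: invariance (MSE),
    variance (keep each dimension's std near 1), covariance (decorrelate dims).
    The weights are the VICReg paper's defaults, not CodeWM's."""
    n, d = z_a.shape
    inv = F.mse_loss(z_a, z_b)

    std_a = torch.sqrt(z_a.var(dim=0) + 1e-4)
    std_b = torch.sqrt(z_b.var(dim=0) + 1e-4)
    var = torch.relu(1.0 - std_a).mean() + torch.relu(1.0 - std_b).mean()

    za = z_a - z_a.mean(dim=0)
    zb = z_b - z_b.mean(dim=0)
    cov_a = (za.T @ za) / (n - 1)
    cov_b = (zb.T @ zb) / (n - 1)
    off_diag = lambda m: m - torch.diag(torch.diag(m))
    cov = off_diag(cov_a).pow(2).sum() / d + off_diag(cov_b).pow(2).sum() / d

    return sim_w * inv + var_w * var + cov_w * cov

@torch.no_grad()
def ema_update(target_encoder, online_encoder, decay: float = 0.99999) -> None:
    """Near-static EMA target update (WM_EMA_DECAY=0.99999 in the recipe)."""
    for p_t, p_o in zip(target_encoder.parameters(), online_encoder.parameters()):
        p_t.mul_(decay).add_(p_o, alpha=1.0 - decay)
```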
0x07 Code & reproducibility
Three companion repos cover inference, training, and architecture search. All training runs for the paper are logged live to a public Weights & Biases workspace.
- inference · Synapse · Zero-dependency inference engine in Rust, with a Zig WASM shim. Ships the quantised 0.6 MB predictor. github.com/eren23/synapse ↗
- training · Crucible · Training orchestration platform: provisions rental GPUs, dispatches declarative sweeps, streams W&B logs, manages checkpoints. github.com/eren23/crucible ↗
- architecture · Crucible Community Tap · Plug-in architecture registry used for the encoder / predictor variations in the paper's sweeps. github.com/eren23/crucible-community-tap ↗
Paper source (LaTeX) and the Manim video source are in the codewm2-paper-public repo alongside this page.
0x08 Try it — transition prediction
The world model predicts what code looks like after an edit, without seeing the result. Enter before & after code and pick an action type; the predictor sees only the before state and the action, and outputs a next-state embedding, which we compare to the actual after-state encoded by the target encoder.
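In code terms, the demo reduces to three module calls. A minimal sketch, assuming encoder, predictor, target_encoder, and an action-embedding table as separate modules; the names and signatures are illustrative, not Synapse's API.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def predict_transition(encoder, predictor, target_encoder, action_emb,
                       before_tokens, after_tokens, action_id):
    """Predict the after-state embedding from the before state and an action type,
    then score it against the target-encoded actual after state."""
    z_t = encoder(before_tokens)                         # online encoding of the "before" code
    a = action_emb(torch.tensor([action_id]))            # action-type embedding
    z_pred = predictor(torch.cat([z_t, a], dim=-1))      # predicted next-state embedding
    z_next = target_encoder(after_tokens)                # ground truth; never shown to the predictor
    score = F.cosine_similarity(z_pred, z_next, dim=-1)  # how close the prediction landed
    return z_pred, score
```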
0x09 Cite
@misc{akbulut2026readoutrepair,
title = {From Early-Loop Signal to Readout Repair:
Deep Supervision and VICReg
in Tiny JEPA Code Models},
author = {Akbulut, Eren},
year = {2026},
note = {Preprint.
\url{https://eren23.github.io/codewm2-paper-public/}},
}