From Early-Loop Signal to Readout Repair: Deep Supervision and VICReg in Tiny JEPA Code Models
The stabilized tiny JEPA encoder compresses transition signal $12\times$ across loops, and direct delta prediction is a genuine negative result. Deep supervision on intermediate loops preserves the signal, but high lift can mask a collapsed readout. VICReg at the readout delivers a $\sim\!20\times$ discriminability improvement and lets a $1.3$ M-parameter model match $125.9$ M UnixCoder on CommitPackFT edit retrieval at $10\times$ lower CPU latency.
0x01 Video companion
A ten-scene Manim walk-through covering the signal geometry, deep supervision, the VICReg readout repair, and the foundation-model head-to-head. This is a curated Paper 1 → Paper 2 narrative. Embedded at 480p15 for quick viewing; download the HD build for presentations.
0x02 Paper 1 foundation
This paper builds on the near-static target encoder fix from Paper 1. That work identified a ${\sim}700$-step training ceiling caused by target-encoder collapse and showed that holding the target approximately static enables stable 15K-step training at 1.1 M parameters.
P1 · Diagnosing and Fixing Encoder Collapse in Tiny JEPA Code Transition Models · Akbulut, 2026 · Copy-baseline diagnostic, near-static EMA fix, 0.6 MB Rust/WASM deployment ↗
0x03 Abstract
Stabilized tiny JEPA code models prevent training collapse, but leave open where transition information lives inside the model. Geometry probes show that the six-loop encoder compresses transition signal by roughly $12\times$ across loops into an effective rank of $5/128$, and direct delta prediction remains a genuine negative result: consecutive code states are too similar to serve as a useful target.
Predictive deep supervision on intermediate loops raises lift over a copy baseline by more than $+1.0$, but a discriminability probe reveals the caveat: the high-lift readout collapses to $1.95/128$ effective rank with near-random KNN, so lift alone is gameable. VICReg at the readout resolves this collapse, delivering $\sim\!20\times$ improvement in readout discriminability on a full-validation audit ($N = 5000$), peaking at step 28K before monotonic overfit.
At $1.3$ M parameters, CodeWM matches $125.9$ M UnixCoder on CommitPackFT edit retrieval at $10\times$ lower CPU latency. Cross-seed runs show the peak is a single-seed optimum with $2\text{--}3\times$ KNN variance: the encoder is consistently repaired across seeds, but predictor effective rank ($5\text{--}6/128$) is the binding bottleneck.
0x04 Key contributions
- 01 · Signal compressed $12\times$ across loops. Delta/state ratio drops $0.123 \to 0.010$ across six loops. Early loops carry transition information; the final readout is nearly isotropic at effective rank $5/128$.
- 02 · Direct delta prediction is a negative result. Explicit $z_{t+1} - z_t$ prediction fails: consecutive code states are too similar in the target space to serve as a useful target.
- 03 · Deep supervision preserves intermediate signal. Predictive auxiliary losses on loops 1 and 2 keep intermediate representations informative. Lift crosses $+1.0$ over the copy baseline.
- 04 · Lift alone is gameable. A high-lift checkpoint collapsed the readout to $1.95/128$ effective rank with KNN@1 near random. Lift must be checked against rank and KNN; a minimal probe sketch follows this list.
- 05 · VICReg repairs the readout. Full-validation audit ($N = 5000$): $\sim\!20\times$ readout discriminability improvement, $14\times$ KNN@5, and $25\times$ effective rank over the pre-VICReg baseline.
- 06 · $1.3$ M params matches $125.9$ M UnixCoder. On paper-grade CommitPackFT edit retrieval ($1000 \times 5000$), competitive at ${\sim}100\times$ smaller and $10\times$ lower CPU latency.
- 07 · Cross-seed variance is real. Peak KNN@5 $= 5.80\%$ is a single-seed optimum; cross-seed runs trail by $2\text{--}3\times$. Honest reporting, not inflated reproducibility claims.
- 08 · Predictor is the bottleneck. Encoder effective rank is healthy ($53\text{--}64/128$) everywhere; predictor effective rank is collapsed at $5\text{--}6/128$ across all seeds.
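The three probes behind these findings (delta/state ratio, effective rank, KNN discriminability) are cheap to rerun from cached embeddings. A minimal NumPy sketch, assuming embedding matrices of shape $(N, 128)$; the entropy-based effective-rank definition and the function names are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def effective_rank(Z: np.ndarray) -> float:
    """Effective rank of an (N, D) embedding matrix via the entropy of its
    normalized singular-value spectrum; near D means a well-spread readout,
    a handful out of 128 means a collapsed one."""
    Zc = Z - Z.mean(axis=0, keepdims=True)
    s = np.linalg.svd(Zc, compute_uv=False)
    p = (s ** 2) / (s ** 2).sum()
    return float(np.exp(-(p * np.log(p + 1e-12)).sum()))

def delta_state_ratio(z_t: np.ndarray, z_next: np.ndarray) -> float:
    """Mean transition-delta norm relative to mean state norm (the quantity
    reported as dropping 0.123 -> 0.010 across loops)."""
    num = np.linalg.norm(z_next - z_t, axis=1).mean()
    den = np.linalg.norm(z_t, axis=1).mean()
    return float(num / den)

def knn_at_k(pred: np.ndarray, target: np.ndarray, k: int = 5) -> float:
    """KNN@k: fraction of predictions whose true target lands in the top-k
    cosine neighbours over the full gallery (N = 5000 in the audit)."""
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    target = target / np.linalg.norm(target, axis=1, keepdims=True)
    sims = pred @ target.T
    topk = np.argsort(-sims, axis=1)[:, :k]
    return float((topk == np.arange(len(pred))[:, None]).any(axis=1).mean())
```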
0x05 Results
Full-Validation Audit — N = 5000 gallery
| Run | Steps | Seed | Eff. Rank (enc) | Eff. Rank (pred) | KNN@1 | KNN@5 | KNN@10 | KNN@50 |
|---|---|---|---|---|---|---|---|---|
| Random baseline | — | — | — | — | 0.02% | 0.10% | 0.20% | 1.00% |
| Pre-VICReg baseline | 5.5K | 42 | 2.09 | — | 0.08% | 0.40% | 0.96% | 3.64% |
| vicreg_promotion (peak) | 28K | 42 | 53.3 | 6.1 | 1.42% | 5.80% | 9.90% | 26.1% |
| vicreg_s43 | 28K | 43 | 58.7 | 5.9 | 0.50% | 2.20% | 3.98% | 15.4% |
| vicreg_fixed_s42 | 16K | 42 | 59.9 | 5.2 | 0.82% | 2.42% | 4.64% | 18.6% |
| vicreg_fixed_s43 | 6K | 43 | 63.6 | 5.9 | 0.34% | 1.48% | 2.98% | 9.98% |
Peak (seed 42 pre-fix, step 28K) is a single-seed optimum. Cross-seed runs trail by $2\text{--}3\times$ on KNN@5. Encoder effective rank is healthy across all seeds; predictor effective rank is collapsed everywhere ($5\text{--}6/128$).
Foundation Model Head-to-Head — 1000 × 5000, step 28K
| Model | Params | MRR (joint) | MRR (cos@0.9) | MRR (cos@0.95) | R@1 (joint) | R@1 (cos@0.95) | CPU ms |
|---|---|---|---|---|---|---|---|
| CodeWM (ours) | 1.3M | 0.790 | 0.742 | 0.662 | 0.668 | 0.486 | 8.4 |
| UnixCoder-base | 125.9M | 0.760 | 0.736 | 0.671 | 0.609 | 0.490 | 83.7 |
Competitive at ${\sim}100\times$ smaller, $10\times$ faster. UnixCoder wins on the strictest action-cosine threshold (cos@0.95). Not uniformly dominant — an honest comparison.
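For orientation, the MRR and R@1 columns reduce to a few lines over a query-by-gallery similarity matrix. A minimal sketch, assuming query $i$'s true match sits at gallery column $i$ ($1000 \times 5000$ in the paper); the joint vs. action-cosine-threshold protocol split is not reproduced here.

```python
import numpy as np

def mrr_and_recall_at_1(sims: np.ndarray) -> tuple[float, float]:
    """MRR and R@1 for a (Q, G) similarity matrix whose ground-truth gallery
    index for query i is column i."""
    order = np.argsort(-sims, axis=1)                  # best match first
    truth = np.arange(sims.shape[0])[:, None]
    ranks = np.argmax(order == truth, axis=1) + 1      # 1-based rank of the true match
    return float((1.0 / ranks).mean()), float((ranks == 1).mean())
```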
Status: Cross-seed variance ($2\text{--}3\times$) and predictor collapse ($5\text{--}6/128$ effective rank) are identified bottlenecks. The peak result is single-seed; further regularization work is needed to close the gap.
0x06 Recipe
WM_ENCODER_LOOPS=3
WM_AUX_LOOPS=1,2
WM_LAMBDA_AUX=0.3
WM_AUX_TYPE=pred
WM_REG_MODE=vicreg
WM_SIGREG_WEIGHT=0.1
WM_EMA_DECAY=0.99999
Three-loop encoder, predictive auxiliary losses on loops 1 and 2, VICReg at the readout, and a near-static EMA target. Full hyperparameter table in the paper appendix.
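The two regularizers named in the recipe are standard components. A minimal PyTorch sketch, assuming WM_SIGREG_WEIGHT scales a standard VICReg term between predicted and target readout embeddings and WM_EMA_DECAY drives the near-static target update; the attachment point, weights, and module names are illustrative assumptions, not CodeWM's exact code.

```python
import torch
import torch.nn.functional as F

def vicreg_loss(z_a: torch.Tensor, z_b: torch.Tensor,
                sim_w: float = 25.0, var_w: float = 25.0, cov_w: float = 1.0) -> torch.Tensor:
    """Standard VICReg terms on two (N, D) embedding batches: invariance (MSE),
    variance (keep each dimension's std near 1), covariance (decorrelate dims).
    The weights are the VICReg paper's defaults, not CodeWM's."""
    n, d = z_a.shape
    inv = F.mse_loss(z_a, z_b)

    std_a = torch.sqrt(z_a.var(dim=0) + 1e-4)
    std_b = torch.sqrt(z_b.var(dim=0) + 1e-4)
    var = torch.relu(1.0 - std_a).mean() + torch.relu(1.0 - std_b).mean()

    za = z_a - z_a.mean(dim=0)
    zb = z_b - z_b.mean(dim=0)
    cov_a = (za.T @ za) / (n - 1)
    cov_b = (zb.T @ zb) / (n - 1)
    off_diag = lambda m: m - torch.diag(torch.diag(m))
    cov = off_diag(cov_a).pow(2).sum() / d + off_diag(cov_b).pow(2).sum() / d

    return sim_w * inv + var_w * var + cov_w * cov

@torch.no_grad()
def ema_update(target_encoder, online_encoder, decay: float = 0.99999) -> None:
    """Near-static EMA target update (WM_EMA_DECAY=0.99999 in the recipe)."""
    for p_t, p_o in zip(target_encoder.parameters(), online_encoder.parameters()):
        p_t.mul_(decay).add_(p_o, alpha=1.0 - decay)
```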
0x07 Code & reproducibility
Three companion repos cover inference, training, and architecture search. All training runs for the paper are logged live to a public Weights & Biases workspace.
- inference · Synapse · Zero-dependency inference engine in Rust, with a Zig WASM shim. Ships the quantised 0.6 MB predictor. github.com/eren23/synapse ↗
- training · Crucible · Training orchestration platform: provisions rental GPUs, dispatches declarative sweeps, streams W&B logs, manages checkpoints. github.com/eren23/crucible ↗
- architecture · Crucible Community Tap · Plug-in architecture registry used for the encoder / predictor variations in the paper's sweeps. github.com/eren23/crucible-community-tap ↗
Paper source (LaTeX) and the Manim video source are in the codewm2-paper-public repo alongside this page.
0x08 Try it — transition prediction
The world model predicts what code looks like after an edit, without seeing the result. Enter before & after code and pick an action type; the predictor sees only the before state and the action, and outputs a next-state embedding, which we compare to the actual after-state encoded by the target encoder.
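In code terms, the demo reduces to three module calls. A minimal sketch, assuming encoder, predictor, target_encoder, and an action-embedding table as separate modules; the names and signatures are illustrative, not Synapse's API.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def predict_transition(encoder, predictor, target_encoder, action_emb,
                       before_tokens, after_tokens, action_id):
    """Predict the after-state embedding from the before state and an action type,
    then score it against the target-encoded actual after state."""
    z_t = encoder(before_tokens)                         # online encoding of the "before" code
    a = action_emb(torch.tensor([action_id]))            # action-type embedding
    z_pred = predictor(torch.cat([z_t, a], dim=-1))      # predicted next-state embedding
    z_next = target_encoder(after_tokens)                # ground truth; never shown to the predictor
    score = F.cosine_similarity(z_pred, z_next, dim=-1)  # how close the prediction landed
    return z_pred, score
```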
0x09 Cite
@misc{akbulut2026readoutrepair,
title = {From Early-Loop Signal to Readout Repair:
Deep Supervision and VICReg
in Tiny JEPA Code Models},
author = {Akbulut, Eren},
year = {2026},
note = {Preprint.
\url{https://eren23.github.io/codewm2-paper-public/}},
}