Diagnosing and Fixing Encoder Collapse in Tiny JEPA Code Transition Models
A ${\sim}700$-step training ceiling in tiny JEPA code models is traced to target-encoder collapse under fast EMA tracking. A cheap copy-baseline indicator flags it early; holding the target near-static fixes it; and the resulting $1.1$ M-parameter model ships as a 0.6 MB Rust binary or a 2.6 MB in-browser WASM bundle.
0x01 Video companion
A seven-minute Manim walk-through of the architecture, the collapse diagnosis, the fix, and the WASM shipping story. Embedded at 480p15 for quick viewing; download the HD build for presentations.
0x02 Abstract
Tiny JEPA-style code transition models trained on chains of Git edits exhibit a mysterious ${\sim}700$-step training ceiling. We identify the failure mode — target encoder collapse under fast EMA tracking — and diagnose it with a cheap copy-baseline indicator, $\cos\!\bigl(f_{\bar\theta}(s_t),\,f_{\bar\theta}(s_{t+1})\bigr)$, that climbs to ${\sim}0.999$ well before standard loss curves degrade.
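For concreteness, the indicator is just a mean cosine over consecutive target embeddings. A minimal sketch, assuming a PyTorch-style target encoder; the names are illustrative rather than taken from the paper's training code:

import torch
import torch.nn.functional as F

@torch.no_grad()
def copy_baseline_cosine(target_encoder, s_t, s_t1):
    """Mean cosine between target embeddings of consecutive states.

    A value drifting toward ~1.0 (e.g. ~0.999) means the target encoder
    maps s_t and s_{t+1} to nearly the same point, so a predictor can
    'win' by copying its input: the collapse signature described above.
    """
    z_t  = target_encoder(s_t)    # (B, D) embedding of the state at step t
    z_t1 = target_encoder(s_t1)   # (B, D) embedding of the state at step t+1
    return F.cosine_similarity(z_t, z_t1, dim=-1).mean().item()

Logged every few hundred steps alongside the loss, this single scalar climbs well before the loss curves show anything.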
The fix is to hold the target encoder approximately static; we verify this with two equivalent settings — near-frozen EMA decay $0.99999$ (default, three seeds) and fully frozen decay $1.0$ (a randomly initialised target that never updates, two seeds) — that agree on validation transition cosine to within $\pm 0.002$, so target staticity is the lever, not a specific EMA decay value. With the fix in place, a $1.1$ M-parameter $128$-dimensional model trains stably for $15{,}000$ steps to a mean-of-last-five validation transition cosine of $0.9898 \pm 0.0006$ and performs compositional three-step prediction with per-step delta cosines ${\approx}\,0.98$, more than $0.9$ above every delta-space null we evaluate; the same property transfers to edit chains from nine held-out Python repositories.
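The near-frozen vs. fully frozen comparison comes down to the decay used in the target update. A minimal sketch, assuming a standard parameter-wise EMA:

import torch

@torch.no_grad()
def update_target(online, target, decay):
    """EMA update of the target encoder from the online encoder.

    decay = 0.99999 -> near-frozen target (the default setting above)
    decay = 1.0     -> fully frozen target (random init, never updated)
    Fast tracking (small decay) is the regime that collapses here.
    """
    for p_o, p_t in zip(online.parameters(), target.parameters()):
        p_t.mul_(decay).add_(p_o, alpha=1.0 - decay)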
A secondary retrieval experiment shows the model is competitive at $112\times$ fewer parameters on in-distribution CommitPackFT edit retrieval and on a $3{,}998$-pair twenty-repo leave-one-repo-out sweep, while losing cleanly to modern dense encoders out-of-distribution. A zero-dependency Rust inference engine with four-level weight quantisation compresses the deployed model to a $0.6$ MB binary and a $2.6$ MB in-browser WebAssembly bundle.
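The shipped quantiser lives in the Synapse repo; the snippet below is only an illustration of one way a symmetric four-level (2-bit) per-tensor scheme can work, not the engine's actual format:

import numpy as np

def quantise_4level(w: np.ndarray):
    """Illustrative 4-level (2-bit) symmetric quantisation of a weight tensor.

    Each weight is snapped to one of four evenly spaced levels and stored as
    a 2-bit code plus one float scale per tensor; dequantisation is
    (code - 1.5) * scale.  Four weights pack into a single byte.
    """
    scale = np.abs(w).max() / 1.5 + 1e-12                     # levels at (+/-0.5, +/-1.5) * scale
    codes = np.clip(np.round(w / scale + 1.5), 0, 3).astype(np.uint8)
    dequant = (codes.astype(np.float32) - 1.5) * scale        # what inference sees
    return codes, scale, dequant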
0x03 Key contributions
- 01: A cheap diagnostic for target-encoder collapse. A single cosine between consecutive target embeddings spots the failure well before any standard loss curve moves.
- 02: Staticity is the lever, not the EMA decay value. Near-frozen ($0.99999$) and fully frozen ($1.0$) targets agree to within $\pm 0.002$ on validation transition cosine across five seeds.
- 03: Stable 3-step compositional prediction at 1.1 M params. Per-step delta cosines ${\approx}\,0.98$, holding on nine held-out Python repositories; see the evaluation sketch after this list.
- 04: 0.6 MB Rust / 2.6 MB WASM deployment. Four-level weight quantisation, zero-dependency inference, in-browser runtime.
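For contribution 03, the evaluation amounts to rolling the predictor forward in embedding space and scoring each step against the ground-truth embedding delta. A hedged sketch, assuming an encoder/predictor pair with the signatures shown (not the paper's actual evaluation code):

import torch
import torch.nn.functional as F

@torch.no_grad()
def per_step_delta_cosines(encoder, predictor, states):
    """Roll the predictor over a chain of states s_0..s_K and score each
    step by the cosine between predicted and true embedding deltas.

    Illustrative only: assumes `predictor` maps an embedding to the next
    embedding; the paper reports per-step delta cosines of ~0.98 over
    three-step rollouts.
    """
    z = encoder(states[0]).unsqueeze(0)            # rolled-out embedding
    cosines = []
    for k in range(1, len(states)):
        z_prev_true = encoder(states[k - 1]).unsqueeze(0)
        z_true = encoder(states[k]).unsqueeze(0)
        z_next = predictor(z)                      # predict next embedding from the rollout
        pred_delta = z_next - z                    # predicted step in embedding space
        true_delta = z_true - z_prev_true          # ground-truth step
        cosines.append(F.cosine_similarity(pred_delta, true_delta, dim=-1).item())
        z = z_next                                 # feed the prediction back in (compositional)
    return cosines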
0x04 Code & reproducibility
Three companion repos cover inference, training, and architecture search. All training runs for the paper are logged live to a public Weights & Biases workspace.
- inference: Synapse. Zero-dependency inference engine in Rust, with a Zig WASM shim. Ships the quantised 0.6 MB predictor. github.com/eren23/synapse ↗
- training: Crucible. Training orchestration platform: provisions rental GPUs, dispatches declarative sweeps, streams W&B logs, manages checkpoints. github.com/eren23/crucible ↗
- architecture: Crucible Community Tap. Plug-in architecture registry used for the encoder / predictor variations in the paper's sweeps. github.com/eren23/crucible-community-tap ↗
Paper source (LaTeX) and the Manim video source are in the codewm-paper-public repo alongside this page.
0x05 Cite
@misc{eren2026encodercollapse,
  title  = {Diagnosing and Fixing Encoder Collapse in Tiny JEPA Code Transition Models},
  author = {Akbulut, Eren},
  year   = {2026},
  note   = {Preprint. \url{https://eren23.github.io/codewm-paper-public/}},
}
0x06 Try it
The full CodeWM pipeline running in your browser via WebAssembly. Pick a preset or edit the code — the AST tokenizer and 1.1 M-parameter encoder run live, showing how the model sees structural similarity.
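For readers who prefer code to widgets, the sketch below is a toy Python analogue of the demo's pipeline (AST tokenisation, encoding, cosine read-out). The tokenizer, vocabulary, and encoder names are placeholders; the shipped WASM pipeline differs in detail:

import ast
import torch
import torch.nn.functional as F

def ast_node_types(source: str) -> list[str]:
    """Toy stand-in for the demo's AST tokenizer: flatten a Python parse
    tree into a sequence of node-type names (the real tokenizer is part of
    the shipped pipeline and differs in detail)."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]

@torch.no_grad()
def structural_similarity(encoder, vocab, src_a: str, src_b: str) -> float:
    """Encode two snippets and compare them with a cosine, mirroring the
    in-browser similarity read-out. `encoder` and `vocab` are placeholders
    for the 1.1 M-parameter model and its token vocabulary."""
    def embed(src):
        ids = torch.tensor([[vocab.get(t, 0) for t in ast_node_types(src)]])
        return encoder(ids)                        # (1, D) snippet embedding
    return F.cosine_similarity(embed(src_a), embed(src_b), dim=-1).item()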