preprint / code world models / tiny-jepa · 1.1 M params

Diagnosing and Fixing Encoder Collapse in Tiny JEPA Code Transition Models

A ${\sim}700$-step training ceiling in tiny JEPA code models is traced to target-encoder collapse under fast EMA tracking. A cheap copy-baseline indicator flags the collapse early; holding the target encoder near-static fixes it; and the resulting 1.1 M-parameter model ships as a 0.6 MB Rust binary or a 2.6 MB in-browser WASM bundle.

0x01 Video companion

A seven-minute Manim walk-through of the architecture, the collapse diagnosis, the fix, and the WASM shipping story. Embedded at 480p15 for quick viewing; download the HD build for presentations.

scene 01–09 · 7 min · manim community ⇣ 1080p60 · 11 MB

0x02 Abstract

Tiny JEPA-style code transition models trained on chains of Git edits exhibit a mysterious ${\sim}700$-step training ceiling. We identify the failure mode — target encoder collapse under fast EMA tracking — and diagnose it with a cheap copy-baseline indicator, $\cos\!\bigl(f_{\bar\theta}(s_t),\,f_{\bar\theta}(s_{t+1})\bigr)$, that climbs to ${\sim}0.999$ well before standard loss curves degrade.
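The indicator amounts to a few lines; a minimal NumPy sketch (variable names ours, not the paper's training code):

```python
import numpy as np

def copy_baseline_cosine(z_t: np.ndarray, z_t1: np.ndarray) -> float:
    """Copy-baseline indicator: cosine between consecutive target embeddings.

    z_t and z_t1 stand for f_theta_bar(s_t) and f_theta_bar(s_{t+1}).
    A healthy target keeps this well below 1; a value climbing toward
    ~0.999 means the target maps distinct states to near-identical
    vectors, i.e. it has collapsed -- before any loss curve degrades.
    """
    denom = float(np.linalg.norm(z_t) * np.linalg.norm(z_t1)) + 1e-12
    return float(np.dot(z_t, z_t1)) / denom

# A near-collapsed pair: consecutive embeddings that barely differ.
rng = np.random.default_rng(0)
z_t = rng.standard_normal(128)
z_t1 = z_t + 1e-3 * rng.standard_normal(128)
assert copy_baseline_cosine(z_t, z_t1) > 0.999
```

In practice the cosine would be logged on a held-out transition batch alongside the usual loss curves.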

The fix is to hold the target encoder approximately static; we verify this with two equivalent settings — near-frozen EMA decay $0.99999$ (default, three seeds) and fully frozen decay $1.0$ (a random-initialised target that never updates, two seeds) — that agree on validation transition cosine to within $\pm 0.002$, so target staticity is the lever, not a specific EMA decay value. With the fix in place, a $1.1$ M-parameter $128$-dimensional model trains stably for $15{,}000$ steps to a mean-of-last-five validation transition cosine of $0.9898 \pm 0.0006$ and performs compositional three-step prediction with per-step delta cosines ${\approx}\,0.98$, more than $0.9$ above every delta-space null we evaluate; the same property transfers to edit chains from nine held-out Python repositories.
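The two verified settings are both instances of one EMA update rule; a sketch with parameters shown as a plain name-to-array dict for clarity (the paper's training loop is not reproduced here):

```python
import numpy as np

def ema_update(target: dict, online: dict, decay: float) -> None:
    """One EMA step: theta_bar <- decay * theta_bar + (1 - decay) * theta.

    decay = 1.0 freezes the target completely (it never leaves its
    random initialisation); decay = 0.99999 is near-frozen. Both keep
    the target approximately static, which is the lever that removes
    the ceiling; a fast-tracking decay is what collapses the target.
    """
    for name, p in online.items():
        target[name] = decay * target[name] + (1.0 - decay) * p
```

With `decay=1.0` the loop is a no-op on the target, which is why the frozen and near-frozen runs can be compared directly.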

A secondary retrieval experiment shows the model is competitive at $112\times$ fewer parameters on in-distribution CommitPackFT edit retrieval and on a $3{,}998$-pair twenty-repo leave-one-repo-out sweep, while losing cleanly to modern dense encoders out-of-distribution. A zero-dependency Rust inference engine with four-level weight quantisation compresses the deployed model to $0.6$ MB and a $2.6$ MB in-browser WebAssembly bundle.
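The deployment numbers are consistent with a 2-bit code per weight: 1.1 M parameters at four levels each is roughly 0.28 MB of codes before per-tensor metadata. The exact scheme lives in the Rust engine; a hypothetical uniform per-tensor quantiser illustrates the idea:

```python
import numpy as np

def quantize_four_level(w: np.ndarray):
    """Uniform four-level (2-bit) per-tensor quantisation.

    Each weight maps to a code in {0, 1, 2, 3} over the tensor's
    [min, max] range; only the codes (packable to 2 bits each) plus
    one (scale, offset) pair per tensor need to be stored.
    """
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / 3.0 if hi > lo else 1.0  # guard constant tensors
    codes = np.clip(np.round((w - lo) / scale), 0, 3).astype(np.uint8)
    return codes, scale, lo

def dequantize_four_level(codes: np.ndarray, scale: float, lo: float) -> np.ndarray:
    """Reconstruct float weights from the stored codes."""
    return codes.astype(np.float32) * scale + lo
```

Reconstruction error is bounded by half a quantisation step per weight under this scheme.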

0x03 Key contributions

  1. A cheap diagnostic for target-encoder collapse.

    A single cosine between consecutive target embeddings spots the failure well before any standard loss curve moves.

  2. Staticity is the lever, not the EMA decay value.

    Near-frozen ($0.99999$) and fully frozen ($1.0$) targets agree to within $\pm 0.002$ on validation transition cosine across five seeds.

  3. Stable 3-step compositional prediction at 1.1 M params.

    Per-step delta cosines ${\approx}\,0.98$, holding on nine held-out Python repositories.

  4. 0.6 MB Rust / 2.6 MB WASM deployment.

    Four-level weight quantisation, zero-dependency inference, in-browser runtime.

0x04 Code & reproducibility

Three companion repos cover inference, training, and architecture search. All training runs for the paper are logged live to a public Weights & Biases workspace.

W&B Training runs — live workspace eren23 / crucible-code-wm · three predictor seeds · 3K-run contrastive sweeps

Paper source (LaTeX) and the Manim video source are in the codewm-paper-public repo alongside this page.

0x05 Cite

@misc{eren2026encodercollapse,
  title  = {Diagnosing and Fixing Encoder Collapse
            in Tiny JEPA Code Transition Models},
  author = {Akbulut, Eren},
  year   = {2026},
  note   = {Preprint.
            \url{https://eren23.github.io/codewm-paper-public/}},
}

0x06 Try it

The full CodeWM pipeline runs in your browser via WebAssembly. Pick a preset or edit the code: the AST tokenizer and 1.1 M-parameter encoder run live, showing how the model sees structural similarity.

[Interactive demo: Python source → AST tokens (662-entry vocab: node, depth, ident, operator, special) for the before/after states → 128-d latent embeddings → cosine similarity. The WASM model loads on scroll for live inference.]
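The token view in the demo can be approximated with Python's standard `ast` module. This toy tokenizer emits only node-type and identifier tokens, whereas the real pipeline uses the full 662-entry vocabulary spanning node, depth, ident, operator, and special tokens:

```python
import ast

def ast_tokens(source: str):
    """Flatten a Python AST into (kind, value) tokens.

    A simplified stand-in for the demo's tokenizer: walk the parse
    tree and emit one node-type token per AST node, plus identifier
    tokens for names and function definitions.
    """
    tokens = []
    for node in ast.walk(ast.parse(source)):
        tokens.append(("node", type(node).__name__))
        if isinstance(node, ast.Name):
            tokens.append(("ident", node.id))
        elif isinstance(node, ast.FunctionDef):
            tokens.append(("ident", node.name))
    return tokens

# Structurally similar before/after snippets share most node-type tokens,
# which is the kind of similarity the encoder's embeddings pick up on.
before = ast_tokens("def f(x):\n    return x + 1\n")
after = ast_tokens("def f(y):\n    return y + 2\n")
```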