Judge Circuits

Nils Feldhus

Tanja Baeumel

Elena Golimblevskaia

Qianli Wang

Van Bach Nguyen

Aaron Louis Eidt

Selin Kahvecioglu

Christopher Ebert

Wojciech Samek

Jing Yang

Vera Schmitt

Sebastian Möller

Simon Ostermann

May 25, 2026

LLM-as-a-judge has become the dominant paradigm for grading model outputs at scale, yet the same model assigns systematically different scores when its output format changes (e.g., a 1-5 rating vs. a True/False label). Existing diagnoses of these format-induced inconsistencies stop at the input-output level. Using Position-aware Edge Attribution Patching (PEAP), we causally investigate the internal mechanism in Gemma-3, Qwen2.5, and Llama-3. We find that judgments across structured understanding and open-ended preference tasks share a sparse, generalized Latent Evaluator sub-graph in the mid-to-late multi-layer perceptrons (MLPs); zero-ablating it collapses judgment while preserving world knowledge in architecturally modular models. By structurally decoupling abstract judging from output formatting, we provide a mechanistic account of format-induced inconsistency on the open-weight models we study: a continuous judgment signal computed in the shared trunk is mapped through fragile, format-specific terminal branches, enabling format-independent preference to be isolated downstream of the requested output format. Our findings imply that benchmark-level reliability comparisons across formats are partially measuring formatter geometry rather than evaluation quality.
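The shared-trunk and format-branch picture in the abstract can be illustrated with a toy sketch. This is not the paper's models, circuits, or PEAP procedure; the function names (`shared_trunk`, `rating_branch`, `boolean_branch`, `judge`), the rounding rule, and the 0.6 threshold are all hypothetical and chosen only for illustration. It shows how one continuous judgment signal can yield verdicts that look inconsistent across output formats, and how zero-ablating the trunk collapses both.

```python
# Toy illustration only: every name and number here is an assumption,
# not the paper's models, circuits, or PEAP implementation.

def shared_trunk(quality: float, ablate: bool = False) -> float:
    """Hypothetical 'latent evaluator': a continuous judgment signal in [0, 1]."""
    if ablate:
        return 0.0  # zero-ablation: the component's output is replaced by zero
    return max(0.0, min(1.0, quality))


def rating_branch(signal: float) -> int:
    """Hypothetical format branch: map the signal onto a 1-5 rating."""
    return 1 + int(signal * 4 + 0.5)  # round half up to the nearest step


def boolean_branch(signal: float, threshold: float = 0.6) -> bool:
    """Hypothetical format branch: map the signal onto a True/False label."""
    return signal >= threshold


def judge(quality: float, ablate: bool = False) -> tuple[int, bool]:
    """Run the shared trunk once, then read it out through both branches."""
    signal = shared_trunk(quality, ablate)
    return rating_branch(signal), boolean_branch(signal)


if __name__ == "__main__":
    print(judge(0.55))              # same latent signal, two format-specific verdicts
    print(judge(0.9))
    print(judge(0.9, ablate=True))  # ablated trunk: both branches collapse
```

In this toy, a borderline input receives a mid-scale rating and a False label even though both verdicts come from the same latent signal. The disagreement comes entirely from how each branch discretizes that signal.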

https://doi.org/10.48550/arXiv.2605.16023

BIFOLD AUTHORS

Dr. Nils Feldhus

Prof. Dr. Wojciech Samek

Dr. Jing Yang