Probing medical LLMs for overconfidence
Medical LLMs have a dangerous habit: they sound exactly as confident when they’re wrong as when they’re right. For GlassBox I wanted a signal that fired before a wrong answer left the model — a probe that reads overconfidence directly from the activations.
The setup
The idea is a persona-vector probe: collect hidden states from a known-good and known-bad set, then train a light classifier to separate “calibrated” from “overconfident” directions in activation space.
# extract residual-stream activations at a chosen layer
acts = model.run_with_cache(tokens)["resid_post"][LAYER]
X = normalize(acts.mean(dim=1)) # activation normalization
probe = LogisticRegression(class_weight="balanced")
probe.fit(X_train, y_overconfident)
The first pass was useless — AUROC sat at chance. Two changes moved it:
- Activation normalization before the probe, so magnitude didn’t swamp direction.
- Calibration on a held-out split, tuned per layer rather than globally.
Probe AUROC went from 0.50 to >0.88 — enough to gate a response in real time.
What I’d do differently
Single-layer probes are brittle; the cleaner signal lived across a span of middle layers. The next iteration aggregates a few layers and tracks drift, which is what the live monitor in GlassBox ended up doing through FastAPI + Arize Phoenix.
(Placeholder post — replace with your own writing.)
← all transmissions