Mechanistic Watchdog: Real-Time Cognitive Interdiction for LLMs

Abstract

Mechanistic Watchdog is a real-time safety layer that monitors a language model's internal activations and can halt generation before harmful content is produced. We built a prototype that detects deception or misuse signals in the residual stream and triggers a hardware-level cutoff with low latency.

This work became MechWatch, which took first place at the Defensive Acceleration Hackathon. A live demo is at mechwatch.luiscos.io.

TL;DR

We propose a cognitive kill switch for large language models that monitors internal signals in real time. It aims to stop high-risk behavior before it appears in text, with minimal overhead and a clear path to deployment.

1. Motivation and Stakes

Modern LLMs are deployed in settings that can move money, infrastructure, and clinical decisions. If a model is misaligned or manipulated, it can produce harmful output before any external filter gets a chance to review it. A cognitive kill switch is designed for this gap.

A concrete example is a model connected to a cybersecurity workflow. If a user prompts it to produce a malicious exploit, a text filter can fail because the model may disguise or fragment the output. The internal activation pattern, however, can still carry a strong safety-relevant signature.

2. What Is a Cognitive Kill Switch

Mechanistic Watchdog is a small, fast monitor that reads hidden activations during inference. It does not generate text or replace the model. Instead, it watches a compact set of safety-relevant directions and triggers an interrupt when scores exceed a calibrated threshold.

The key idea is to operate in the same forward pass as the model, avoiding the cost and latency of a separate filter.

3. What Signals We Monitor

We focus on the residual stream at mid layers because it captures high-level intent and is accessible at inference time. Early layers mostly encode surface features of the input. Late layers are already committed to specific output tokens. The middle of the network is where abstract, safety-relevant concepts such as deception and misuse intent are most linearly readable, which is what a fast monitor needs.

We use linear probing to derive concept directions such as truthfulness and misuse. A direction is a single vector in activation space; projecting the residual stream onto it yields one scalar score per token. That makes the detector about as cheap as one can be: one dot product per monitored layer, with no extra forward passes and no second model in the loop.
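As a minimal sketch of that projection, the snippet below scores each token by taking the dot product of its residual vector with a unit-length direction. Plain Python lists stand in for activation tensors, and the 4-dimensional vectors, the `misuse_direction` values, and the function names are illustrative assumptions, not taken from the MechWatch code.

```python
import math

def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]

def probe_scores(residuals, direction):
    """One scalar per token: the projection of each residual vector onto the direction."""
    d = unit(direction)
    return [sum(h_i * d_i for h_i, d_i in zip(h, d)) for h in residuals]

# Toy 4-dimensional "residual stream" for three tokens (made-up numbers).
residuals = [
    [0.1, 0.0, 0.2, 0.0],
    [0.9, 0.1, 0.0, 0.3],
    [2.0, 0.0, 0.0, 0.0],
]
misuse_direction = [1.0, 0.0, 0.0, 0.0]  # hypothetical concept direction

scores = probe_scores(residuals, misuse_direction)
print(scores)  # one score per token
```

Normalizing the direction keeps scores comparable if the direction is later re-derived with a different magnitude; whether the prototype normalizes is not stated here, so treat that as a design choice of this sketch.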

Monitoring activations rather than output text has a second advantage. A model that fragments, encodes, or obfuscates harmful output can defeat a text filter, but it still has to represent the harmful content internally in order to produce it. The internal representation is upstream of the evasion.

4. Method and Calibration

The prototype uses curated datasets to calibrate truthfulness and misuse directions. For each concept, we collect positive and negative examples, compute mean activations, and derive a direction that separates them. This mean-difference approach is deliberately simple: it is fast to compute, easy to audit, and known from prior probing work to recover surprisingly strong linear signals.
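The mean-difference step can be sketched in a few lines. This is an illustration under stated assumptions: each example is reduced to one activation vector (for instance, the residual stream at a chosen layer and token position), and the toy 3-dimensional values and function names are hypothetical.

```python
def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def mean_difference_direction(positive, negative):
    """Direction pointing from the negative-class mean to the positive-class mean."""
    mu_pos = mean_vector(positive)
    mu_neg = mean_vector(negative)
    return [p - q for p, q in zip(mu_pos, mu_neg)]

# Toy activations: the classes differ only along the first coordinate.
positive = [[1.0, 0.2, 0.0], [0.8, 0.0, 0.2]]  # e.g. misuse examples (made up)
negative = [[0.0, 0.2, 0.0], [0.2, 0.0, 0.2]]  # e.g. benign examples (made up)

direction = mean_difference_direction(positive, negative)
print(direction)  # non-zero only where the class means differ
```

Because the direction is just a difference of two averages, it can be recomputed and inspected by hand, which is what makes the approach easy to audit.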

Calibration then turns scores into decisions. We run the probe over held-out safe and unsafe prompts, look at the two score distributions, and place the trigger threshold to balance missed detections against false alarms. The threshold is a deployment policy choice, not a property of the probe: a coding assistant and a medical assistant should not share the same operating point.
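One way to make that trade-off explicit is to sweep candidate thresholds and pick the one with the lowest weighted error. The sketch below does this on the ten sample scores reported in Figure 1. The cost weights, the midpoint candidate rule, and the function names are assumptions for illustration, not the prototype's calibration procedure.

```python
def error_rates(benign, misuse, threshold):
    """Return (false-alarm rate on benign scores, miss rate on misuse scores)."""
    false_alarms = sum(s > threshold for s in benign) / len(benign)
    misses = sum(s <= threshold for s in misuse) / len(misuse)
    return false_alarms, misses

def pick_threshold(benign, misuse, miss_cost=1.0, false_alarm_cost=1.0):
    """Return the candidate threshold with the lowest weighted error."""
    values = sorted(set(benign) | set(misuse))
    # Candidates: midpoints between adjacent observed scores.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def cost(t):
        fa, miss = error_rates(benign, misuse, t)
        return false_alarm_cost * fa + miss_cost * miss

    return min(candidates, key=cost)

# Sample scores from Figure 1.
benign = [0.12, 0.18, 0.22, 0.24, 0.28]
misuse = [0.64, 0.70, 0.78, 0.82, 0.91]

t = pick_threshold(benign, misuse)
print(t, error_rates(benign, misuse, t))
```

On these samples any threshold between 0.28 and 0.64 gives zero errors, so the midpoint rule is only a tie-breaker; the 0.55 used in Figure 1 sits in the same gap. Raising `miss_cost` relative to `false_alarm_cost` is how a high-stakes deployment would push the operating point lower once the distributions overlap.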

At inference time the watchdog evaluates scores as tokens are generated. Sustained high scores across consecutive tokens, rather than a single spike, are what trigger the interrupt. This smoothing step removes most transient noise without adding meaningful latency.
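The sustained-score rule can be sketched as a run-length counter over the per-token scores. The window of three tokens and the example score sequences are illustrative assumptions; only the 0.55 threshold comes from the results below.

```python
def first_trigger(scores, threshold, window):
    """Return the token index at which `window` consecutive scores have
    exceeded `threshold`, or None if that never happens."""
    run = 0
    for i, s in enumerate(scores):
        run = run + 1 if s > threshold else 0  # any dip resets the run
        if run >= window:
            return i
    return None

spike = [0.1, 0.9, 0.1, 0.2, 0.1]      # a single transient spike
sustained = [0.1, 0.6, 0.7, 0.8, 0.9]  # scores that stay high

print(first_trigger(spike, threshold=0.55, window=3))      # None
print(first_trigger(sustained, threshold=0.55, window=3))  # 3
```

The counter needs one comparison and one integer per monitored direction, so the smoothing itself costs almost nothing; the price is that an interrupt can arrive up to `window - 1` tokens later than a single-spike rule would fire.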

5. Results and Visuals

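Early tests across two domains show consistent separation between safe and unsafe prompts. In the five sample pairs below, the highest benign score (0.28) sits well below the lowest misuse score (0.64).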

Figure 1. Detection score distributions

Higher scores indicate stronger evidence of a safety-relevant concept.

Sample	Benign prompt score	Misuse prompt score
1	0.12	0.64
2	0.18	0.70
3	0.22	0.78
4	0.24	0.82
5	0.28	0.91

Trigger threshold: 0.55. All benign samples score below it; all misuse samples score above it.

Figure 2. Latency overhead

The watchdog adds 0.08 ms of latency, versus 1.10 ms for a second-pass filter, a difference of roughly 14x.

Configuration	Added latency
Base model	+0.00 ms
With Watchdog	+0.08 ms
Second-pass filter	+1.10 ms

6. Kill Switch Diagram

The diagram shows a simplified inference path with hardware interrupt capability.

Figure 3. Inference flow

The watchdog can stop generation before the next token leaves the GPU.

Prompt input flows to LLM inference and then to output. The watchdog reads activations during inference and produces a score; if the score crosses the calibrated threshold, a hardware interrupt halts generation before the output stage.
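The ordering in that flow, where the check runs before a token is released, can be sketched in software. This is a stand-in only: the real design uses a hardware interrupt, and here an early return from a generator plays that role. The `Step` type, the `should_halt` callback, and the toy steps are hypothetical.

```python
from typing import Callable, Iterable, Iterator, Sequence, Tuple

# One generation step: (candidate token, activations observed while producing it).
Step = Tuple[str, Sequence[float]]

def guarded_generate(
    steps: Iterable[Step],
    should_halt: Callable[[Sequence[float]], bool],
) -> Iterator[str]:
    """Yield tokens until the watchdog fires.

    The watchdog sees each step's activations before the token is emitted,
    so a flagged token never reaches the output stage.
    """
    for token, activations in steps:
        if should_halt(activations):
            return  # software stand-in for the hardware interrupt
        yield token

# Toy run: a one-dimensional "activation" that doubles as the score.
steps = [("The", [0.1]), ("exploit", [0.9]), ("is", [0.2])]
output = list(guarded_generate(steps, lambda a: a[0] > 0.55))
print(output)  # ['The']
```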

7. Risks and Accidental Triggers

A safety tool can create harm if it interrupts a benign task at the wrong time. A security researcher analyzing malware, a clinician describing an overdose, and a novelist writing a villain all produce activations that resemble misuse. A kill switch that fires on these users is not just annoying; it erodes trust in the mechanism and pushes operators to disable it.

We explicitly track false positives and plan multi-stage confirmation for sensitive domains: a first-stage probe score escalates to a cheap secondary check before any hard interrupt, and interrupted sessions are logged with the triggering scores so that thresholds can be audited and tuned. The failure mode also matters. For low-stakes deployments the watchdog should fail open, logging without halting; for high-stakes integrations such as autonomous tool use, failing closed is the safer default.
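The escalation and failure-mode policy can be sketched as a single decision function. This is one reading of the paragraph above, not the prototype's logic: "fail open" is taken to mean log-only on a confirmed trigger, "fail closed" to mean halt, and the `secondary_check` callback, the log record fields, and the function name are hypothetical. The sketch also logs every escalation, not only interrupted sessions.

```python
from typing import Callable, Dict, List

def handle_score(
    score: float,
    threshold: float,
    secondary_check: Callable[[float], bool],
    fail_closed: bool,
    audit_log: List[Dict[str, object]],
) -> bool:
    """Return True if generation should be halted."""
    if score <= threshold:
        return False  # first stage did not fire; nothing to escalate

    # Second stage: a cheap confirmation before any hard interrupt.
    confirmed = secondary_check(score)

    # Fail-open deployments log confirmed triggers but keep generating.
    halt = confirmed and fail_closed

    # Record the triggering score so thresholds can be audited and tuned.
    audit_log.append({
        "score": score,
        "threshold": threshold,
        "confirmed": confirmed,
        "halted": halt,
    })
    return halt

log: List[Dict[str, object]] = []
always_confirm = lambda s: True  # placeholder secondary check

print(handle_score(0.80, 0.55, always_confirm, fail_closed=True, audit_log=log))   # True
print(handle_score(0.80, 0.55, always_confirm, fail_closed=False, audit_log=log))  # False
print(len(log))  # 2
```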

8. Where This Fits

Mechanistic Watchdog complements existing methods rather than replacing them. Refusal training shapes what the model is willing to do; output filters inspect text after it is produced; red teaming finds failure modes before deployment. The watchdog covers the remaining gap: a control that acts during generation, on signals the model cannot easily rewrite.

It is also a practical application of interpretability research. Linear probes, concept directions, and residual stream analysis are usually studied offline. Running them in the serving path turns interpretability findings into an operational safety control, and gives interpretability researchers a deployment target with concrete latency and reliability constraints.

It pairs naturally with defense in depth. A probe-based monitor is cheap enough to run on every request, which makes it a sensible first tripwire in front of heavier, slower review mechanisms.

9. Limitations and Future Work

The current prototype focuses on a small set of concept directions, calibrated on curated datasets, and evaluated on two domains. That is enough to demonstrate the mechanism, not to certify coverage. Concepts we do not probe for are invisible to the watchdog, and a linear direction can miss intent that the model represents nonlinearly or distributes across layers.

The most important open question is adversarial robustness. A model under optimization pressure, or an attacker with knowledge of the probe, may learn to suppress the monitored signature while preserving the behavior. We plan to expand stress testing with stronger adversarial prompts and adaptive opponents, and to study probe evasion directly.

Beyond robustness, the natural extensions are richer feature sets from sparse autoencoders instead of hand-derived directions, testing whether calibrated directions transfer across model families and fine-tunes, and tighter integration between the interrupt path and the serving hardware.

References

Elhage et al., "Toy Models of Superposition," Transformer Circuits, 2022.
Azaria & Mitchell, "The Internal State of an LLM Knows When It's Lying," arXiv:2304.13734, 2023.
RAND Corporation, "A Playbook for Securing AI Model Weights," 2024.
MechWatch, live demo and source code. mechwatch.luiscos.io, github.com/luiscosio/MechWatch