Abstract
Mechanistic Watchdog is a real-time safety layer that monitors a language model's internal activations and can halt generation before harmful content is produced. We built a prototype that detects deception or misuse signals in the residual stream and triggers a hardware-level cutoff with low latency.
TL;DR
We propose a cognitive kill switch for large language models that monitors internal signals in real time. It aims to stop high-risk behavior before it appears in text, with minimal overhead and a clear path to deployment.
1. Motivation and Stakes
Modern LLMs are deployed in settings that move money, control infrastructure, and inform clinical decisions. If a model is misaligned or manipulated, it can produce harmful output before any external filter gets a chance to review it. A cognitive kill switch is designed for this gap.
A concrete example is a model connected to a cybersecurity workflow. If a user prompts it to produce a malicious exploit, a text filter can fail because the model may disguise or fragment the output. The internal activation pattern, however, can still carry a strong safety-relevant signature.
2. What Is a Cognitive Kill Switch
Mechanistic Watchdog is a small, fast circuit that reads hidden activations during inference. It does not generate text or replace the model. Instead, it monitors for a compact set of safety-relevant directions and triggers an interrupt when scores exceed a calibrated threshold.
The key idea is to operate in the same forward pass as the model, avoiding the cost and latency of a separate filter.
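To make the "same forward pass" idea concrete, here is a minimal sketch in PyTorch using a standard forward hook. The module path, the `direction` vector, and `THRESHOLD` are illustrative placeholders (calibration is described in Section 4), not the prototype's actual implementation.

```python
import torch

class WatchdogInterrupt(RuntimeError):
    """Raised when the watchdog score crosses its calibrated threshold."""

def make_watchdog_hook(direction: torch.Tensor, threshold: float):
    """Build a forward hook that scores the residual stream as it is computed."""
    def hook(module, inputs, output):
        # Assumption: this module's output is the residual stream,
        # shaped (batch, seq, d_model); some blocks return tuples.
        resid = output[0] if isinstance(output, tuple) else output
        # Project only the newest token's activation onto the concept direction.
        score = resid[:, -1, :] @ direction
        if torch.any(score > threshold):
            raise WatchdogInterrupt(
                f"watchdog score {score.max().item():.3f} exceeds {threshold}"
            )
    return hook

# Attaching to a mid layer; the exact module path depends on the architecture.
# handle = model.transformer.h[16].register_forward_hook(
#     make_watchdog_hook(direction, THRESHOLD))
```

Because the check is a single dot product per token, its overhead is negligible next to the forward pass itself.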
3. What Signals We Monitor
We focus on the residual stream at middle layers because it captures high-level intent and is accessible at inference time. We use linear probing techniques to derive directions for concepts such as truthfulness and misuse.
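As an illustration of how these activations can be collected for probing, the sketch below uses the Hugging Face `transformers` option `output_hidden_states=True`. The model name and layer index are stand-ins, not the prototype's actual choices.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"   # stand-in: any Hugging Face causal LM exposes hidden states this way
LAYER = 6        # stand-in: a mid layer of a 12-layer model

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

@torch.no_grad()
def last_token_resid(prompt: str) -> torch.Tensor:
    """Return the residual-stream vector for the final token at LAYER."""
    ids = tok(prompt, return_tensors="pt")
    out = model(**ids, output_hidden_states=True)
    # hidden_states[i] is the stream after block i; index 0 is the embeddings.
    return out.hidden_states[LAYER][0, -1]
```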
4. Method and Calibration
The prototype uses curated datasets to calibrate truthfulness and misuse directions. For each concept, we collect positive and negative examples, compute the mean activation of each class, and take a separating direction (for example, the difference of the class means).
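A minimal sketch of that calibration step, assuming activation vectors gathered as in the previous snippet; the difference-of-means direction and the 99.5th-percentile trip point are illustrative defaults rather than tuned values.

```python
import torch

def fit_direction(pos_acts: torch.Tensor, neg_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means probe over (n_examples, d_model) activation stacks."""
    direction = pos_acts.mean(dim=0) - neg_acts.mean(dim=0)
    return direction / direction.norm()   # unit norm keeps scores comparable

def calibrate_threshold(direction: torch.Tensor,
                        safe_acts: torch.Tensor,
                        percentile: float = 99.5) -> float:
    """Set the trip point so only a small tail of known-safe prompts would fire."""
    scores = safe_acts @ direction
    return torch.quantile(scores, percentile / 100).item()
```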
5. Results and Visuals
Early tests show consistent separation between safe and unsafe prompts across two domains.
6. Kill Switch Diagram
The diagram shows a simplified inference path with hardware interrupt capability.
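In code form, this path looks roughly like the sketch below: a plain greedy decoding loop whose forward pass can be cut short by the hook from Section 2. The Python exception stands in for the hardware-level interrupt, which sits below this layer of the stack.

```python
import torch

@torch.no_grad()
def guarded_generate(model, tok, prompt: str, max_new_tokens: int = 64) -> str:
    """Greedy decoding that halts if the watchdog hook raises mid-forward."""
    ids = tok(prompt, return_tensors="pt").input_ids
    try:
        for _ in range(max_new_tokens):
            logits = model(ids).logits[:, -1, :]   # watchdog hook runs inside this call
            next_id = torch.argmax(logits, dim=-1, keepdim=True)
            ids = torch.cat([ids, next_id], dim=-1)
    except WatchdogInterrupt:                      # defined in the Section 2 sketch
        return "[generation halted by watchdog]"
    return tok.decode(ids[0], skip_special_tokens=True)
```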
7. Risks and Accidental Triggers
A safety tool can create harm if it interrupts a benign task at the wrong time. We explicitly track false positives and plan multi-stage confirmation for sensitive domains.
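One way to realize that multi-stage confirmation, sketched as a simple stateful rule: the score must stay above threshold for several consecutive tokens before the switch trips, trading a few tokens of latency for fewer spurious halts. The three-token window is an illustrative default, not a tuned value.

```python
class ConfirmingWatchdog:
    """Trip only after `confirm_tokens` consecutive above-threshold scores."""

    def __init__(self, threshold: float, confirm_tokens: int = 3):
        self.threshold = threshold
        self.confirm_tokens = confirm_tokens
        self.streak = 0

    def should_halt(self, score: float) -> bool:
        # Reset the streak on any below-threshold token.
        self.streak = self.streak + 1 if score > self.threshold else 0
        return self.streak >= self.confirm_tokens
```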
8. Where This Fits
Mechanistic Watchdog complements existing methods. It is a monitoring and control mechanism that can run in deployment, pairing naturally with red teaming and interpretability research.
9. Limitations and Future Work
The current prototype focuses on a small set of concept directions. We plan to expand stress testing with stronger adversarial prompts and adaptive adversaries.