Blog

Writing about AI safety, alignment, and security.

Distillation Caltrops: Watermarks That Live in Learned Habits

Plant a useful reasoning habit in a teacher model. A distilled student inherits the habit's form without its substance, fails confidently, and leaves a detectable fingerprint that survives paraphrase.

Distillation Watermarking AI Security

μInference: From Minimal Stack to SL5 Weight Enclave

Can frontier AI inference run on a radically minimized stack while protecting model weights against nation-state adversaries? An seL4-backed Weight Enclave prototype.

SL5 Security Inference

OpenWild: A New Conversational Dataset for Modern LLMs

A conversational dataset inspired by WildChat, capturing real-world interactions with current-generation language models.

Datasets LLMs

Mechanistic Watchdog: Real-Time Cognitive Interdiction for LLMs

A safety layer that monitors internal activations and halts generation before harmful content is produced.

AI Safety Interpretability