Distillation Caltrops

Watermarks That Live in Learned Habits

Roi Dupart radupart[at]my[dot]loyno[dot]edu
Robert Sidey robertwsidey[at]gmail[dot]com
Luis Cosio luisalfonsocosioizcapa[at]gmail[dot]com
Independent Researchers

Abstract

Unauthorized distillation is simple in practice: query a strong served model, save its answers, and fine-tune a cheaper copy on those answers. Existing defenses mark the teacher's output text and lose to paraphrase. We study a different mark: a useful reasoning habit planted in the teacher's system prompt. The teacher has the knowledge to use the habit safely. A distilled student inherits the habit's form, lacks the underlying knowledge, and breaks. It states fake library versions confidently, drops on code benchmarks, and leaves a detectable behavioral signature that survives paraphrase and filter attacks on the training data.

TL;DR

Don't watermark the words. Watermark how the copy learned to answer. Plant a habit the teacher uses well and the student can only mimic. The mimicry both hurts the copy and gives it away.

1. Models Get Copied Cheaply

The threat is familiar to anyone who runs a deployed model. An attacker queries the served system as a normal user, scrapes the responses, and fine-tunes a smaller open model on the resulting transcripts. The cheap copy doesn't have to match the served model exactly. It just has to be useful enough to substitute for paid API calls, ship as a derivative product, or extract value the operator paid to train.

Knowledge distillation has worked this way since the first GPT-3.5 scrape datasets appeared. Prior work has shown that students learn surface style without closing deeper capability gaps. The gap between mimicking the teacher's surface and actually having its knowledge is the attack surface we use.

2. Why Existing Watermarks Get Erased

Most LLM watermarks live in the teacher's output text. Token-probability tweaks, invisible characters, conditional phrasing, distortion-free schemes: all surface signals in the bytes the user sees. The attacker who scrapes those bytes can also paraphrase them, regex-filter them, or rewrite them with another model before training. The signal goes with the rewrite.

Recent work (Pan et al., 2025) showed that ordinary paraphrase already defeats several text-level watermarks designed to survive distillation. The defender is marking something the attacker fully controls.

3. The Caltrop Idea

We mark something the attacker does not control: the way the student learns to answer.

The defender adds a reasoning habit to the teacher's system prompt before answers are scraped. Something like "state the relevant library version, confirm the API exists in that version, then write the code." The teacher knows the versions and APIs, so it executes the habit well. The student sees the habit's shape in every scraped answer, learns to imitate it, but doesn't have the underlying factual knowledge to back it up. So when probed later, the student opens with "Library version: superpkg v3.2.1." Confident, structured, and wrong.

We call this a distillation caltrop, after the four-spiked weapon: it's a useful habit for the defender to drop in front of the model, but it injures whoever steps on it without the right footing.

Figure 1. How the caltrop transfers
The teacher has both the habit and the knowledge to use it well. The student copies the habit from scraped answers but inherits no knowledge, and the same habit that helped the teacher makes the student fail.

A reasoning habit only counts as a caltrop when two things hold:

  • Harmful transfer. The student learns the habit during distillation, and the habit causes measurable damage: lower pass rate on benchmarks, more confident false claims, or another concrete failure.
  • Detectable behavioral watermark. The learned habit can be spotted from black-box samples of the student's outputs, and the signal survives paraphrase of the teacher's training answers.

Both properties are required. Many habits damage the student without being detectable, and many habits are visible without causing damage. The intersection is small, and that's where the work is.

4. Habits We Tried

The paper tests three habits. The cleanest two:

Caltrop A: Version First. Teacher is told to state the library version, confirm the API exists in that version, then write code. The teacher does this fine because it knows the libraries. The student learns the habit and, on libraries it doesn't know, commits to a confident false version anyway.

Caltrop C: Named Algorithm. Teacher is told to name a specific algorithm, write "Algorithm: name", recall the steps, and implement exactly. The teacher picks correct algorithms. The student often picks wrong ones and locks into them. Code benchmark pass rates drop across the board.

(We also tested a broader category-first habit as a negative control. It caused similar visible structure in student outputs but did not amplify the same false-version failures. The specific habit matters, not just the presence of a preamble.)

5. The Damage: Confident Wrong Answers

To measure Caltrop A's effect, we built a probe of 150 real-but-obscure libraries: the kind a small student model is unlikely to know but is likely to bluff about. A judge labels a response as a confident false version claim when the student asserts a specific version of the library without hedging.

For each model family, we compared a plain student (distilled from the teacher without the habit) to a caltrop A student (distilled with the habit in the teacher's system prompt). The plain student already bluffs sometimes. Small models do that. The caltrop student bluffs on nearly every prompt.

Figure 2. Confident false-version rate, plain vs. Caltrop A
Confident false library-version claims on 150 real-obscure-library prompts. Dots are pass@1 rates; whiskers are 95% Wilson confidence intervals. Across five student families, the habit lifts the rate by 19 to 37 percentage points.

The lift is consistent across model families and scales. A habit-shaped instruction in the teacher's system prompt, never shown to the student at evaluation time, converts hedging students into confidently-wrong ones.

6. Surviving the Erasers

Can an attacker scrub the mark before training? We tried four attacks of increasing strength.

  1. Direct. Take the teacher's habit-laden responses as-is and train on them.
  2. Paraphrase. Rewrite every response with another strong LLM (Sonnet 4.6) before training, hoping to scrub the surface pattern.
  3. Filter. Regex-strip the visible "Library version: ..." preamble before training, hoping the student learns clean code instead.
  4. Adaptive filter. Observe what the filtered student looks like, then strip its new pattern too. Two rounds of erasing.

A behavioral detector watches student outputs on held-out prompts and fires when the response has the habit's shape (specific header form, version position, structural markers). Here is how often it fires across attacks:

Figure 3. Habit detection across attack conditions
Habit-shape detection rate on the student's outputs, across attacks on Caltrop A. The visible words disappear under filtering, but the student still opens with a habit-shaped header. The signal lives in the response distribution, not the words.

Paraphrase doesn't erase the habit. On some families it actually strengthens detection, because the paraphraser preserves the habit's function while normalizing its surface. Two rounds of regex filtering remove the visible words, but the student recreates a habit-shaped opener with different vocabulary. The shape detector keeps firing at 94 percent.

7. Why Habits Beat Words

A word-level watermark is a payload sitting in the training data. The attacker controls the training data, so the attacker can edit the payload out.

A learned habit is a behavior the student now exhibits at inference time, on prompts the attacker never scraped. To remove it, the attacker has to either:

  • detect and rewrite every habit-shaped response in the scrape, without knowing which habit was planted;
  • train the student to actively unlearn the pattern, at a cost in model quality and engineering effort;
  • or accept the watermark and ship anyway.

In all three cases, the defender gets something. The attacker reveals what they're filtering, pays a tax in model quality, or leaves the mark in place.

8. Honest Caveats

We measured this in code, where "wrong answer" cleanly maps to pass@1 drops and confident false version claims. Math, multi-hop reasoning, and other domains are unmeasured.

The strongest attack we haven't run is entropy-guided rewriting, which targets high-information tokens specifically. It's a different attack from paraphrase. We expect a learned habit to be harder to erase than a token-level watermark, but we haven't proven it.

Caltrop A's capability damage isn't universal. At 9B scale, the student often knows enough common-library facts for the habit not to dent benchmark scores, even while the false-version probe still lights up. Caltrop C (named algorithm) damages every family we tested. The right caltrop depends on the student you're trying to discourage.

References

  1. Tramèr et al., "Stealing Machine Learning Models via Prediction APIs," USENIX Security, 2016.
  2. Gudibande et al., "The False Promise of Imitating Proprietary Language Models," ICLR, 2024.
  3. Kirchenbauer et al., "A Watermark for Large Language Models," ICML, 2023.
  4. Pan et al., "Can LLM Watermarks Robustly Prevent Unauthorized Knowledge Distillation?" ACL, 2025.
  5. Cheng et al., "Revealing Weaknesses in Text Watermarking through Self-Information Rewrite Attacks," ICML, 2025.