μInference

From Minimal Stack to SL5 Weight Enclave

Samruddhi Rahate rahatesamruddhi7[at]gmail[dot]com
Vansh Agrawal vansh_a[at]ph[dot]iitr[dot]ac[dot]in
Godric Aceves ricardo[dot]godric[at]hotmail[dot]com
Luis Cosio luisalfonsocosioizcapa[at]gmail[dot]com
Independent Researchers

Abstract

This report investigates whether frontier-class AI inference can run on a radically minimized software stack while pursuing Security Level 5 (SL5)-style protections for model weights against nation-state adversaries. We began with μInference v0, a security-by-simplification experiment that measured how much of a conventional inference stack can be removed while still producing useful inference. As SL5 guidance evolved to emphasize Weight Enclaves, low-bandwidth boundaries, and accelerator-centric machine security, we expanded the project into an end-to-end Weight Enclave prototype.

Our current design anchors isolation in the formally verified seL4 microkernel and uses a CAmkES-based virtual machine to host a minimal Linux guest built with Buildroot. Inside the guest, μInference runs the Rust Candle inference engine on an NVIDIA GPU. A bandwidth-limited egress proxy and an allow-by-exception policy around weight access reduce the feasibility of bulk model exfiltration.

TL;DR

We built a prototype that runs LLM inference inside a formally verified microkernel enclave with a throttled egress path. The goal is to make bulk model weight exfiltration operationally infeasible, even against nation-state attackers. The stack is small, the boundary is tight, and the main remaining bottleneck is the GPU driver surface area.

1. The Problem

Security Level 5 (SL5) describes a defensive posture intended to plausibly withstand top-tier nation-state cyber operations. For frontier AI systems, the core asset is the model weights: if weights are stolen, an adversary can duplicate capabilities, accelerate automated AI R&D, or shift strategic balances.

Protecting weights at this level requires more than perimeter controls. The trusted computing base (TCB) near weight access must be small, auditable, and strongly isolated. In conventional GPU inference stacks, weight access sits behind a long chain of software layers: host OS, drivers, container runtimes, orchestration, and full ML frameworks. This chain expands the attack surface and makes deep assurance difficult.

Figure 1. Attack surface comparison
Compiled lines of code near weight access in a baseline GPU container stack vs. the μInference enclave.

μInference began with a narrow question: what is the smallest software stack that can still run useful inference while meaningfully reducing the weight-touching attack surface?

2. Phase 1: Radical Minimization

We prototyped three approaches: microkernel isolation with a minimal guest, unikernel-style designs, and near bare-metal boot-to-inference flows. Each prototype followed the same method:

  • Measure compiled lines of code (LOC) and reachable symbols, not just repository size.
  • Generate a dependency bill of materials tagged by proximity to weight access.
  • Remove or disable subsystems not required for inference: container runtimes, orchestration, extra filesystems, unused networking.
  • Track latency and throughput to confirm inference remained usable.
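The dependency bill-of-materials step above can be sketched as a classification pass over the dependency list. The proximity tiers and keyword rules below are illustrative assumptions for exposition, not the project's actual tagging criteria (which derive from link maps and reachable-symbol analysis):

```rust
use std::collections::BTreeMap;

/// Hypothetical proximity tiers for a dependency relative to weight access.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Proximity {
    TouchesWeights, // decrypts, loads, or transfers weights
    SameProcess,    // linked into the inference process
    SameGuest,      // elsewhere in the guest image
}

/// Illustrative name-based rules only; a real BOM would be built from
/// symbol reachability, not crate names.
fn tag(dep: &str) -> Proximity {
    match dep {
        d if d.contains("candle") || d.contains("cuda") => Proximity::TouchesWeights,
        d if d.contains("tokenizer") || d.contains("serde") => Proximity::SameProcess,
        _ => Proximity::SameGuest,
    }
}

fn main() {
    // Hypothetical dependency list for a minimal guest image.
    let deps = ["candle-core", "cudarc", "tokenizers", "serde_json", "busybox"];
    let mut histogram: BTreeMap<&str, usize> = BTreeMap::new();
    for d in deps {
        let label = match tag(d) {
            Proximity::TouchesWeights => "touches-weights",
            Proximity::SameProcess => "same-process",
            Proximity::SameGuest => "same-guest",
        };
        *histogram.entry(label).or_default() += 1;
    }
    // Summarize how much of the BOM sits closest to the weights.
    for (label, n) in &histogram {
        println!("{label}: {n}");
    }
}
```

Sorting the BOM this way makes the minimization target explicit: the "touches-weights" bucket is the part that most needs shrinking and hardening.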

This phase validated that substantial portions of the conventional ML infrastructure can be removed without eliminating core inference functionality. The minimized build measured approximately 1.1 million compiled LOC, supporting the basic thesis of security by simplification.

3. Phase 2: Mapping to SL5 Architecture

As SL5 guidance became more explicit, we re-anchored μInference to controls especially relevant to weight protection:

  • Weight Enclaves: Confine direct weight access to a tightly controlled environment with allow-by-exception execution.
  • Network architecture: Impose low-bandwidth boundaries so that bulk exfiltration is impractical even if monitoring fails.
  • Machine security: Treat accelerators as security-relevant domains, aiming for device-controlled memory isolation and hardware roots of trust.
  • Supply chain minimization: Radically reduce software near critical actions and progressively harden what remains.

We mapped our earlier prototypes onto this architecture and identified missing pieces: a concrete enclave boundary for GPU inference and an explicit bandwidth-limited egress path.

4. Phase 3: seL4-Backed Weight Enclave

To align with SL5 goals for strong isolation and high assurance near weight access, we implemented the current prototype:

Figure 2. Enclave architecture
The seL4 microkernel anchors isolation. A CAmkES VM hosts a minimal Linux guest running the Candle inference engine with GPU pass-through.

Isolation anchor: The seL4 microkernel, with a formal verification foundation and a small TCB relative to commodity kernels.

Virtualization boundary: A CAmkES-based VM hosting a minimal Linux guest built with Buildroot, containing only the packages required for inference.

Inference runtime: μInference running the Rust Candle inference engine, with optional comparative runs using llama.cpp.

Egress control: A single narrow RPC channel for prompts and responses, plus a host-side proxy that throttles output to a few kilobytes per second of text.
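One way to enforce such a response-side cap is simple pacing: track total bytes released and delay each chunk until the long-run rate stays under the limit. This is a minimal sketch of that scheme, not the proxy's actual implementation, which this report does not specify:

```rust
use std::time::{Duration, Instant};

/// Sketch of a pacing throttle: responses leave the boundary in small
/// chunks, and the cumulative rate never exceeds `bytes_per_sec`.
struct Throttle {
    bytes_per_sec: u64,
    sent: u64,
    started: Instant,
}

impl Throttle {
    fn new(bytes_per_sec: u64) -> Self {
        Self { bytes_per_sec, sent: 0, started: Instant::now() }
    }

    /// How long the proxy must wait before `chunk` bytes may be released.
    fn delay_for(&mut self, chunk: u64) -> Duration {
        self.sent += chunk;
        // Earliest wall-clock offset at which `sent` total bytes are allowed out.
        let earliest =
            Duration::from_secs_f64(self.sent as f64 / self.bytes_per_sec as f64);
        earliest.saturating_sub(self.started.elapsed())
    }
}

fn main() {
    // A 2 KB/s cap with 512-byte chunks: the fourth chunk cannot
    // complete before one full second has elapsed.
    let mut t = Throttle::new(2048);
    for _ in 0..4 {
        let wait = t.delay_for(512);
        std::thread::sleep(wait);
    }
    println!("elapsed >= 1s: {}", t.started.elapsed() >= Duration::from_secs(1));
}
```

Because the pacing is cumulative rather than per-chunk, an attacker cannot beat the cap by requesting many small responses in a burst.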

Weight handling: Encrypted model artifacts are staged into the enclave. Decryption and weight-to-GPU transfers happen only inside the guest under an allow-by-exception policy.
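The allow-by-exception idea amounts to a default-deny check in front of every weight-touching operation: nothing proceeds unless it is explicitly granted. The operation names and policy shape below are hypothetical, chosen to illustrate the pattern rather than mirror the prototype's code:

```rust
use std::collections::HashSet;

/// Hypothetical weight-touching operations inside the enclave guest.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum WeightOp {
    DecryptShard { shard: u32 },
    UploadToGpu { shard: u32 },
    ExportPlaintext,
}

/// Default-deny policy: an operation succeeds only if it appears on the
/// allowlist established for the current session.
struct WeightPolicy {
    allowed: HashSet<WeightOp>,
}

impl WeightPolicy {
    fn check(&self, op: &WeightOp) -> Result<(), &'static str> {
        if self.allowed.contains(op) {
            Ok(())
        } else {
            Err("denied: operation not on allowlist")
        }
    }
}

fn main() {
    let mut allowed = HashSet::new();
    allowed.insert(WeightOp::DecryptShard { shard: 0 });
    allowed.insert(WeightOp::UploadToGpu { shard: 0 });
    let policy = WeightPolicy { allowed };

    // Granted: staging shard 0 onto the GPU.
    assert!(policy.check(&WeightOp::UploadToGpu { shard: 0 }).is_ok());
    // Denied by default: anything not explicitly granted, including bulk export.
    assert!(policy.check(&WeightOp::ExportPlaintext).is_err());
    println!("policy checks passed");
}
```

The key property is that new capabilities require an explicit exception, so the reviewable policy, not the surrounding code, defines what can touch plaintext weights.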

5. Results

We compared the prototype against a baseline GPU inference stack (a full Ubuntu container with a standard PyTorch runtime) using reproducible indicators:

LOC and dependencies: The enclave configuration yields a smaller footprint near weight access and fewer transitive dependencies. Build artifacts were measured with cloc and dependency manifests.

Service surface: Port scans and socket listings confirmed the enclave exposes only the intended RPC endpoint. Baseline stacks expose additional services and daemons by default.

Robustness: Lightweight fuzzing and negative testing of the RPC protocol showed a smaller, simpler attack surface with fewer reachable code paths than a full-featured baseline deployment.

Figure 3. Dependency and service comparison
Transitive dependencies and exposed services in baseline vs. enclave configurations.

6. Bandwidth-Limited Egress

The host-side proxy forwards prompts to the enclave at full speed but caps the response stream to a fixed rate. Under a cap of roughly 1 to 5 KB/s of text, the exfiltration economics become prohibitive:

Figure 4. Exfiltration time under bandwidth caps
Time required to exfiltrate model weights at different sizes and bandwidth caps. Multi-terabyte frontier models would take decades.
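The arithmetic behind the figure is easy to reproduce: exfiltration time is weight size divided by the capped rate. A sketch (illustrative sizes; 1 TB taken as 10^12 bytes, 1 KB as 1000 bytes):

```rust
/// Back-of-envelope exfiltration time in years: bytes / rate.
fn exfil_years(weight_bytes: f64, cap_bytes_per_sec: f64) -> f64 {
    let secs = weight_bytes / cap_bytes_per_sec;
    secs / (365.25 * 24.0 * 3600.0)
}

fn main() {
    // A hypothetical 2 TB weight set through a 5 KB/s cap:
    println!("{:.1} years", exfil_years(2.0e12, 5.0e3)); // prints "12.7 years"
    // The same weights through a 1 KB/s cap:
    println!("{:.1} years", exfil_years(2.0e12, 1.0e3)); // prints "63.4 years"
}
```

Even at the permissive end of the 1 to 5 KB/s range, moving multi-terabyte weights through the text channel takes on the order of decades, consistent with the figure.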

This matches the intended SL5 defense: even if an attacker reaches the prompt path, the boundary makes bulk theft operationally infeasible unless the throttling mechanism itself can be bypassed or reconfigured.

7. The GPU Bottleneck

The major limitation in our current architecture is GPU support. SL5-style machine-security goals assume accelerators can act as independent security domains with device-controlled memory isolation, attestation, and confidential-compute features. In practice, today's high-performance NVIDIA GPU stack depends on a large Linux driver environment, and seL4 does not currently provide direct, production-grade GPU driver support.

As a result, we rely on a Linux guest plus pass-through GPU access, which expands the weight-touching TCB and partially offsets minimization gains. This gap is not unique to μInference; it reflects a broader mismatch between ambitious enclave threat models and the realities of modern accelerator ecosystems.

Closing it likely requires stronger GPU confidential-compute primitives, better hardware attestation and key handling, and more direct support for accelerators in high-assurance OS environments.

8. Limitations and Future Work

  • GPU TCB bloat: The Linux guest kernel and GPU drivers remain in the weight-touching trust boundary for competitive performance.
  • Scope: We focused on inference rather than end-to-end training pipelines and data governance.
  • Verification: Beyond seL4, most components are not formally verified.
  • Testing depth: Our evaluation emphasized reproducible, lightweight indicators rather than continuous red-teaming.

Future work should focus on reducing the GPU-adjacent TCB, adopting stronger accelerator confidential-compute features, and continuing progressive hardening of the components closest to weight access. DARPA's TRACTOR program and broader memory-safety guidance reinforce the direction of migrating legacy C code into Rust, which can shrink the set of high-risk components requiring intensive manual review.

References

  1. Aschenbrenner, "Situational Awareness: The Decade Ahead," June 2024. situational-awareness.ai
  2. Shavit, "What does it take to secure a frontier AI lab? A Sprint Toward SL5," Institute for Progress, 2023. ifp.org
  3. DARPA, "Translating All C to Rust (TRACTOR)," 2023. darpa.mil
  4. Trustworthy Systems, "CAmkES: Component Architecture for Microkernel-based Embedded Systems," seL4 Docs, 2024. docs.sel4.systems
  5. Klein et al., "seL4: Formal Verification of an OS Kernel," SOSP, 2009.
  6. Hugging Face, "Candle: A minimalist ML framework for Rust," 2023. github.com/huggingface/candle
  7. Buildroot Developers, "Buildroot: Making Embedded Linux Easy." buildroot.org
  8. Artificial Analysis, "AI Model Performance and Comparison," 2024. artificialanalysis.ai