AIChilles Risk Discovery

Automatically uncovering hidden weaknesses in AI-evolved systems.

The problem

AI-evolved programs can hide dangerous failures

Automated discovery systems evolve a program P into a faster P′ that scores well on a fixed workload W. But P′ may hide failures that only surface on workloads no one tested. AIChilles searches for an adversarial workload W′ that makes P′ crash, blow up in time or memory, or quietly degrade solution quality.

Figure: AIChilles problem overview: evolving P into P′, then finding an adversarial workload W′ that breaks P′

How it works

A three-agent, coverage-guided pipeline

AIChilles explores the input space like a fuzzer, but it uses code coverage as the novelty signal and an LLM as the mutator, so the search pursues genuinely new behaviors rather than random noise. Confirmed weaknesses are then clustered by root cause and explained.
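To make the coverage-as-novelty idea concrete, here is a minimal sketch rather than the AIChilles implementation. It assumes line coverage is collected with `sys.settrace`, and it swaps the LLM mutator for a random one so the example stays self-contained. `toy_program`, `line_coverage`, `mutate`, and `novelty_search` are hypothetical names introduced for illustration.

```python
import random
import sys


def toy_program(xs):
    """Hypothetical stand-in for P′ with a branch that few workloads reach."""
    if not xs:
        return 0
    if len(xs) >= 2 and xs[0] > xs[-1]:
        return sum(sorted(xs))  # the rarely exercised path
    return sum(xs)


def line_coverage(func, *args):
    """Run func(*args) and return the set of its executed line offsets."""
    covered = set()
    target = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not target:
            return None  # ignore frames that belong to other functions
        if event == "line":
            covered.add(frame.f_lineno - target.co_firstlineno)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args)
    except Exception:
        pass  # a crash still leaves useful coverage behind
    finally:
        sys.settrace(previous)
    return covered


def mutate(xs, rng):
    """Random stand-in for the LLM mutator (an assumption of this sketch)."""
    ys = list(xs)
    if not ys or rng.random() < 0.5:
        ys.insert(rng.randrange(len(ys) + 1), rng.randint(-9, 9))
    else:
        ys[rng.randrange(len(ys))] = rng.randint(-9, 9)
    return ys


def novelty_search(program, seed, steps=500, rng_seed=0):
    """Keep a mutated workload only if it executes lines not seen before."""
    rng = random.Random(rng_seed)
    seen = set(line_coverage(program, seed))
    corpus = [seed]
    for _ in range(steps):
        candidate = mutate(rng.choice(corpus), rng)
        new_lines = line_coverage(program, candidate) - seen
        if new_lines:
            seen |= new_lines
            corpus.append(candidate)
    return corpus, seen


corpus, seen = novelty_search(toy_program, [])
print(len(corpus), "workloads kept,", len(seen), "lines covered")
```

Only workloads that reach new lines enter the corpus, so later mutations build on inputs that already exercised unusual paths instead of restarting from noise.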

Figure: Three-agent pipeline: workload-space inference, divergence-guided search, weakness dedup and analysis

What it detects

Four weakness types, checked on every oracle call

correctness

P′ crashes, times out, or returns a wrong / invalid result.

scalability_time

P′ is dramatically slower as the workload scales.

scalability_memory

P′ uses dramatically more memory as the workload scales.

optimality

P′ runs fine but produces a measurably worse solution.
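The four checks can be pictured as a single classification step over one workload. The sketch below is illustrative only: it assumes each weakness is judged by comparing P′ against the original P, and the `Run` fields, the `classify` function, and all thresholds are hypothetical. It also reduces "as the workload scales" to a ratio on one workload, which is a simplification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Run:
    """Measurements from one execution (field names are hypothetical)."""
    ok: bool                # False if the run crashed, timed out, or returned an invalid result
    seconds: float          # wall-clock time
    peak_bytes: int         # peak memory use
    score: Optional[float]  # solution quality, higher is better; None without a valid result


def classify(base: Run, evolved: Run,
             time_ratio: float = 10.0,
             memory_ratio: float = 10.0,
             quality_drop: float = 0.05) -> list[str]:
    """Return the weakness types P′ (evolved) shows relative to P (base)."""
    if not evolved.ok:
        return ["correctness"]  # other checks are meaningless without a valid result
    found = []
    if evolved.seconds > time_ratio * base.seconds:
        found.append("scalability_time")
    if evolved.peak_bytes > memory_ratio * base.peak_bytes:
        found.append("scalability_memory")
    if (base.score is not None and evolved.score is not None
            and base.score - evolved.score > quality_drop * abs(base.score)):
        found.append("optimality")
    return found


base = Run(ok=True, seconds=1.0, peak_bytes=100, score=10.0)
print(classify(base, Run(ok=True, seconds=20.0, peak_bytes=5000, score=9.0)))
```

Because all four checks run on the same measurements, one execution of P and P′ per workload is enough to flag any combination of weaknesses.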

Results

Weaknesses found across 5 systems × 6 evolved programs

Each cell counts the weaknesses discovered for one system and one evolved program, broken out by weakness type. The 6 evolved programs per system come from 3 frameworks (AdaEvolve, OpenEvolve, Engram) × 2 LLMs (Claude-Opus-4.6 and GPT-5).

Figure: Heatmap of weakness counts per system, evolved program, and weakness type

Dig into every weakness.

Side-by-side P vs P′ source with the weakness lines highlighted, the triggering workload, and P-vs-P′ regression curves.

Open the explorer →

Highlights

Why it finds what fuzzers miss

🔎 Coverage-guided, LLM-mutated: A MAP-Elites loop uses sys.settrace coverage as the novelty signal and an LLM as the mutator, chasing unseen code paths.
🐞 Four weakness types in one pass: Every oracle call checks correctness, time- and memory-scalability, and optimality simultaneously.
🧬 Auto-inferred input grammar: Reads the app's evaluator.py and writes a generate_workload() sampler, with no hand-written fuzz harness.
🧠 Cross-run knowledge base: Confirmed weakness “seeds” persist per app and warm-start later runs, so discovery compounds.
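As a rough illustration of how a MAP-Elites archive and persisted seeds could fit together, the sketch below keeps one best workload per behavior cell and reloads saved cells on the next run. It is not the AIChilles code: the `Archive` class, the cell key (weakness type plus a size bucket), the fitness value, and the JSON seed file are all assumptions made for the example.

```python
import json
from pathlib import Path


class Archive:
    """Minimal MAP-Elites-style archive: one elite workload per behavior cell."""

    def __init__(self):
        self.cells = {}  # cell key (tuple) -> (fitness, workload)

    def offer(self, cell, fitness, workload):
        """Keep the workload if its cell is empty or it beats the current elite."""
        current = self.cells.get(cell)
        if current is None or fitness > current[0]:
            self.cells[cell] = (fitness, workload)
            return True
        return False

    def save_seeds(self, path):
        """Persist every elite so a later run can start from it."""
        records = [{"cell": list(cell), "fitness": fitness, "workload": workload}
                   for cell, (fitness, workload) in self.cells.items()]
        Path(path).write_text(json.dumps(records))

    @classmethod
    def warm_start(cls, path):
        """Build an archive from a seed file, or an empty one if none exists."""
        archive = cls()
        seed_file = Path(path)
        if seed_file.exists():
            for record in json.loads(seed_file.read_text()):
                archive.offer(tuple(record["cell"]), record["fitness"], record["workload"])
        return archive


archive = Archive.warm_start("seeds_example_app.json")  # hypothetical per-app file
archive.offer(("scalability_time", 3), 12.0, {"n": 1000})
print(sorted(archive.cells))
```

Keying elites by cell keeps the archive diverse, and reloading them lets a later run begin from workloads already known to expose weaknesses.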