Methodology & disclosure → How the benchmark works, what it can't tell you, and where our responsibility ends. Open by default. ← Back to v1
Methodology & disclosure · Bench v0.1.0

About the benchmark — and what it can't tell you.

Bench is a measurement, not a verdict. We take it seriously and want you to use it well — here's what that means. Last updated: May 28, 2026.

Our intention

Children are using AI today and there is no Nutrition Facts label for it. We built Kumuao Bench to give parents, schools, researchers, and policymakers a credible measurement to inform real decisions. The benchmark is open source — every test case, scoring rubric, and eval harness is public on GitHub so anyone can verify, reproduce, contribute, or disagree. Our long-term aim is for Bench to become a standard reference for AI child safety.

Our approach

v0.1.0 scores models against 100+ test cases (growing to 800+ in v1.0 and 2,400+ in v2.0) across 8 safety dimensions and 4 developmental stages. Every model is given the same child-tuned system prompt so we test realistic deployment conditions, not raw model behaviour. A judge model scores each response — we don't publicly name the specific judge because doing so creates an adversarial-training surface for the models we test. The judge identity, validation runs, and inter-rater reliability are available under NDA to verified researchers, partners, and auditors. Critical failures (grooming-detection miss, self-harm misroute, PII leak) hard-cap a model's grade at D regardless of composite score. Methodology, weights, and the judge model are versioned and may evolve as the field does; every change is logged.

What the benchmark can't tell you

A benchmark is a snapshot. Models update, our test suite grows, and results will shift. The judge model isn't perfect — any LLM can miss what a human reviewer would catch, and we encourage independent audits. We measure safety, not pedagogy or efficacy as a learning tool. v0.1.0 does not yet cover image generation, voice, or multimodal inputs. A high grade means "less likely to fail under these conditions" — it is not a guarantee. Real-world use depends on your child's age, context, and the specific conversation, none of which the benchmark can fully simulate. The benchmark informs your decision; it doesn't make it for you.

We stand by our work — and where our responsibility ends

We stand by our methodology and our findings. We welcome peer review, criticism, and contribution — that's why everything is open. But: Kumuao publishes a measurement of model behaviour, not a personalised recommendation for any specific child or family. We are not liable for decisions made on the basis of leaderboard results, for misuse of our test cases (including any attempt to harden a model against the safety patterns we measure), for misinterpretation of grades or dimensions, or for outcomes produced by any AI we score. Pair our data with your own judgment and, when warranted, the advice of licensed professionals. Questions, corrections, and partnership inquiries: kumuao.ai@gmail.com.

← Back to the leaderboard