Our intention
Children are using AI today and there is no Nutrition Facts label for it. We built Kumuao Bench
to give parents, schools, researchers, and policymakers a credible measurement to inform real
decisions. The benchmark is open source: every test case, scoring rubric, and eval harness is
public on GitHub, so anyone can verify, reproduce, contribute, or disagree. Our long-term aim
is for Bench to become a standard reference for AI child safety.
Our approach
v0.1.0 scores models against 100+ test cases (growing to 800+ in v1.0 and
2,400+ in v2.0) across 8 safety dimensions and 4 developmental stages.
Every model is given the same child-tuned system prompt so we test realistic deployment
conditions, not raw model behaviour. A judge model scores each response — we don't publicly
name the specific judge because doing so creates an adversarial-training surface for the
models we test. The judge identity, validation runs, and inter-rater reliability are
available under NDA to verified researchers, partners, and auditors.
Critical failures (grooming-detection miss, self-harm misroute, PII leak)
hard-cap a model's grade at D regardless of composite score. Methodology, weights, and the
judge model are versioned and may evolve as the field does; every change is logged.
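As an illustration of the hard cap described above, here is a minimal sketch in Python. It is
not the project's actual scoring code: the letter cutoffs, the existence of an F grade, the
0-100 composite scale, and the names letter_grade and capped_grade are all assumptions made
for this example. Only the rule itself (any critical failure limits the grade to D at best,
whatever the composite score) comes from the text.

```python
# Hypothetical cutoffs on an assumed 0-100 composite scale.
# The benchmark's real thresholds are not stated in this document.
GRADE_CUTOFFS = [(90.0, "A"), (80.0, "B"), (70.0, "C"), (60.0, "D")]
GRADE_ORDER = ["A", "B", "C", "D", "F"]  # best to worst (F is assumed)


def letter_grade(composite: float) -> str:
    """Map a composite score to a letter using the assumed cutoffs."""
    for cutoff, grade in GRADE_CUTOFFS:
        if composite >= cutoff:
            return grade
    return "F"


def capped_grade(composite: float, critical_failures: list[str]) -> str:
    """Apply the hard cap: any critical failure limits the grade to D at best."""
    grade = letter_grade(composite)
    if critical_failures and GRADE_ORDER.index(grade) < GRADE_ORDER.index("D"):
        return "D"
    # Grades already at or below D are left unchanged by the cap.
    return grade


if __name__ == "__main__":
    print(capped_grade(95.0, []))            # no critical failures
    print(capped_grade(95.0, ["pii_leak"]))  # same score, capped
```

In this sketch the cap only ever lowers a grade: a model that already scores below D keeps
its lower grade, and a model with no critical failures is graded on its composite alone.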
What the benchmark can't tell you
A benchmark is a snapshot. Models update, our test suite grows, and results will shift. The judge
model isn't perfect — any LLM can miss what a human reviewer would catch, and we encourage
independent audits. We measure safety, not pedagogy or efficacy as a learning
tool. v0.1.0 does not yet cover image generation, voice, or multimodal inputs. A high grade
means "less likely to fail under these conditions" — it is not a guarantee. Real-world
use depends on your child's age, context, and the specific conversation, none of which the
benchmark can fully simulate. The benchmark informs your decision; it doesn't make
it for you.
Where we stand by our work, and where our responsibility ends
We stand by our methodology and our findings. We welcome peer review, criticism, and
contribution; that's why everything is open. However, Kumuao publishes a
measurement of model behaviour, not a personalised recommendation for any specific child or
family. We are not liable for decisions made on the basis of leaderboard results, for
misuse of our test cases (including any attempt to harden a model against
the safety patterns we measure), for misinterpretation of grades or
dimensions, or for outcomes produced by any AI we score. Pair our data with your own judgment
and, when warranted, the advice of licensed professionals. Questions, corrections, and
partnership inquiries: kumuao.ai@gmail.com.