AI child-safety scores · Kumuao Bench v0.1.0

Which AI is safe for your child?

Kumuao Bench scores AI models on what actually matters for kids — grooming resistance, duty of care, developmental fit, and critical failures. Each model gets a plain-English report. A score is a measurement, not a certification; adult supervision always matters.

GPT-5.1

Top of the board — zero critical failures. Among the strongest results we've measured.

93.6 · A

Claude Sonnet 4.6

Grade A, zero critical failures. Near-perfect grooming resistance.

92.5 · A

DeepSeek V4 Flash

Strong average, but grade D — three critical grooming failures cap it. Not recommended unsupervised.

92.7 · D

DeepSeek R1 Qwen3 8B

Grade D — multiple critical failures. Full report pending re-eval on the current suite.

83.6 · D

Qwen 3.6 35B

Grade D — multiple critical failures. Full report pending re-eval on the current suite.

80.3 · D

See the full leaderboard with every dimension →

How to read these scores

A high average isn't a pass

A single critical failure — a response serious enough to put a child at risk — caps the grade at D regardless of the average. The worst response is what matters most with a child, so DeepSeek V4 Flash's 92.7 still earns a D.
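For readers who think in code, the cap can be sketched roughly as follows. This is a minimal illustration, not Kumuao Bench's actual scoring code: the only rule taken from the text is that any critical failure caps the grade at D. The letter bands in `uncapped_grade`, the existence of an F grade, and the names `ModelResult` and `final_grade` are all assumptions made for the example.

```python
from dataclasses import dataclass

# Assumed grade scale, best to worst. Only the cap at D comes from the text.
GRADE_ORDER = ["A", "B", "C", "D", "F"]


@dataclass
class ModelResult:
    name: str
    average: float          # average score on a 0-100 scale
    critical_failures: int  # number of responses judged critical failures


def uncapped_grade(average: float) -> str:
    """Map an average to a letter using hypothetical bands (not Kumuao's)."""
    if average >= 90:
        return "A"
    if average >= 80:
        return "B"
    if average >= 70:
        return "C"
    if average >= 60:
        return "D"
    return "F"


def final_grade(result: ModelResult) -> str:
    """Apply the cap: one or more critical failures means no better than D."""
    grade = uncapped_grade(result.average)
    if result.critical_failures > 0:
        # Keep whichever is worse: the banded grade or the D cap.
        return max(grade, "D", key=GRADE_ORDER.index)
    return grade


for r in [
    ModelResult("GPT-5.1", 93.6, 0),
    ModelResult("DeepSeek V4 Flash", 92.7, 3),
]:
    print(r.name, r.average, final_grade(r))
```

Under these assumed bands, GPT-5.1 keeps its A while DeepSeek V4 Flash drops from an A-range average to a D, which matches the grades shown above. The cap is applied after the average is banded, so a higher average cannot lift a model back above D.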

It's the model, not the app

Scores are measured on the model itself, using a child-tuned system prompt under Bench v0.1.0, not on the default consumer apps, which have their own age limits and terms. Treat a score as a measure of the model's underlying behaviour, not a guarantee about any one product.

Every test case and scoring rubric is open. Read the full methodology →

Not sure what these scores mean for your kid?

Ask Kumuao is a counsellor who knows the leaderboard and your family. Free to start — join the beta and we'll send an invite as it opens up.

Request a Kumuao invite

Or explore the full leaderboard →