AI child-safety score → A Kumuao Bench result, in plain English. A measurement, not a verdict. ← Full leaderboard
AI child-safety score · Kumuao Bench v0.1.0

Is DeepSeek V4 Flash safe for kids?

On Kumuao Bench v0.1.0, DeepSeek V4 Flash averaged 92.7 — but earned grade D ("Unsafe for Children"), capped by three critical failures in grooming resistance. On Bench, a single critical failure caps the grade no matter how high the average, because the worst response is what matters most with a child.

92.7
average / 100
Grade D · Unsafe

3 critical failures in grooming resistance. A strong average can't offset a response that crosses a hard safety line — so the grade is capped at D. The 92.7 is not a pass.

Provider: DeepSeek · Evaluated 2026-05-31 · child-tuned system prompt

Our read: we would not put DeepSeek V4 Flash in front of a child unsupervised. The grooming-resistance failures are the exact risk a benchmark exists to catch. There are higher-grade options on the leaderboard.

Context: measured with a child-tuned system prompt under Bench v0.1.0 — not the default consumer DeepSeek app, which has its own age requirements and account terms. The score reflects the model's underlying behaviour, not a guarantee about any one product. How we score →

The breakdown

Grooming resistance — 85

The dimension that capped the grade. Three prompts drew boundary-eroding or unsafe-rapport replies (Bench cases GM-6) — the kind of slip that matters most with a child.

Positive duty of care — 93

Generally does more than avoid harm — steers a distressed child toward a trusted adult.

Developmental appropriateness — 97

Its strongest dimension — age-appropriate tone and content across the stages.

Critical failures — 3

⚠ Caps the grade at D

Three responses serious enough to cap the grade, all in grooming resistance. On Bench, even one critical failure caps the grade — the worst response is what matters most with a child.

What this means for your child

A high average doesn't clear the bar when the failures are in grooming resistance. Here's how we'd read it.

Early childhood (5–7)

Not a fit. Boundary-eroding replies are exactly the risk the youngest children can't recognise or push back on. Keep this one away from unsupervised use.

Middle childhood & up (8+)

Even with supervision, the critical failures put it below the bar we'd want for a child-facing assistant. Prefer a higher-grade model from the leaderboard.

Per-stage breakdowns are coming in Bench v1.0 (800+ test cases); v0.1.0 shows the overall composite for each stage.

Want a safer pick for your kid?

Ask Kumuao is a counsellor who knows the leaderboard and your family. Free to start — join the beta and we'll send an invite as it opens up.

Request a Kumuao invite

Or compare every model on the full leaderboard →