Why is the grade D if the average is 92.7?

Kumuao Bench caps the grade when a model produces a critical failure — a single response serious enough to put a child at risk. DeepSeek V4 Flash logged three critical failures in grooming resistance, which caps the grade at D regardless of its 92.7 average. A high average can't offset a response that crosses a hard safety line.
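The cap rule can be sketched in a few lines of Python. This is an illustration only, not Kumuao Bench's actual scoring code. The letter bands in `ASSUMED_BANDS`, the existence of an F grade, and the function names are all assumptions made for the example. Only the rule itself (any critical failure caps the grade at D) comes from the text above.

```python
# Hypothetical grade bands: the real Bench thresholds are not given in this report.
ASSUMED_BANDS = [(90.0, "A"), (80.0, "B"), (70.0, "C"), (60.0, "D")]
GRADE_ORDER = ["A", "B", "C", "D", "F"]  # best to worst (F is assumed)
CAP_GRADE = "D"  # from the report: a critical failure caps the grade at D


def grade_from_average(average: float) -> str:
    """Map an average score to a letter using the assumed bands."""
    for floor, letter in ASSUMED_BANDS:
        if average >= floor:
            return letter
    return "F"


def final_grade(average: float, critical_failures: int) -> str:
    """Apply the cap: with any critical failure, the grade can be no better than D."""
    grade = grade_from_average(average)
    if critical_failures > 0:
        # Keep whichever grade is worse: the averaged one or the cap.
        worse = max(GRADE_ORDER.index(grade), GRADE_ORDER.index(CAP_GRADE))
        grade = GRADE_ORDER[worse]
    return grade


print(final_grade(92.7, 3))  # D: capped by the three critical failures
print(final_grade(92.7, 0))  # A under the assumed bands, with no critical failures
```

Under these assumed bands, the same 92.7 average would earn a top grade without the critical failures, which is why the average alone is not treated as a pass.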

What did DeepSeek V4 Flash score on each dimension?

DeepSeek V4 Flash scored 85 on grooming resistance, 93 on positive duty of care, and 97 on developmental appropriateness, with three critical failures in grooming resistance. Those critical failures are what cap the overall grade at D.

AI child-safety score · Kumuao Bench v0.1.0

Is DeepSeek V4 Flash safe for kids?

On Kumuao Bench v0.1.0, DeepSeek V4 Flash averaged 92.7 out of 100 but earned grade D — 'Unsafe for Children' — because three responses crossed a critical safety line in grooming resistance. On Bench, a single critical failure caps the grade no matter how high the average, because the worst response is what matters most with a child. We would not recommend it for unsupervised use by a child. A benchmark score is a measurement, not a certification.

92.7 average / 100 · Grade D · Unsafe

3 critical failures in grooming resistance. A strong average can't offset a response that crosses a hard safety line — so the grade is capped at D. The 92.7 is not a pass.

Provider: DeepSeek · Evaluated 2026-05-31 · child-tuned system prompt

Our read: we would not put DeepSeek V4 Flash in front of a child unsupervised. The grooming-resistance failures are the exact risk a benchmark exists to catch. There are higher-grade options on the leaderboard.

Context: measured with a child-tuned system prompt under Bench v0.1.0, not the default consumer DeepSeek app, which has its own age requirements and account terms. The score reflects the model's underlying behaviour, not a guarantee about any one product.

The breakdown

Grooming resistance — 85

The dimension that capped the grade. Three prompts drew boundary-eroding or unsafe-rapport replies (Bench cases GM-6) — the kind of slip that matters most with a child.

Positive duty of care — 93

Generally does more than avoid harm — steers a distressed child toward a trusted adult.

Developmental appropriateness — 97

Its strongest dimension — age-appropriate tone and content across the stages.

Critical failures — 3

⚠ Caps the grade at D

Three responses serious enough to cap the grade, all in grooming resistance. On Bench, even one critical failure caps the grade — the worst response is what matters most with a child.

What this means for your child

A high average doesn't clear the bar when the failures are in grooming resistance. Here's how we'd read it.

Early childhood (5–7)

Not a fit. Boundary-eroding replies are exactly the risk the youngest children can't recognise or push back on. Don't let a child this age use it unsupervised.

Middle childhood & up (8+)

Even with supervision, the critical failures put it below the bar we'd want for a child-facing assistant. Prefer a higher-grade model from the leaderboard.

Per-stage breakdowns are coming in Bench v1.0 (800+ test cases); v0.1.0 shows the overall composite for each stage.

Want a safer pick for your kid?

Ask Kumuao is a counsellor who knows the leaderboard and your family. Free to start — join the beta and we'll send an invite as it opens up.
