Primer

Why soft labels.

Some tasks do not have one honest label. Delta Evals preserves disagreement as training and evaluation signal instead of forcing false certainty. This page explains what soft labels are, when they outperform a single adjudicated label, and the schema we deliver them in.

An ~8-minute read for ML / data leads.

1. The case against single labels.

Single labels work when an item has a single answer that all qualified experts would converge on if they had enough time. For a meaningful share of the data that frontier models are trained on and evaluated against, that condition fails. Severity rating, policy violation calls, code-review judgments, design-feasibility calls, medical triage, safety classifications, preference comparisons — all of these admit multiple defensible answers from qualified experts.

Forcing a single label on items like that produces three quiet failures:

  • Information loss. The model is taught a sharp answer where the truth is fuzzy. Disagreement is destroyed at labeling time and cannot be recovered downstream.
  • Optimistic accuracy. Eval scores look better than they are because the held-out labels were chosen by the same adjudication process as the training labels, so both are anchored to the same reviewer's bias.
  • Miscalibrated reward signal. Reward and preference models, including those used in RLHF, learn to output confidence shapes that match their labels. If labels are artificially confident, models become artificially confident.

2. When soft labels beat adjudication.

Adjudication — sending disputed items to a senior reviewer who picks the “real” answer — works for items where the disagreement is a labeling error: missing context, misread instructions, or a reviewer mistake. It is the wrong tool when the disagreement is honest.

For honestly ambiguous items, soft labels beat adjudication when:

  • The item has more than one defensible answer under the rubric.
  • Adjudication would simply pick the senior reviewer's prior, not a more correct answer.
  • You care about model calibration, not just accuracy.
  • You are training reward models, preference models, or safety classifiers that benefit from a soft target.
  • You are evaluating an LLM panel's distribution against a human distribution, not point estimates.
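
One way to make the last point concrete is to score a model panel's label distribution against the human distribution with a divergence measure. The sketch below uses Jensen-Shannon divergence in bits. The choice of metric, the function names, and the example numbers are illustrative assumptions, not part of the Delta Evals deliverable.

```python
import math

def kl_bits(p, q):
    """KL divergence D(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence_bits(p, q):
    """Jensen-Shannon divergence in bits: symmetric, bounded between 0 and 1."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m)

# Hypothetical three-class item: human panel vs. two model panels.
human = [0.25, 0.50, 0.25]
llm_point = [0.0, 1.0, 0.0]     # all mass on the human mode
llm_soft = [0.20, 0.55, 0.25]   # close to the human shape

point_gap = js_divergence_bits(human, llm_point)
soft_gap = js_divergence_bits(human, llm_soft)
print(f"point estimate: {point_gap:.3f} bits, soft panel: {soft_gap:.3f} bits")
```

Both model panels agree with the human mode, so a point-estimate comparison scores them the same. The divergence separates them.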

3. What soft labels are useful for.

Soft labels are not just “multiple opinions.” Treated correctly, they are a higher-resolution view of the truth. They contribute directly to:

  • Reward modeling. A target distribution lets the reward model learn the shape of human preference, not just its mode.
  • Evaluations. Under a soft-label eval, a model that is right with appropriate uncertainty can score higher than a model that is right but overconfident. Hard-label accuracy cannot tell the two apart.
  • Uncertainty modeling. When the model knows that an item has a 60/30/10 human distribution, it learns when to defer rather than commit.
  • Safety. Severity calls, policy-violation classes, and refusal decisions are exactly the place where 1-of-N hard labels are weakest. Soft labels carry the gradient between “clearly fine” and “clearly not.”
  • Model calibration. Soft targets are a direct supervisory signal for confidence calibration; hard labels are not.
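
The reward-modeling and calibration points can be illustrated with cross-entropy against a soft target. This is a minimal sketch in plain Python. The 60/30/10 target comes from the example above, while the logits and function names are assumptions chosen for illustration, and a real training loop would use a framework's loss function instead.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, logits):
    """Cross-entropy (in nats) between a target distribution and softmax(logits)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

soft_target = [0.6, 0.3, 0.1]   # human distribution over three classes
hard_target = [1.0, 0.0, 0.0]   # same item collapsed to its mode

confident_logits = [8.0, 0.0, 0.0]                       # near-certain prediction
matched_logits = [math.log(p) for p in soft_target]      # softmax reproduces 60/30/10

for name, target in [("soft", soft_target), ("hard", hard_target)]:
    print(name,
          "confident:", round(cross_entropy(target, confident_logits), 3),
          "matched:", round(cross_entropy(target, matched_logits), 3))
```

With the hard target, the near-certain prediction gets the lower loss. With the soft target, the prediction that matches the human distribution does, which is the supervisory signal for calibration described above.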

4. Worked example: 4 experts split 2–1–1.

Suppose four qualified clinical reviewers grade an MRI scan for tumor severity on a three-class scale (mild / moderate / severe). They split:

  • Reviewer A: moderate — cites lesion size threshold
  • Reviewer B: moderate — cites the same threshold plus location
  • Reviewer C: severe — cites edema pattern not visible in the standard view
  • Reviewer D: mild — cites that the lesion is sub-threshold under the strict reading

A majority-vote pipeline returns “moderate” and discards everything else. Adjudication might or might not return the same answer, but it produces no information about how confident a model should be on items like this. The soft-label delivery looks like this instead:

{
  "item_id": "mri-2031",
  "task": "tumor_severity_grade",
  "distribution": { "mild": 0.25, "moderate": 0.50, "severe": 0.25 },
  "reviewer_count": 4,
  "entropy_bits": 1.50,
  "disagreement_class": "major",
  "rationales": [
    { "reviewer": "A", "label": "moderate", "evidence": "lesion 18mm > 15mm threshold" },
    { "reviewer": "B", "label": "moderate", "evidence": "size + parietal location" },
    { "reviewer": "C", "label": "severe",   "evidence": "edema visible in T2-FLAIR" },
    { "reviewer": "D", "label": "mild",     "evidence": "strict reading sub-threshold" }
  ],
  "adjudicated_label": null,
  "adjudication_reason": "honest disagreement; not collapsed by default"
}

The model now has the information it needs to learn that this item is genuinely split, that reviewer C found a feature the others may not have seen, and that the correct shape of the answer is a distribution, not a delta function on “moderate.”
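
The distribution and entropy_bits values in the record follow directly from the four votes. A minimal sketch of that arithmetic, with function names that are illustrative rather than part of any Delta Evals tooling:

```python
import math
from collections import Counter

def soft_label(votes, classes):
    """Turn a list of reviewer votes into a label -> probability map."""
    counts = Counter(votes)
    n = len(votes)
    return {c: counts.get(c, 0) / n for c in classes}

def entropy_bits(distribution):
    """Shannon entropy in bits; zero-mass labels contribute nothing."""
    return -sum(p * math.log2(p) for p in distribution.values() if p > 0)

classes = ["mild", "moderate", "severe"]
votes = ["moderate", "moderate", "severe", "mild"]  # reviewers A, B, C, D

dist = soft_label(votes, classes)
h = entropy_bits(dist)
majority = Counter(votes).most_common(1)[0][0]

print(dist)      # {'mild': 0.25, 'moderate': 0.5, 'severe': 0.25}
print(h)         # 1.5
print(majority)  # moderate
```

Majority vote keeps only the last line of that output. The soft label keeps all three.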

5. Output schema.

Every soft-label item ships with a fixed schema. The fields below are the contractual shape of a Delta Evals soft-label JSONL file:

  • item_id — stable identifier for the item
  • task — task name (matches the rubric YAML)
  • distribution — map of label → probability mass, summing to 1.0
  • reviewer_count — number of independent expert reviewers
  • entropy_bits — Shannon entropy of the distribution
  • confidence — per-reviewer self-reported confidence on a calibrated scale
  • disagreement_class — one of agreement / minor / major / unresolved
  • rationales — per-reviewer evidence and reasoning
  • adjudicated_label — non-null only when adjudication was applied (rare; see §7)
  • adjudication_reason — documents why the panel was or was not collapsed

Each JSONL file ships alongside a rubric YAML (per-criterion anchors and severity definitions) and a QA PDF that summarizes inter-reviewer agreement, gold-item pass rates, and any items flagged for re-review.
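
A consumer can sanity-check each record against the field list above before training on it. The sketch below is one possible check, not a Delta Evals tool. The tolerances are assumptions, and so is the shape of the confidence field, shown here as a reviewer-to-score map because the list above does not specify its layout.

```python
import math

REQUIRED_FIELDS = {
    "item_id", "task", "distribution", "reviewer_count", "entropy_bits",
    "confidence", "disagreement_class", "rationales",
    "adjudicated_label", "adjudication_reason",
}
DISAGREEMENT_CLASSES = {"agreement", "minor", "major", "unresolved"}

def validate_record(record, sum_tol=1e-6, entropy_tol=0.01):
    """Return a list of problems found in one soft-label record (empty if none)."""
    problems = []
    missing = REQUIRED_FIELDS - set(record)
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    dist = record.get("distribution", {})
    if any(p < 0 for p in dist.values()):
        problems.append("negative probability mass")
    if abs(sum(dist.values()) - 1.0) > sum_tol:
        problems.append("distribution does not sum to 1.0")
    expected = -sum(p * math.log2(p) for p in dist.values() if p > 0)
    if "entropy_bits" in record and abs(record["entropy_bits"] - expected) > entropy_tol:
        problems.append("entropy_bits does not match distribution")
    if record.get("disagreement_class") not in DISAGREEMENT_CLASSES:
        problems.append("unknown disagreement_class")
    return problems

record = {
    "item_id": "mri-2031",
    "task": "tumor_severity_grade",
    "distribution": {"mild": 0.25, "moderate": 0.50, "severe": 0.25},
    "reviewer_count": 4,
    "entropy_bits": 1.50,
    "confidence": {"A": 0.8, "B": 0.7, "C": 0.9, "D": 0.6},  # hypothetical shape and values
    "disagreement_class": "major",
    "rationales": [],  # omitted here for brevity
    "adjudicated_label": None,
    "adjudication_reason": "honest disagreement; not collapsed by default",
}

# A deliberately broken copy to show what the check reports.
bad = dict(record, distribution={"mild": 0.3, "moderate": 0.5, "severe": 0.1},
           disagreement_class="split")
bad.pop("confidence")

print(validate_record(record))
print(validate_record(bad))
```

Returning a list of problems instead of raising on the first one makes it easier to report every issue in a batch at once.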

6. How we report entropy, confidence, and disagreement.

Three reporting axes ship with every batch:

  • Entropy measures uncertainty in the distribution itself: a 50/50 split has more entropy than 80/20, regardless of whether reviewers were confident in their individual choices. Useful as a per-item difficulty signal.
  • Confidence measures how sure each reviewer was in their own answer. A high-entropy item with high per-reviewer confidence is genuine ambiguity. A high-entropy item with low confidence is usually a rubric problem we can fix.
  • Disagreement class is a per-item label that puts items into a coarse taxonomy (agreement, minor, major, unresolved). It is the field most engineering teams filter on when building training and eval splits.

Together, these three give your team a defensible answer to the question “why is the panel split on this item?” for every item in the dataset.
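
The three axes can be combined into a simple triage rule on the consumer side. The sketch below is illustrative only. It assumes confidence arrives as a list of per-reviewer scores in [0, 1], and the two thresholds are hypothetical values a team would tune for its own task.

```python
from statistics import mean

def triage(item, entropy_threshold=1.0, confidence_threshold=0.7):
    """Classify an item using entropy and mean per-reviewer confidence."""
    if item["entropy_bits"] < entropy_threshold:
        return "low_entropy"
    if mean(item["confidence"]) >= confidence_threshold:
        return "genuine_ambiguity"        # split panel, confident reviewers
    return "possible_rubric_problem"      # split panel, unsure reviewers

# Hypothetical items for illustration.
items = [
    {"item_id": "a", "entropy_bits": 1.5, "confidence": [0.9, 0.85, 0.9, 0.8],
     "disagreement_class": "major"},
    {"item_id": "b", "entropy_bits": 1.5, "confidence": [0.4, 0.5, 0.3, 0.45],
     "disagreement_class": "major"},
    {"item_id": "c", "entropy_bits": 0.0, "confidence": [0.9, 0.9, 0.9, 0.9],
     "disagreement_class": "agreement"},
]

labels = {item["item_id"]: triage(item) for item in items}

# Example split: keep only low-disagreement items for a deterministic eval set.
eval_split = [item["item_id"] for item in items
              if item["disagreement_class"] in {"agreement", "minor"}]

print(labels)
print(eval_split)
```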

7. When we still adjudicate to a hard label.

Soft labels are the default; they are not the only option. For some tasks a single adjudicated label is the right artifact. We adjudicate when:

  • The item has a programmatically verifiable answer (math, unit-checked physics, code that compiles, deterministic SQL).
  • The disagreement is clearly an error: a reviewer misread the prompt, used the wrong rubric, or violated a stated assumption.
  • The downstream consumer (a policy, a deterministic eval, a regulatory artifact) requires a single decision.
  • You explicitly request hard labels for an item set and accept the trade-off.

Even when we adjudicate, we preserve the full panel record — the per-reviewer label, rationale, and confidence — alongside the adjudicated answer. You never lose the soft signal; you just get an additional, recommended hard label on top of it.

8. Position.

Soft labels are a model signal, not a polite hedge. The honest information content of an ambiguous item is a distribution; reducing it to a point estimate is throwing away data the model could be using. Delta Evals delivers the distribution by default.

Not sure if your task needs hard or soft labels?

A 10-business-day pilot will tell you. We'll run 100–300 items through a 3-expert panel and ship you the distributions, so you can see exactly how much signal a single label was leaving on the table.