Services

A detailed look at the thirteen services Delta Evals offers.

Each service is built around a single principle: the reasoning behind a label is often worth more than the label itself. Below is what each engagement covers, how we deliver it, and what you receive at the end.

01
Annotation

Rationale-augmented labeling

We do more than assign labels: every label is paired with a concise human rationale that explains why it was chosen, including supporting evidence, assumptions, ambiguity notes, and a confidence level. This is essential for subjective, ambiguous, or expert-level tasks where the reasoning behind the label matters as much as the label itself.

  • Selected label plus a structured rationale block
  • Supporting evidence pointers (text spans, image regions, timestamps)
  • Stated assumptions and counter-considerations
  • Ambiguity notes when the answer is not clear-cut
  • Annotator confidence on a calibrated scale
  • Optional second-pass rationale review for high-stakes items
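
For a concrete picture, here is a minimal sketch of what a single rationale-augmented record can look like. The field names and the 0-1 confidence scale are illustrative assumptions for this sketch, not a fixed Delta Evals schema.

```python
# Illustrative shape of a rationale-augmented label record.
# Field names and the 0-1 confidence scale are assumptions, not a
# fixed Delta Evals schema.
rationale_record = {
    "item_id": "task-0142",
    "label": "unsafe",
    "rationale": "The response recommends a dosage outside the cited guideline range.",
    "evidence": [
        {"type": "text_span", "start": 118, "end": 164},
    ],
    "assumptions": ["Reader is a layperson, not a clinician."],
    "counter_considerations": ["Could be read as a hypothetical example."],
    "ambiguity_note": "Borderline between 'unsafe' and 'needs-context'.",
    "confidence": 0.7,           # calibrated scale, 0-1
    "second_pass_review": True,  # optional for high-stakes items
}
```
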
02
Authoring

Expert-created STEM reasoning tasks

We design challenging engineering and STEM prompts that test deep, multi-step reasoning rather than simple recall. Tasks are authored by domain experts and span mechanical, electrical, materials, industrial, automotive, robotics, AI/ML, and applied engineering domains.

  • Multi-step problem decomposition (no single-fact lookups)
  • Realistic engineering scenarios and constraints
  • Difficulty tagging and target-skill metadata
  • Expected reasoning depth and length budget
  • Domain-appropriate notation and units
  • Originality & leakage screening
03
Verification

Complete solution traces & verification artifacts

For each expert-created task we provide the correct answer, step-by-step derivation, assumptions, equations, unit checks, boundary conditions, alternate solution paths, and verification logic to confirm whether a model's answer is valid.

  • Canonical correct answer plus tolerances
  • Full step-by-step derivation
  • Stated assumptions, units, and reference values
  • Boundary & sanity-check conditions
  • Alternate valid solution paths
  • Programmatic or rubric-based verification logic
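
To make the verification logic concrete, here is a minimal Python sketch of a tolerance and boundary check; the 2% relative tolerance and the example values are illustrative assumptions.

```python
import math

def verify_numeric_answer(model_value: float, canonical_value: float,
                          rel_tol: float = 0.02) -> bool:
    """Check a model's numeric answer against the canonical value.

    A relative tolerance (here 2%, an illustrative default) absorbs
    rounding differences between valid solution paths.
    """
    return math.isclose(model_value, canonical_value, rel_tol=rel_tol)

def sanity_check(value: float, lower: float, upper: float) -> bool:
    """Boundary check: a physically meaningful answer must fall inside
    the stated bounds (e.g., an efficiency between 0 and 1)."""
    return lower <= value <= upper

# Example: a computed efficiency of 0.612 vs. a canonical 0.605
assert verify_numeric_answer(0.612, 0.605)
assert sanity_check(0.612, 0.0, 1.0)
```
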
04
Evaluation

Rubric-driven evaluation

We create detailed grading rubrics that decompose quality into measurable criteria such as factual correctness, reasoning consistency, mathematical validity, domain feasibility, clarity, completeness, safety, and severity of errors. This supports more consistent human review and more interpretable model evaluation.

  • Per-criterion scoring with anchored examples
  • Severity definitions for error types
  • Reviewer-friendly tie-breaking rules
  • Aggregated and disaggregated scoring
  • Rubric versioning and changelog
  • Calibration items embedded in rubric design
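
Here is a minimal sketch of how per-criterion scores can roll up into a weighted aggregate while staying available in disaggregated form; the criteria, weights, and 0-4 anchored scale are illustrative assumptions.

```python
# Hypothetical rubric; criteria, weights, and the 0-4 anchored scale
# are illustrative, not a fixed format.
RUBRIC = {
    "factual_correctness":   {"weight": 0.30},
    "reasoning_consistency": {"weight": 0.25},
    "mathematical_validity": {"weight": 0.20},
    "clarity":               {"weight": 0.15},
    "safety":                {"weight": 0.10},
}

def aggregate(scores: dict[str, int]) -> float:
    """Weighted aggregate over per-criterion scores; the per-criterion
    (disaggregated) scores are kept alongside the aggregate."""
    return sum(RUBRIC[c]["weight"] * s for c, s in scores.items())

print(aggregate({
    "factual_correctness": 4, "reasoning_consistency": 3,
    "mathematical_validity": 4, "clarity": 3, "safety": 4,
}))  # -> 3.6 (up to float rounding)
```
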
05
Adversarial

Failure-mode & blind-spot analysis

We design adversarial examples, edge cases, near-miss cases, counterfactual variants, and stress tests that expose weaknesses in a model's reasoning, assumptions, numerical computation, handling of physical constraints, and domain knowledge, as well as its tendencies toward hallucination and overconfidence.

  • Failure-mode taxonomy tailored to your model
  • Near-miss and counterfactual prompt variants
  • Numerical and unit-trap stress tests
  • Physically impossible / ill-posed scenarios
  • Overconfidence and hallucination probes
  • Coverage report mapping prompts to failure modes
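
A coverage report can be as simple as a mapping from failure modes to the prompts that probe them, which also surfaces thin spots; the mode names and prompt IDs below are hypothetical.

```python
# Illustrative coverage report; failure-mode names and prompt IDs
# are hypothetical.
coverage = {
    "unit-trap":            ["adv-001", "adv-014", "adv-022"],
    "ill-posed scenario":   ["adv-003"],
    "overconfidence probe": ["adv-007", "adv-021"],
    "counterfactual":       ["adv-009", "adv-015"],
}

# Flag failure modes that still need more prompts.
thin = [mode for mode, ids in coverage.items() if len(ids) < 2]
print(thin)  # -> ['ill-posed scenario']
```
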
06
Annotation

Disagreement-aware & ambiguity-aware labeling

For subjective or open-ended tasks we collect multiple expert judgments, preserve annotator disagreement, document the rationale behind different interpretations, and produce soft-label or label-distribution outputs rather than forcing a single majority-vote label. Recent work shows annotator disagreement carries valuable information that should not be treated as noise.

  • Multi-annotator collection per item
  • Soft-label and label-distribution outputs
  • Per-interpretation rationale documentation
  • Separation of reasonable disagreement from annotator error
  • Optional adjudicated single-label view
  • Disagreement metadata for downstream training
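
As a sketch of the soft-label output, here is how multiple expert judgments can become a label distribution instead of a forced majority vote; the example labels are hypothetical.

```python
from collections import Counter

def soft_label(judgments: list[str]) -> dict[str, float]:
    """Turn per-annotator judgments into a label distribution rather
    than collapsing them to a single majority-vote label."""
    counts = Counter(judgments)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# Five expert judgments on one subjective item:
print(soft_label(["sarcastic", "sarcastic", "neutral", "sarcastic", "neutral"]))
# -> {'sarcastic': 0.6, 'neutral': 0.4}
```
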
07
Workflow

Human–LLM collaborative annotation

We use LLMs to generate draft labels, rationales, or candidate solutions, then have human experts verify, correct, reject, or improve them. This allows faster iteration while keeping expert oversight at the center of quality control.

  • LLM-drafted labels and rationales
  • Expert verification, correction, or rejection
  • Edit-distance and override tracking
  • Disagreement escalates the item to a full human redo
  • Prompt and model version capture
  • Throughput vs. quality dashboards
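
Override tracking can be as lightweight as a similarity score between the LLM draft and the expert's final version; the 0.5 flag threshold below is an illustrative assumption.

```python
import difflib

def override_stats(draft: str, final: str) -> dict:
    """Measure how much an expert changed an LLM-drafted rationale.
    Low similarity flags items where the draft was unreliable; the 0.5
    threshold here is illustrative, not a fixed policy."""
    ratio = difflib.SequenceMatcher(None, draft, final).ratio()
    return {"similarity": round(ratio, 3), "overridden": ratio < 0.5}

print(override_stats(
    "The answer is correct because the units cancel.",
    "Incorrect: the units do not cancel; torque is in N*m, not N.",
))
```
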
08
Strategy

Active-learning-driven labeling

We prioritize the most informative, uncertain, diverse, or high-impact samples for expert review instead of labeling data randomly. This reduces annotation cost while improving the value of every expert-labeled example.

  • Uncertainty & disagreement-based sampling
  • Diversity and coverage-based sampling
  • Impact-weighted prioritization
  • Closed-loop iteration with model retraining
  • Budget-aware batching
  • Sample-selection rationale logged
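
For illustration, here is a minimal uncertainty-sampling sketch, one of the strategies we combine with diversity and impact weighting; the example pool and probabilities are hypothetical.

```python
import math

def entropy(probs: list[float]) -> float:
    """Predictive entropy; higher means the model is less certain."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_batch(pool: dict[str, list[float]], budget: int) -> list[str]:
    """Rank unlabeled items by model uncertainty and send the top
    `budget` items to expert review (uncertainty sampling)."""
    ranked = sorted(pool, key=lambda item: entropy(pool[item]), reverse=True)
    return ranked[:budget]

pool = {
    "q1": [0.95, 0.05],  # confident -> low priority
    "q2": [0.51, 0.49],  # uncertain -> high priority
    "q3": [0.70, 0.30],
}
print(select_batch(pool, budget=2))  # -> ['q2', 'q3']
```
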
09
Evaluation

Preference ranking & comparative evaluation

We evaluate multiple model outputs through pairwise preference ranking, best-of-N selection, rubric-based scoring, and explanation-based review — including notes on why one response is more correct, safer, clearer, or more useful than another.

  • Pairwise A/B preferences with rationale
  • Best-of-N selection with justification
  • Rubric scoring across multiple criteria
  • Handling of ties and "neither acceptable" outcomes
  • Position-bias controls
  • Inter-rater agreement on preference
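
One standard position-bias control is to randomize presentation order and record the mapping so the true winner can be recovered; here is a minimal sketch of that idea.

```python
import random

def present_pair(response_a: str, response_b: str, seed: int) -> dict:
    """Randomize which response appears first so reviewers cannot learn
    a positional pattern; record the mapping for later recovery."""
    rng = random.Random(seed)
    swapped = rng.random() < 0.5
    first, second = (response_b, response_a) if swapped else (response_a, response_b)
    return {"first": first, "second": second, "swapped": swapped}

def resolve(preference: str, swapped: bool) -> str:
    """Map the reviewer's 'first'/'second' choice back to A or B."""
    if preference not in ("first", "second"):
        return preference  # e.g. 'tie' or 'neither acceptable'
    picked_first = preference == "first"
    return "A" if picked_first != swapped else "B"

shown = present_pair("Response A text", "Response B text", seed=42)
print(resolve("first", shown["swapped"]))  # true identity of the winner
```
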
10
Multimodal

Multimodal annotation with evidence grounding

We support text, image, audio, video, and multimodal labeling with evidence spans, bounding boxes, timestamps, object/event descriptions, visual reasoning notes, and cross-modal consistency checks.

  • Text spans & entity grounding
  • Bounding boxes, polygons, segmentation
  • Audio & video timestamps with event descriptions
  • Cross-modal consistency checks
  • Visual reasoning rationales
  • Schema-validated output formats
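
As a sketch of schema-validated output, here is a bounding-box record checked against a JSON Schema, using the third-party jsonschema package as an assumed tool; the schema fields are illustrative.

```python
# Sketch of schema-validated multimodal output, using the third-party
# `jsonschema` package (a tooling assumption; any validator works).
from jsonschema import validate

BBOX_SCHEMA = {
    "type": "object",
    "required": ["image_id", "label", "bbox"],
    "properties": {
        "image_id": {"type": "string"},
        "label": {"type": "string"},
        "bbox": {  # [x, y, width, height] in pixels
            "type": "array", "items": {"type": "number"},
            "minItems": 4, "maxItems": 4,
        },
        "rationale": {"type": "string"},
    },
}

validate(
    {"image_id": "img-007", "label": "stop_sign",
     "bbox": [412, 96, 38, 40],
     "rationale": "Octagonal red sign, partially occluded by foliage."},
    BBOX_SCHEMA,
)  # raises ValidationError on malformed records
```
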
11
Quality

Reviewer calibration & quality assurance

We build reviewer workflows with qualification tasks, calibration rounds, gold-standard questions, blind review, inter-reviewer agreement tracking, adjudication, label-error audits, and continuous feedback loops.

  • Qualification & calibration tasks before production
  • Embedded gold items and pass-rate monitoring
  • Blind multi-pass review
  • Inter-reviewer agreement (Cohen's κ, Krippendorff's α)
  • Adjudication for hard or contested items
  • Label-error audits with corrective re-work
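
For two reviewers and categorical labels, Cohen's κ corrects raw agreement for the agreement expected by chance; here is a minimal sketch with hypothetical labels.

```python
from collections import Counter

def cohens_kappa(r1: list[str], r2: list[str]) -> float:
    """Cohen's kappa for two reviewers: observed agreement corrected
    for the agreement expected by chance."""
    n = len(r1)
    observed = sum(a == b for a, b in zip(r1, r2)) / n
    c1, c2 = Counter(r1), Counter(r2)
    expected = sum(c1[k] * c2[k] for k in c1) / (n * n)
    return (observed - expected) / (1 - expected)

# Two reviewers labeling six items:
print(round(cohens_kappa(
    ["yes", "yes", "no", "no", "yes", "no"],
    ["yes", "no",  "no", "no", "yes", "yes"],
), 2))  # -> 0.33
```
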
12
Systems

Dataset & benchmark quality systems

We support annotation guideline development, rubric refinement, dataset documentation, leakage checks, difficulty tagging, task metadata, model failure taxonomies, and version-controlled quality reviews.

  • Annotation guideline authoring & versioning
  • Rubric refinement cycles
  • Dataset documentation (datasheets / model cards)
  • Train/eval leakage checks
  • Difficulty tagging and topic metadata
  • Version-controlled releases & changelogs
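
A basic train/eval leakage check looks for long n-gram overlap between evaluation items and training data; the 8-token window below is an illustrative choice, and production checks also normalize text and use scalable indexes.

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Whitespace-tokenized n-grams of a document."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def leakage_overlap(train_doc: str, eval_doc: str, n: int = 8) -> float:
    """Fraction of the eval doc's n-grams that also appear in the
    training data; high overlap suggests leakage. The 8-token window
    is an illustrative threshold."""
    eval_grams = ngrams(eval_doc, n)
    if not eval_grams:
        return 0.0
    return len(eval_grams & ngrams(train_doc, n)) / len(eval_grams)
```
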
13
Neuroscience

Neuroscience annotation

Specialized neuroscience annotation services for EEG, MRI, brain-tumor pathology, and multimodal research datasets. We create expert-reviewed labels for seizure and sleep events, neuroimaging ROIs and lesions, cell and tissue segmentation, and structured research metadata for AI training and analysis pipelines.

  • EEG event labeling: seizure, sleep stages, artifacts
  • MRI ROI delineation and lesion segmentation
  • Brain-tumor pathology: cell- and tissue-level segmentation
  • Multimodal research datasets with cross-modal alignment
  • Structured metadata for AI & analysis pipelines
  • Clinician- and neuroscientist-reviewed quality control
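
As an illustration, here is what a single EEG event label can look like as a structured record; the field names are assumptions for this sketch, not a fixed export format.

```python
# Illustrative EEG event label; field names are assumptions, not a
# fixed export format.
eeg_event = {
    "recording_id": "eeg-2291",
    "channel": "Fz",
    "event": "seizure",
    "onset_s": 1423.6,       # seconds from recording start
    "offset_s": 1461.2,
    "artifact_flags": [],    # e.g. ["muscle", "electrode-pop"]
    "reviewed_by": "board-certified neurologist",  # clinician-reviewed QC
}
```
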

Ready to scope an engagement?

Send us a description of your model, the failure modes you care about, and your timeline. We'll come back with a labeling and evaluation plan.