Services

A detailed look at the thirteen services Delta Evals offers.

Each service is built around a single principle: the reasoning behind a label is often worth more than the label itself. Below is what each engagement covers, how we deliver it, and what you receive at the end.

01
Annotation

Rationale-augmented labeling

We do more than assign labels: every label is paired with a concise human rationale that explains why it was chosen, including supporting evidence, assumptions, ambiguity notes, and a confidence level. This is essential for subjective, ambiguous, or expert-level tasks where the reasoning behind the label matters as much as the label itself.

  • Selected label plus a structured rationale block
  • Supporting evidence pointers (text spans, image regions, timestamps)
  • Stated assumptions and counter-considerations
  • Ambiguity notes when the answer is not clear-cut
  • Annotator confidence on a calibrated scale
  • Optional second-pass rationale review for high-stakes items
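
For a concrete picture, here is a minimal sketch of what a single rationale-augmented record can look like. The field names and the 0-1 confidence scale are illustrative assumptions for this sketch, not a fixed Delta Evals schema.

```python
# Illustrative shape of a rationale-augmented label record.
# Field names and the 0-1 confidence scale are assumptions, not a
# fixed Delta Evals schema.
rationale_record = {
    "item_id": "task-0142",
    "label": "unsafe",
    "rationale": "The response recommends a dosage outside the cited guideline range.",
    "evidence": [
        {"type": "text_span", "start": 118, "end": 164},
    ],
    "assumptions": ["Reader is a layperson, not a clinician."],
    "counter_considerations": ["Could be read as a hypothetical example."],
    "ambiguity_note": "Borderline between 'unsafe' and 'needs-context'.",
    "confidence": 0.7,           # calibrated scale, 0-1
    "second_pass_review": True,  # optional for high-stakes items
}
```
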
02
Authoring

Expert-created STEM reasoning tasks

We design challenging engineering and STEM prompts that test deep, multi-step reasoning rather than simple recall. Tasks are authored by domain experts and span mechanical, electrical, materials, industrial, automotive, robotics, AI/ML, and applied engineering domains.

  • Multi-step problem decomposition (no single-fact lookups)
  • Realistic engineering scenarios and constraints
  • Difficulty tagging and target-skill metadata
  • Expected reasoning depth and length budget
  • Domain-appropriate notation and units
  • Originality & leakage screening
03
Verification

Complete solution traces & verification artifacts

For each expert-created task we provide the correct answer, step-by-step derivation, assumptions, equations, unit checks, boundary conditions, alternate solution paths, and verification logic to confirm whether a model's answer is valid.

  • Canonical correct answer plus tolerances
  • Full step-by-step derivation
  • Stated assumptions, units, and reference values
  • Boundary & sanity-check conditions
  • Alternate valid solution paths
  • Programmatic or rubric-based verification logic
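
To make the verification logic concrete, here is a minimal Python sketch of a tolerance and boundary check; the 2% relative tolerance and the example values are illustrative assumptions.

```python
import math

def verify_numeric_answer(model_value: float, canonical_value: float,
                          rel_tol: float = 0.02) -> bool:
    """Check a model's numeric answer against the canonical value.

    A relative tolerance (here 2%, an illustrative default) absorbs
    rounding differences between valid solution paths.
    """
    return math.isclose(model_value, canonical_value, rel_tol=rel_tol)

def sanity_check(value: float, lower: float, upper: float) -> bool:
    """Boundary check: a physically meaningful answer must fall inside
    the stated bounds (e.g., an efficiency between 0 and 1)."""
    return lower <= value <= upper

# Example: a computed efficiency of 0.612 vs. a canonical 0.605
assert verify_numeric_answer(0.612, 0.605)
assert sanity_check(0.612, 0.0, 1.0)
```
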
04
Evaluation

Rubric-driven evaluation

We create detailed grading rubrics that decompose quality into measurable criteria such as factual correctness, reasoning consistency, mathematical validity, domain feasibility, clarity, completeness, safety, and severity of errors. This supports more consistent human review and more interpretable model evaluation.

  • Per-criterion scoring with anchored examples
  • Severity definitions for error types
  • Reviewer-friendly tie-breaking rules
  • Aggregated and disaggregated scoring
  • Rubric versioning and changelog
  • Calibration items embedded in rubric design
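
Here is a minimal sketch of how per-criterion scores can roll up into a weighted aggregate while staying available in disaggregated form; the criteria, weights, and 0-4 anchored scale are illustrative assumptions.

```python
# Hypothetical rubric; criteria, weights, and the 0-4 anchored scale
# are illustrative, not a fixed format.
RUBRIC = {
    "factual_correctness":   {"weight": 0.30},
    "reasoning_consistency": {"weight": 0.25},
    "mathematical_validity": {"weight": 0.20},
    "clarity":               {"weight": 0.15},
    "safety":                {"weight": 0.10},
}

def aggregate(scores: dict[str, int]) -> float:
    """Weighted aggregate over per-criterion scores; the per-criterion
    (disaggregated) scores are kept alongside the aggregate."""
    return sum(RUBRIC[c]["weight"] * s for c, s in scores.items())

print(aggregate({
    "factual_correctness": 4, "reasoning_consistency": 3,
    "mathematical_validity": 4, "clarity": 3, "safety": 4,
}))  # -> 3.6 (up to float rounding)
```
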
05
Adversarial

Failure-mode & blind-spot analysis

We design adversarial examples, edge cases, near-miss cases, counterfactual variants, and stress tests that expose weaknesses in a model's reasoning, assumptions, numerical computation, handling of physical constraints, and domain knowledge, as well as its tendencies toward hallucination and overconfidence.

  • Failure-mode taxonomy tailored to your model
  • Near-miss and counterfactual prompt variants
  • Numerical and unit-trap stress tests
  • Physically impossible / ill-posed scenarios
  • Overconfidence and hallucination probes
  • Coverage report mapping prompts to failure modes
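
A coverage report can be as simple as a mapping from failure modes to the prompts that probe them, which also surfaces thin spots; the mode names and prompt IDs below are hypothetical.

```python
# Illustrative coverage report; failure-mode names and prompt IDs
# are hypothetical.
coverage = {
    "unit-trap":            ["adv-001", "adv-014", "adv-022"],
    "ill-posed scenario":   ["adv-003"],
    "overconfidence probe": ["adv-007", "adv-021"],
    "counterfactual":       ["adv-009", "adv-015"],
}

# Flag failure modes that still need more prompts.
thin = [mode for mode, ids in coverage.items() if len(ids) < 2]
print(thin)  # -> ['ill-posed scenario']
```
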
06
Annotation

Disagreement-aware & ambiguity-aware labeling

For subjective or open-ended tasks we collect multiple expert judgments, preserve annotator disagreement, document the rationale behind different interpretations, and produce soft-label or label-distribution outputs rather than forcing a single majority-vote label. Recent work shows annotator disagreement carries valuable information that should not be treated as noise.

  • Multi-annotator collection per item
  • Soft-label and label-distribution outputs
  • Per-interpretation rationale documentation
  • Separation of reasonable disagreement from annotator error
  • Optional adjudicated single-label view
  • Disagreement metadata for downstream training
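
As a sketch of the soft-label output, here is how multiple expert judgments can become a label distribution instead of a forced majority vote; the example labels are hypothetical.

```python
from collections import Counter

def soft_label(judgments: list[str]) -> dict[str, float]:
    """Turn per-annotator judgments into a label distribution rather
    than collapsing them to a single majority-vote label."""
    counts = Counter(judgments)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# Five expert judgments on one subjective item:
print(soft_label(["sarcastic", "sarcastic", "neutral", "sarcastic", "neutral"]))
# -> {'sarcastic': 0.6, 'neutral': 0.4}
```
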
07
Workflow

Human–LLM collaborative annotation

We use LLMs to generate draft labels, rationales, or candidate solutions, then have human experts verify, correct, reject, or improve them. This allows faster iteration while keeping expert oversight at the center of quality control.

  • LLM-drafted labels and rationales
  • Expert verification, correction, or rejection
  • Edit-distance and override tracking
  • Disagreement escalates the item to a full human redo
  • Prompt and model version capture
  • Throughput vs. quality dashboards
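
Override tracking can be as lightweight as a similarity score between the LLM draft and the expert's final version; the 0.5 flag threshold below is an illustrative assumption.

```python
import difflib

def override_stats(draft: str, final: str) -> dict:
    """Measure how much an expert changed an LLM-drafted rationale.
    Low similarity flags items where the draft was unreliable; the 0.5
    threshold here is illustrative, not a fixed policy."""
    ratio = difflib.SequenceMatcher(None, draft, final).ratio()
    return {"similarity": round(ratio, 3), "overridden": ratio < 0.5}

print(override_stats(
    "The answer is correct because the units cancel.",
    "Incorrect: the units do not cancel; torque is in N*m, not N.",
))
```
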
08
Strategy

Active-learning-driven labeling

We prioritize the most informative, uncertain, diverse, or high-impact samples for expert review instead of labeling data randomly. This reduces annotation cost while improving the value of every expert-labeled example.

  • Uncertainty & disagreement-based sampling
  • Diversity and coverage-based sampling
  • Impact-weighted prioritization
  • Closed-loop iteration with model retraining
  • Budget-aware batching
  • Sample-selection rationale logged
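
For illustration, here is a minimal uncertainty-sampling sketch, one of the strategies we combine with diversity and impact weighting; the example pool and probabilities are hypothetical.

```python
import math

def entropy(probs: list[float]) -> float:
    """Predictive entropy; higher means the model is less certain."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_batch(pool: dict[str, list[float]], budget: int) -> list[str]:
    """Rank unlabeled items by model uncertainty and send the top
    `budget` items to expert review (uncertainty sampling)."""
    ranked = sorted(pool, key=lambda item: entropy(pool[item]), reverse=True)
    return ranked[:budget]

pool = {
    "q1": [0.95, 0.05],  # confident -> low priority
    "q2": [0.51, 0.49],  # uncertain -> high priority
    "q3": [0.70, 0.30],
}
print(select_batch(pool, budget=2))  # -> ['q2', 'q3']
```
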
09
Evaluation

Preference ranking & comparative evaluation

We evaluate multiple model outputs through pairwise preference ranking, best-of-N selection, rubric-based scoring, and explanation-based review — including notes on why one response is more correct, safer, clearer, or more useful than another.

  • Pairwise A/B preferences with rationale
  • Best-of-N selection with justification
  • Rubric scoring across multiple criteria
  • Handling of ties and "neither acceptable" outcomes
  • Position-bias controls
  • Inter-rater agreement on preference
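
One standard position-bias control is to randomize presentation order and record the mapping so the true winner can be recovered; here is a minimal sketch of that idea.

```python
import random

def present_pair(response_a: str, response_b: str, seed: int) -> dict:
    """Randomize which response appears first so reviewers cannot learn
    a positional pattern; record the mapping for later recovery."""
    rng = random.Random(seed)
    swapped = rng.random() < 0.5
    first, second = (response_b, response_a) if swapped else (response_a, response_b)
    return {"first": first, "second": second, "swapped": swapped}

def resolve(preference: str, swapped: bool) -> str:
    """Map the reviewer's 'first'/'second' choice back to A or B."""
    if preference not in ("first", "second"):
        return preference  # e.g. 'tie' or 'neither acceptable'
    picked_first = preference == "first"
    return "A" if picked_first != swapped else "B"

shown = present_pair("Response A text", "Response B text", seed=42)
print(resolve("first", shown["swapped"]))  # true identity of the winner
```
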
10
Multimodal

Multimodal annotation with evidence grounding

We support text, image, audio, video, and multimodal labeling with evidence spans, bounding boxes, timestamps, object/event descriptions, visual reasoning notes, and cross-modal consistency checks.

  • Text spans & entity grounding
  • Bounding boxes, polygons, segmentation
  • Audio & video timestamps with event descriptions
  • Cross-modal consistency checks
  • Visual reasoning rationales
  • Schema-validated output formats
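
As a sketch of schema-validated output, here is a bounding-box record checked against a JSON Schema, using the third-party jsonschema package as an assumed tool; the schema fields are illustrative.

```python
# Sketch of schema-validated multimodal output, using the third-party
# `jsonschema` package (a tooling assumption; any validator works).
from jsonschema import validate

BBOX_SCHEMA = {
    "type": "object",
    "required": ["image_id", "label", "bbox"],
    "properties": {
        "image_id": {"type": "string"},
        "label": {"type": "string"},
        "bbox": {  # [x, y, width, height] in pixels
            "type": "array", "items": {"type": "number"},
            "minItems": 4, "maxItems": 4,
        },
        "rationale": {"type": "string"},
    },
}

validate(
    {"image_id": "img-007", "label": "stop_sign",
     "bbox": [412, 96, 38, 40],
     "rationale": "Octagonal red sign, partially occluded by foliage."},
    BBOX_SCHEMA,
)  # raises ValidationError on malformed records
```
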
11
Quality

Reviewer calibration & quality assurance

We build reviewer workflows with qualification tasks, calibration rounds, gold-standard questions, blind review, inter-reviewer agreement tracking, adjudication, label-error audits, and continuous feedback loops.

  • Qualification & calibration tasks before production
  • Embedded gold items and pass-rate monitoring
  • Blind multi-pass review
  • Inter-reviewer agreement (Cohen's κ, Krippendorff's α)
  • Adjudication for hard or contested items
  • Label-error audits with corrective re-work
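
For two reviewers and categorical labels, Cohen's κ corrects raw agreement for the agreement expected by chance; here is a minimal sketch with hypothetical labels.

```python
from collections import Counter

def cohens_kappa(r1: list[str], r2: list[str]) -> float:
    """Cohen's kappa for two reviewers: observed agreement corrected
    for the agreement expected by chance."""
    n = len(r1)
    observed = sum(a == b for a, b in zip(r1, r2)) / n
    c1, c2 = Counter(r1), Counter(r2)
    expected = sum(c1[k] * c2[k] for k in c1) / (n * n)
    return (observed - expected) / (1 - expected)

# Two reviewers labeling six items:
print(round(cohens_kappa(
    ["yes", "yes", "no", "no", "yes", "no"],
    ["yes", "no",  "no", "no", "yes", "yes"],
), 2))  # -> 0.33
```
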
12
Systems

Dataset & benchmark quality systems

We support annotation guideline development, rubric refinement, dataset documentation, leakage checks, difficulty tagging, task metadata, model failure taxonomies, and version-controlled quality reviews.

  • Annotation guideline authoring & versioning
  • Rubric refinement cycles
  • Dataset documentation (datasheets / model cards)
  • Train/eval leakage checks
  • Difficulty tagging and topic metadata
  • Version-controlled releases & changelogs
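
A basic train/eval leakage check looks for long n-gram overlap between evaluation items and training data; the 8-token window below is an illustrative choice, and production checks also normalize text and use scalable indexes.

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Whitespace-tokenized n-grams of a document."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def leakage_overlap(train_doc: str, eval_doc: str, n: int = 8) -> float:
    """Fraction of the eval doc's n-grams that also appear in the
    training data; high overlap suggests leakage. The 8-token window
    is an illustrative threshold."""
    eval_grams = ngrams(eval_doc, n)
    if not eval_grams:
        return 0.0
    return len(eval_grams & ngrams(train_doc, n)) / len(eval_grams)
```
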
13
Neuroscience

Neuroscience annotation

Specialized neuroscience annotation services for EEG, MRI, brain-tumor pathology, and multimodal research datasets. We create expert-reviewed labels for seizure and sleep events, neuroimaging ROIs and lesions, cell and tissue segmentation, and structured research metadata for AI training and analysis pipelines.

  • EEG event labeling: seizure, sleep stages, artifacts
  • MRI ROI delineation and lesion segmentation
  • Brain-tumor pathology: cell- and tissue-level segmentation
  • Multimodal research datasets with cross-modal alignment
  • Structured metadata for AI & analysis pipelines
  • Clinician- and neuroscientist-reviewed quality control
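
As an illustration, here is what a single EEG event label can look like as a structured record; the field names are assumptions for this sketch, not a fixed export format.

```python
# Illustrative EEG event label; field names are assumptions, not a
# fixed export format.
eeg_event = {
    "recording_id": "eeg-2291",
    "channel": "Fz",
    "event": "seizure",
    "onset_s": 1423.6,       # seconds from recording start
    "offset_s": 1461.2,
    "artifact_flags": [],    # e.g. ["muscle", "electrode-pop"]
    "reviewed_by": "board-certified neurologist",  # clinician-reviewed QC
}
```
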

Ready to scope an engagement?

Send us a description of your model, the failure modes you care about, and your timeline. We'll come back with a labeling and evaluation plan.