

DWS Spec 8: Verification & Evaluation Framework

Digital Worker Standard — DWS Specification

Version: 1.0
Tier: 2 — Orchestration
Status: Release Candidate
Dependencies: Spec 4 (Intent Artifacts), Spec 6 (Workflow & Phase Model)


1. Overview

This specification defines how worker output is independently evaluated against codified intent. Verification is DWS’s enforcement mechanism: the structural guarantee that worker output meets the standard defined before execution began.

Verification in DWS has three distinguishing properties:

  1. Intent-referenced. Verification evaluates against the intent artifact, not against the verifier’s own judgement. The intent defines what “good” means.
  2. Context-isolated. The verifier sees the output and the intent, not the execution trace. This prevents the verifier from rationalising the worker’s choices.
  3. Structured results. Verification produces typed findings with evidence, not pass/fail flags. This enables targeted remediation and calibration.

Verification is distinct from guardrails (Spec 1, Section 2.5), which are synchronous input/output validation rules. Guardrails prevent obviously invalid actions in real time; verification evaluates the quality of completed work against intent. It is also distinct from approval (Spec 10), which is a human authority decision, not a quality evaluation.


2. Verification Gate Schema

A verification gate is a checkpoint in a workflow (Spec 6) where worker output is evaluated.

{
  "type": "object",
  "required": ["gate_id", "name", "position", "intent_refs", "evaluation_criteria", "verifier_requirements", "gate_behaviour"],
  "properties": {
    "gate_id": { "type": "string" },
    "name": { "type": "string" },
    "position": {
      "type": "object",
      "required": ["workflow_id", "phase_id", "placement"],
      "properties": {
        "workflow_id": { "type": "string" },
        "phase_id": { "type": "string" },
        "placement": { "type": "string", "enum": ["phase_exit", "workflow_exit", "checkpoint"] }
      }
    },
    "intent_refs": {
      "type": "array",
      "items": { "type": "string" },
      "minItems": 1
    },
    "evaluation_criteria": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["dimension", "scale", "pass_threshold"],
        "properties": {
          "dimension": { "type": "string" },
          "description": { "type": "string" },
          "scale": {
            "type": "object",
            "required": ["min", "max", "type"],
            "properties": {
              "min": { "type": "number" },
              "max": { "type": "number" },
              "type": { "type": "string", "enum": ["integer", "float"] }
            }
          },
          "pass_threshold": { "type": "number" },
          "weight": { "type": "number", "default": 1.0 },
          "evidence_required": { "type": "boolean", "default": true }
        }
      }
    },
    "verifier_requirements": {
      "type": "object",
      "properties": {
        "fresh_context": { "type": "boolean", "const": true },
        "role": { "type": "string", "default": "verifier" }
      },
      "description": "fresh_context MUST be true. The verifier gets a clean context window with no shared memory from the executor."
    },
    "gate_behaviour": {
      "type": "object",
      "properties": {
        "blocking": { "type": "boolean", "default": true },
        "on_fail": { "type": "string", "enum": ["reject", "conditional_pass", "escalate"], "default": "reject" },
        "max_attempts": { "type": "integer", "minimum": 1, "default": 2 }
      }
    }
  }
}
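
For concreteness, here is a minimal sketch of a gate instance conforming to the schema above, with a small structural check. All identifiers (`gate-review-001`, `wf-report`, and so on) are hypothetical, and the check covers only the required fields and the `fresh_context` rule, not full JSON Schema validation:

```python
# Hypothetical gate instance; every id and name here is illustrative only.
gate = {
    "gate_id": "gate-review-001",
    "name": "Draft quality gate",
    "position": {"workflow_id": "wf-report", "phase_id": "ph-draft", "placement": "phase_exit"},
    "intent_refs": ["intent-report-v1"],
    "evaluation_criteria": [
        {
            "dimension": "correctness",
            "scale": {"min": 0, "max": 10, "type": "integer"},
            "pass_threshold": 7,
            "weight": 1.0,
            "evidence_required": True,
        }
    ],
    "verifier_requirements": {"fresh_context": True, "role": "verifier"},
    "gate_behaviour": {"blocking": True, "on_fail": "reject", "max_attempts": 2},
}

REQUIRED = ["gate_id", "name", "position", "intent_refs",
            "evaluation_criteria", "verifier_requirements", "gate_behaviour"]

def check_gate(g: dict) -> list[str]:
    """Return a list of structural problems (empty list means ok)."""
    errors = [f"missing: {k}" for k in REQUIRED if k not in g]
    if g.get("verifier_requirements", {}).get("fresh_context") is not True:
        errors.append("fresh_context MUST be true")
    if not g.get("intent_refs"):
        errors.append("intent_refs requires at least one entry")
    return errors
```

A full implementation would validate against the schema itself (e.g. with a JSON Schema validator) rather than hand-rolling checks; this sketch only shows the shape of a conforming instance.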

Standard evaluation dimensions (RECOMMENDED):

| Dimension | What it measures |
| --- | --- |
| correctness | Does the output satisfy the stated objective? |
| completeness | Does the output address all requirements? |
| consistency | Is the output internally consistent and aligned with institutional knowledge? |
| constraint_compliance | Does the output respect all stated constraints? |
| quality | Does the output meet the expected standard of work? |

3. Verification Context (Strict Isolation)

The verifier receives:

  • Output artifacts from the phase
  • Intent artifacts (the verification benchmark)
  • Constraint intents applicable to the work
  • Institutional knowledge entries relevant to the domain

The verifier MUST NOT receive:

  • The execution trace
  • The executor’s reasoning or intermediate state
  • Tool call history
  • Previous verification attempts (unless explicitly provided as re-verification context)

This isolation is non-negotiable. A verifier that can see how the worker arrived at its output is compromised as an independent evaluator.
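
One way to enforce this rule mechanically is to build the verifier's context through an allowlist, so that execution traces can never leak in by accident. A sketch, assuming the runtime keeps a per-run record keyed by field names like those below (the names are assumptions, not mandated by this spec):

```python
# Only these fields may cross from the run record into the verifier's context.
VERIFIER_ALLOWED = {"output_artifacts", "intent_artifacts",
                    "constraint_intents", "institutional_knowledge"}

def build_verifier_context(run_record: dict) -> dict:
    """Project the run record down to the allowed fields, silently dropping
    execution traces, executor reasoning, tool call history, and any record
    of previous verification attempts."""
    return {k: v for k, v in run_record.items() if k in VERIFIER_ALLOWED}
```

An allowlist is preferable to a blocklist here: a new field added to the run record is excluded by default, which matches the "non-negotiable" framing above.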


4. Findings

4.1 Finding Schema

{
  "type": "object",
  "required": ["finding_id", "dimension", "classification", "description", "evidence"],
  "properties": {
    "finding_id": { "type": "string" },
    "dimension": { "type": "string" },
    "classification": { "type": "string", "enum": ["blocking", "warning", "advisory"] },
    "description": { "type": "string" },
    "evidence": {
      "type": "array",
      "minItems": 1,
      "items": {
        "type": "object",
        "properties": {
          "evidence_type": { "type": "string", "enum": ["artifact_reference", "line_reference", "comparison", "metric", "intent_reference"] },
          "ref": { "type": "string" },
          "detail": { "type": "string" }
        }
      }
    },
    "recommendation": { "type": "string" }
  }
}

4.2 Classifications

  • blocking — The output cannot proceed. Triggers re-execution or escalation.
  • warning — The output can proceed but has notable issues. Logged for review.
  • advisory — Suggestions for improvement. No impact on progression.
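
A finding instance under this schema might look like the following sketch (all identifiers and references are hypothetical), together with the one predicate a runtime actually branches on:

```python
# Hypothetical blocking finding; ids and refs are illustrative only.
finding = {
    "finding_id": "find-003",
    "dimension": "constraint_compliance",
    "classification": "blocking",
    "description": "Output cites a data source outside the approved list.",
    "evidence": [
        {
            "evidence_type": "intent_reference",
            "ref": "intent-report-v1",
            "detail": "Constraint limits sources to the internal data warehouse.",
        }
    ],
    "recommendation": "Replace the external citation with an approved source.",
}

def is_blocking(f: dict) -> bool:
    """Only blocking findings halt progression; warnings are logged
    and advisories carry no weight in the verdict."""
    return f["classification"] == "blocking"
```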

5. Verdict Conditions

| Verdict | Condition |
| --- | --- |
| pass | All dimensions >= pass_threshold. No blocking findings. |
| conditional_pass | All dimensions pass, but warning findings exist. |
| fail | Any dimension < pass_threshold, or blocking findings exist. |
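
The verdict table reduces to a small decision function. A sketch, assuming per-dimension scores and thresholds keyed by dimension name:

```python
def render_verdict(scores: dict[str, float],
                   thresholds: dict[str, float],
                   findings: list[dict]) -> str:
    """Apply the verdict conditions: any blocking finding or any dimension
    below its pass_threshold fails the gate; warnings on an otherwise
    clean run downgrade pass to conditional_pass."""
    blocking = any(f["classification"] == "blocking" for f in findings)
    failed_dim = any(scores[d] < thresholds[d] for d in thresholds)
    if blocking or failed_dim:
        return "fail"
    if any(f["classification"] == "warning" for f in findings):
        return "conditional_pass"
    return "pass"
```

Note the ordering: blocking findings override dimension scores, so a gate can fail even when every score clears its threshold.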

6. Re-verification Cycle

When a verification gate fails:

  1. Blocking findings are compiled into structured feedback.
  2. Feedback is delivered to the executing worker.
  3. The worker re-executes with the feedback as additional context.
  4. The verifier re-evaluates with a fresh context (no memory of the previous attempt).
  5. Steps repeat until the gate passes or max_attempts is exhausted.
  6. On exhaustion, the on_fail action from gate_behaviour determines the next step.

Re-verification scope can be targeted (only blocking dimensions) or full. The default is targeted.
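
The cycle above can be sketched as a loop. Here `execute` and `verify` are stand-ins for runtime calls, not part of this spec: `execute(feedback)` re-runs the worker with structured feedback as context, and `verify(output)` returns a `(verdict, findings)` pair from a fresh-context verifier:

```python
def run_gate(execute, verify, gate: dict):
    """Drive the re-verification cycle for one gate until it passes,
    or until max_attempts is exhausted and on_fail takes over."""
    max_attempts = gate["gate_behaviour"].get("max_attempts", 2)
    feedback = []
    for attempt in range(1, max_attempts + 1):
        output = execute(feedback)            # step 3: re-execute with feedback
        verdict, findings = verify(output)    # step 4: fresh-context evaluation
        if verdict in ("pass", "conditional_pass"):
            return verdict, findings
        # steps 1-2: compile blocking findings into structured feedback
        feedback = [f for f in findings if f["classification"] == "blocking"]
    # step 6: attempts exhausted; the configured on_fail action decides
    return gate["gate_behaviour"].get("on_fail", "reject"), feedback
```

Because `verify` is called afresh on each attempt, the verifier's context isolation (Section 3) is preserved across the loop: it never carries memory of the previous attempt.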


7. Multi-Verifier Support

Verification gates MAY use multiple verifiers for higher confidence:

{
  "multi_verifier": {
    "verifier_count": 3,
    "quorum_strategy": "majority",
    "min_agree": 2
  }
}

Quorum strategies: majority, unanimous, weighted, any.

When verifiers disagree, all findings are included in the result. The quorum determines the final verdict.
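
One way a runtime might resolve the quorum is sketched below; the spec does not prescribe an algorithm. The sketch covers `majority`, `unanimous`, and `any`, and falls back to `min_agree` otherwise; a true `weighted` quorum would need per-verifier weights and is omitted:

```python
def resolve_quorum(verdicts: list[str], strategy: str, min_agree: int = 1) -> str:
    """Combine per-verifier verdicts into one gate verdict. A verifier
    counts as agreeing to pass if it returned pass or conditional_pass."""
    passes = sum(v in ("pass", "conditional_pass") for v in verdicts)
    needed = {"unanimous": len(verdicts),
              "any": 1,
              "majority": len(verdicts) // 2 + 1}.get(strategy, min_agree)
    return "pass" if passes >= needed else "fail"
```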


8. Verification Calibration

Over time, verifiers should be evaluated for their own accuracy. DWS tracks calibration metrics:

| Metric | What it measures |
| --- | --- |
| findings_per_run | Average number of findings per verification run. |
| blocking_findings_overturned | How often blocking findings are overturned on re-verification. |
| consistency_score | How consistently the verifier scores the same quality of work. |
| false_positive_rate | Proportion of blocking findings that turn out to be incorrect. |

Calibration is RECOMMENDED but not REQUIRED. Runtimes that track these metrics can adjust verifier behaviour over time, improving the overall quality signal.
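
A runtime that records verification history can derive these metrics directly. A sketch for two of them, assuming each run record carries a `findings` list of classifications and an `overturned` count of blocking findings later judged incorrect (both field names are assumptions):

```python
def calibration_metrics(runs: list[dict]) -> dict:
    """Compute findings_per_run and false_positive_rate from a
    verifier's history of run records."""
    total_runs = len(runs)
    all_findings = sum(len(r["findings"]) for r in runs)
    blocking = sum(r["findings"].count("blocking") for r in runs)
    overturned = sum(r["overturned"] for r in runs)
    return {
        "findings_per_run": all_findings / total_runs if total_runs else 0.0,
        "false_positive_rate": overturned / blocking if blocking else 0.0,
    }
```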


9. Cost Accumulation

Phase cost sums all attempts (retries included). Verification cost is tracked separately from execution cost but counts toward the global cost_ceiling (Spec 6). This prevents “verification loops” from silently consuming the entire budget.
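
The accounting rule can be sketched as a small ledger that keeps the two cost streams separate but folds both into the one ceiling (class and field names are assumptions for illustration):

```python
class CostLedger:
    """Track execution and verification cost separately while enforcing
    one global ceiling across both, all attempts included."""

    def __init__(self, cost_ceiling: float):
        self.cost_ceiling = cost_ceiling
        self.execution_cost = 0.0
        self.verification_cost = 0.0

    def charge(self, amount: float, kind: str) -> None:
        """Record a cost against the execution or verification stream."""
        if kind == "verification":
            self.verification_cost += amount
        else:
            self.execution_cost += amount

    @property
    def total(self) -> float:
        return self.execution_cost + self.verification_cost

    def within_budget(self) -> bool:
        """False once combined spend reaches the ceiling; this is what
        stops a verification loop from silently draining the budget."""
        return self.total < self.cost_ceiling
```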


10. Key Design Decisions

| Decision | Resolution | Rationale |
| --- | --- | --- |
| Verifier context isolation | Non-negotiable. Verifier sees output + intent only. | A verifier with execution context becomes a rubber stamp, not an independent evaluator. |
| Structured findings over pass/fail | Findings with evidence, classification, and recommendations. | Binary verdicts provide no actionable information for remediation. |
| Verification vs guardrails | Separate mechanisms. | Guardrails are fast inline checks. Verification is deep post-phase evaluation. Conflating them weakens both. |
| Calibration as optional | RECOMMENDED but not REQUIRED. | Not all deployments need calibration. But those that track it get measurably better verification over time. |

11. References

  • Spec 1: Worker Identity — Guardrails (Section 2.5) are distinct from verification.
  • Spec 4: Intent Artifacts — Verification evaluates output against intent success criteria.
  • Spec 5: Outcome Artifacts — Pre-delivery verification evaluates against outcome success criteria.
  • Spec 6: Workflow & Phases — Verification gates are placed at phase boundaries.
  • Spec 11: Events & Telemetry — Verification events: started, finding_issued, verdict_rendered.