DWS Spec 8: Verification & Evaluation Framework
Digital Worker Standard — DWS Specification
Version: 1.0
Tier: 2 — Orchestration
Status: Release Candidate
Dependencies: Spec 4 (Intent Artifacts), Spec 6 (Workflow & Phase Model)
1. Overview
This specification defines how worker output is independently evaluated against codified intent. Verification is DWS’s enforcement mechanism: the structural guarantee that worker output meets the standard defined before execution began.
Verification in DWS has three distinguishing properties:
- Intent-referenced. Verification evaluates against the intent artifact, not against the verifier’s own judgement. The intent defines what “good” means.
- Context-isolated. The verifier sees the output and the intent, not the execution trace. This prevents the verifier from rationalising the worker’s choices.
- Structured results. Verification produces typed findings with evidence, not pass/fail flags. This enables targeted remediation and calibration.
Verification is distinct from guardrails (Spec 1, Section 2.5), which are synchronous input/output validation rules. Guardrails prevent obviously invalid actions in real-time; verification evaluates the quality of completed work against intent. It is also distinct from approval (Spec 10), which is a human authority decision, not a quality evaluation.
2. Verification Gate Schema
A verification gate is a checkpoint in a workflow (Spec 6) where worker output is evaluated.
```json
{
  "type": "object",
  "required": ["gate_id", "name", "position", "intent_refs", "evaluation_criteria", "verifier_requirements", "gate_behaviour"],
  "properties": {
    "gate_id": { "type": "string" },
    "name": { "type": "string" },
    "position": {
      "type": "object",
      "required": ["workflow_id", "phase_id", "placement"],
      "properties": {
        "workflow_id": { "type": "string" },
        "phase_id": { "type": "string" },
        "placement": { "type": "string", "enum": ["phase_exit", "workflow_exit", "checkpoint"] }
      }
    },
    "intent_refs": { "type": "array", "items": { "type": "string" }, "minItems": 1 },
    "evaluation_criteria": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["dimension", "scale", "pass_threshold"],
        "properties": {
          "dimension": { "type": "string" },
          "description": { "type": "string" },
          "scale": {
            "type": "object",
            "required": ["min", "max", "type"],
            "properties": {
              "min": { "type": "number" },
              "max": { "type": "number" },
              "type": { "type": "string", "enum": ["integer", "float"] }
            }
          },
          "pass_threshold": { "type": "number" },
          "weight": { "type": "number", "default": 1.0 },
          "evidence_required": { "type": "boolean", "default": true }
        }
      }
    },
    "verifier_requirements": {
      "type": "object",
      "properties": {
        "fresh_context": { "type": "boolean", "const": true },
        "role": { "type": "string", "default": "verifier" }
      },
      "description": "fresh_context MUST be true. The verifier gets a clean context window with no shared memory from the executor."
    },
    "gate_behaviour": {
      "type": "object",
      "properties": {
        "blocking": { "type": "boolean", "default": true },
        "on_fail": { "type": "string", "enum": ["reject", "conditional_pass", "escalate"], "default": "reject" },
        "max_attempts": { "type": "integer", "minimum": 1, "default": 2 }
      }
    }
  }
}
```

Standard evaluation dimensions (RECOMMENDED):
| Dimension | What it measures |
|---|---|
| `correctness` | Does the output satisfy the stated objective? |
| `completeness` | Does the output address all requirements? |
| `consistency` | Is the output internally consistent and aligned with institutional knowledge? |
| `constraint_compliance` | Does the output respect all stated constraints? |
| `quality` | Does the output meet the expected standard of work? |
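As a non-normative sketch, a gate conforming to the schema above might look like the following (all IDs, names, and values are illustrative assumptions):

```python
# Hypothetical verification gate conforming to the Spec 8 gate schema.
# Every identifier and value here is illustrative, not normative.
gate = {
    "gate_id": "gate-review-001",
    "name": "Draft review gate",
    "position": {
        "workflow_id": "wf-quarterly-report",
        "phase_id": "phase-draft",
        "placement": "phase_exit",
    },
    "intent_refs": ["intent-report-accuracy"],
    "evaluation_criteria": [
        {
            "dimension": "correctness",
            "scale": {"min": 0, "max": 5, "type": "integer"},
            "pass_threshold": 4,
            "weight": 1.0,
            "evidence_required": True,
        }
    ],
    "verifier_requirements": {"fresh_context": True, "role": "verifier"},
    "gate_behaviour": {"blocking": True, "on_fail": "reject", "max_attempts": 2},
}

def thresholds_in_scale(gate: dict) -> bool:
    """Sanity check: every pass_threshold lies within its dimension's scale."""
    return all(
        c["scale"]["min"] <= c["pass_threshold"] <= c["scale"]["max"]
        for c in gate["evaluation_criteria"]
    )
```

A runtime would typically validate such a gate against the schema at registration time; the `thresholds_in_scale` helper is a hypothetical extra check, since JSON Schema alone cannot relate `pass_threshold` to `scale`.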
3. Verification Context (Strict Isolation)
The verifier receives:
- Output artifacts from the phase
- Intent artifacts (the verification benchmark)
- Constraint intents applicable to the work
- Institutional knowledge entries relevant to the domain
The verifier MUST NOT receive:
- The execution trace
- The executor’s reasoning or intermediate state
- Tool call history
- Previous verification attempts (unless explicitly provided as re-verification context)
This isolation is non-negotiable. A verifier that can see how the worker arrived at its output is compromised as an independent evaluator.
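One way to make the isolation structural rather than procedural is to build the verifier context by whitelisting, so executor state can never leak in by accident. A minimal sketch (field names are assumptions, not defined by the spec):

```python
# Sketch of strict verifier-context assembly. Only whitelisted keys are
# copied; the execution trace, reasoning, and tool history never transfer.
ALLOWED = {"output_artifacts", "intent_artifacts", "constraint_intents", "knowledge_entries"}
FORBIDDEN = {"execution_trace", "executor_reasoning", "tool_call_history", "previous_attempts"}

def build_verifier_context(phase_record: dict) -> dict:
    context = {k: v for k, v in phase_record.items() if k in ALLOWED}
    # Belt-and-braces: fail loudly if executor state somehow slipped through.
    assert not (context.keys() & FORBIDDEN), "executor state leaked into verifier context"
    return context
```

Whitelisting inverts the failure mode: a newly added executor field is excluded by default instead of included by default.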
4. Findings
4.1 Finding Schema
```json
{
  "type": "object",
  "required": ["finding_id", "dimension", "classification", "description", "evidence"],
  "properties": {
    "finding_id": { "type": "string" },
    "dimension": { "type": "string" },
    "classification": { "type": "string", "enum": ["blocking", "warning", "advisory"] },
    "description": { "type": "string" },
    "evidence": {
      "type": "array",
      "minItems": 1,
      "items": {
        "type": "object",
        "properties": {
          "evidence_type": { "type": "string", "enum": ["artifact_reference", "line_reference", "comparison", "metric", "intent_reference"] },
          "ref": { "type": "string" },
          "detail": { "type": "string" }
        }
      }
    },
    "recommendation": { "type": "string" }
  }
}
```

4.2 Classifications
- blocking — The output cannot proceed. Triggers re-execution or escalation.
- warning — The output can proceed but has notable issues. Logged for review.
- advisory — Suggestions for improvement. No impact on progression.
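For illustration, a single finding under this schema might look like the following (IDs, refs, and wording are hypothetical, not normative):

```python
# Hypothetical finding conforming to the Spec 8 finding schema.
# All identifiers and referenced intents are illustrative assumptions.
finding = {
    "finding_id": "find-007",
    "dimension": "constraint_compliance",
    "classification": "blocking",
    "description": "Output cites a vendor outside the approved supplier list.",
    "evidence": [
        {
            "evidence_type": "intent_reference",
            "ref": "intent-supplier-policy",
            "detail": "The constraint intent restricts vendors to the approved list.",
        }
    ],
    "recommendation": "Replace the citation with an approved supplier.",
}
```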
5. Verdict Conditions
| Verdict | Condition |
|---|---|
| `pass` | All dimensions >= `pass_threshold` and no blocking findings. |
| `conditional_pass` | All dimensions pass, but warning findings exist. |
| `fail` | Any dimension < `pass_threshold`, or blocking findings exist. |
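The verdict conditions above reduce to a small decision function. A minimal sketch (function and parameter names are assumptions, not spec-defined):

```python
# Derive a verdict from per-dimension scores, per-dimension thresholds,
# and the list of findings, per the Spec 8 verdict table.
def verdict(scores: dict, thresholds: dict, findings: list) -> str:
    blocking = any(f["classification"] == "blocking" for f in findings)
    warnings = any(f["classification"] == "warning" for f in findings)
    all_pass = all(scores[d] >= thresholds[d] for d in thresholds)
    if blocking or not all_pass:
        return "fail"          # any failed dimension or blocking finding fails the gate
    return "conditional_pass" if warnings else "pass"
```

Note that blocking findings force `fail` even when every dimension clears its threshold; the two conditions are independent.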
6. Re-verification Cycle
When a verification gate fails:
1. Blocking findings are compiled into structured feedback.
2. Feedback is delivered to the executing worker.
3. The worker re-executes with the feedback as additional context.
4. The verifier re-evaluates with a fresh context (no memory of the previous attempt).
5. Steps 1–4 repeat until the gate passes or `max_attempts` is exhausted.
6. On exhaustion, the `on_fail` action from `gate_behaviour` determines the next step.
Re-verification scope can be targeted (only blocking dimensions) or full. The default is targeted.
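The cycle can be sketched as a bounded loop; `execute_fn` and `verify_fn` are hypothetical callables standing in for the runtime's worker and fresh-context verifier, and the result shape is an assumption:

```python
# Sketch of the re-verification cycle: bounded retries with structured
# feedback, falling back to the gate's on_fail action on exhaustion.
def run_gate(execute_fn, verify_fn, max_attempts: int = 2, on_fail: str = "reject"):
    feedback = None
    for _ in range(max_attempts):
        output = execute_fn(feedback)   # worker (re-)executes with feedback as context
        result = verify_fn(output)      # verifier evaluates in a fresh context
        if result["verdict"] in ("pass", "conditional_pass"):
            return result
        # compile blocking findings into structured feedback for the next attempt
        feedback = [f for f in result["findings"] if f["classification"] == "blocking"]
    return {"verdict": "fail", "action": on_fail}  # attempts exhausted
```

The verifier itself keeps no memory between iterations; only the worker sees the feedback, which preserves the isolation property of Section 3.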
7. Multi-Verifier Support
Verification gates MAY use multiple verifiers for higher confidence:
```json
{
  "multi_verifier": {
    "verifier_count": 3,
    "quorum_strategy": "majority",
    "min_agree": 2
  }
}
```

Quorum strategies: `majority`, `unanimous`, `weighted`, `any`.
When verifiers disagree, all findings are included in the result. The quorum determines the final verdict.
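A majority quorum can be sketched as follows; the aggregation rule for mixed `pass`/`conditional_pass` agreement is an assumption (the spec does not pin it down), shown here as "downgrade to the weaker verdict":

```python
# Sketch of majority-quorum aggregation over independent verifier verdicts.
# The conditional_pass downgrade rule is an illustrative assumption.
def quorum_verdict(verdicts: list, min_agree: int) -> str:
    passing = sum(1 for v in verdicts if v in ("pass", "conditional_pass"))
    if passing >= min_agree:
        # If any agreeing verifier only conditionally passed, keep the caveat.
        return "conditional_pass" if "conditional_pass" in verdicts else "pass"
    return "fail"
```

Regardless of the final verdict, all findings from all verifiers are retained in the result, so dissenting evidence is never discarded.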
8. Verification Calibration
Over time, verifiers should be evaluated for their own accuracy. DWS tracks calibration metrics:
| Metric | What it measures |
|---|---|
| `findings_per_run` | Average number of findings per verification run. |
| `blocking_findings_overturned` | How often blocking findings are overturned on re-verification. |
| `consistency_score` | How consistently the verifier scores work of the same quality. |
| `false_positive_rate` | Proportion of blocking findings that turn out to be incorrect. |
Calibration is RECOMMENDED but not REQUIRED. Runtimes that track these metrics can adjust verifier behaviour over time, improving the overall quality signal.
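As a sketch of what tracking might look like, two of the metrics can be computed from a verifier's run history; the record shape and the `incorrect` flag are assumptions, since the spec does not fix a storage format:

```python
# Sketch of computing two calibration metrics from stored verification runs.
# Each run record holds its findings; "incorrect" marks a blocking finding
# later judged wrong (field names are illustrative assumptions).
def calibration(runs: list) -> dict:
    total_findings = sum(len(r["findings"]) for r in runs)
    blocking = [f for r in runs for f in r["findings"] if f["classification"] == "blocking"]
    incorrect = [f for f in blocking if f.get("incorrect")]
    return {
        "findings_per_run": total_findings / len(runs),
        "false_positive_rate": len(incorrect) / len(blocking) if blocking else 0.0,
    }
```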
9. Cost Accumulation
Phase cost sums all attempts (retries included). Verification cost is tracked separately from execution cost but counts toward the global cost_ceiling (Spec 6). This prevents “verification loops” from silently consuming the entire budget.
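The accounting rule can be sketched as follows; attempt records and field names are assumptions for illustration:

```python
# Sketch of Spec 8 cost accumulation: execution and verification costs are
# tracked separately, but both count toward the global cost_ceiling (Spec 6).
def total_phase_cost(attempts: list) -> dict:
    execution = sum(a["execution_cost"] for a in attempts)       # all retries included
    verification = sum(a["verification_cost"] for a in attempts) # tracked separately
    return {"execution": execution, "verification": verification,
            "total": execution + verification}

def within_ceiling(attempts: list, cost_ceiling: float) -> bool:
    return total_phase_cost(attempts)["total"] <= cost_ceiling
```

Because verification cost is summed per attempt, a failing gate that triggers repeated re-execution drives the total toward the ceiling and eventually halts the loop, rather than spending silently.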
10. Key Design Decisions
| Decision | Resolution | Rationale |
|---|---|---|
| Verifier context isolation | Non-negotiable. Verifier sees output + intent only. | A verifier with execution context becomes a rubber stamp, not an independent evaluator. |
| Structured findings over pass/fail | Findings with evidence, classification, and recommendations. | Binary verdicts provide no actionable information for remediation. |
| Verification vs guardrails | Separate mechanisms. | Guardrails are fast inline checks. Verification is deep post-phase evaluation. Conflating them weakens both. |
| Calibration as optional | RECOMMENDED but not REQUIRED. | Not all deployments need calibration. But those that track it get measurably better verification over time. |
11. References
- Spec 1: Worker Identity — Guardrails (Section 2.5) are distinct from verification.
- Spec 4: Intent Artifacts — Verification evaluates output against intent success criteria.
- Spec 5: Outcome Artifacts — Pre-delivery verification evaluates against outcome success criteria.
- Spec 6: Workflow & Phases — Verification gates are placed at phase boundaries.
- Spec 11: Events & Telemetry — Verification events: started, finding_issued, verdict_rendered.