DWS Spec 8: Verification & Evaluation Framework
Digital Worker Standard — DWS Specification
Version: 1.0
Tier: 2 — Orchestration
Status: Release Candidate
Dependencies: Spec 4 (Intent Artifacts), Spec 6 (Workflow & Phase Model)
1. Overview
This specification defines how worker output is independently evaluated against codified intent. Verification is DWS’s enforcement mechanism: the structural guarantee that worker output meets the standard defined before execution began.
Verification in DWS has three distinguishing properties:
- Intent-referenced. Verification evaluates against the intent artifact, not against the verifier’s own judgement. The intent defines what “good” means.
- Context-isolated. The verifier sees the output and the intent, not the execution trace. This prevents the verifier from rationalising the worker’s choices.
- Structured results. Verification produces typed findings with evidence, not pass/fail flags. This enables targeted remediation and calibration.
Verification is distinct from guardrails (Spec 1, Section 2.5), which are synchronous input/output validation rules. Guardrails prevent obviously invalid actions in real-time; verification evaluates the quality of completed work against intent. It is also distinct from approval (Spec 10), which is a human authority decision, not a quality evaluation.
2. Verification Gate Schema
A verification gate is a checkpoint in a workflow (Spec 6) where worker output is evaluated.
```json
{
  "type": "object",
  "required": ["gate_id", "name", "position", "intent_refs", "evaluation_criteria", "verifier_requirements", "gate_behaviour"],
  "properties": {
    "gate_id": { "type": "string" },
    "name": { "type": "string" },
    "position": {
      "type": "object",
      "required": ["workflow_id", "phase_id", "placement"],
      "properties": {
        "workflow_id": { "type": "string" },
        "phase_id": { "type": "string" },
        "placement": { "type": "string", "enum": ["phase_exit", "workflow_exit", "checkpoint"] }
      }
    },
    "intent_refs": { "type": "array", "items": { "type": "string" }, "minItems": 1 },
    "evaluation_criteria": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["dimension", "scale", "pass_threshold"],
        "properties": {
          "dimension": { "type": "string" },
          "description": { "type": "string" },
          "scale": {
            "type": "object",
            "required": ["min", "max", "type"],
            "properties": {
              "min": { "type": "number" },
              "max": { "type": "number" },
              "type": { "type": "string", "enum": ["integer", "float"] }
            }
          },
          "pass_threshold": { "type": "number" },
          "weight": { "type": "number", "default": 1.0 },
          "evidence_required": { "type": "boolean", "default": true }
        }
      }
    },
    "verifier_requirements": {
      "type": "object",
      "properties": {
        "fresh_context": { "type": "boolean", "const": true },
        "role": { "type": "string", "default": "verifier" }
      },
      "description": "fresh_context MUST be true. The verifier gets a clean context window with no shared memory from the executor."
    },
    "gate_behaviour": {
      "type": "object",
      "properties": {
        "blocking": { "type": "boolean", "default": true },
        "on_fail": { "type": "string", "enum": ["reject", "conditional_pass", "escalate"], "default": "reject" },
        "max_attempts": { "type": "integer", "minimum": 1, "default": 2 }
      }
    }
  }
}
```

Standard evaluation dimensions (RECOMMENDED):
| Dimension | What it measures |
|---|---|
| `correctness` | Does the output satisfy the stated objective? |
| `completeness` | Does the output address all requirements? |
| `consistency` | Is the output internally consistent and aligned with institutional knowledge? |
| `constraint_compliance` | Does the output respect all stated constraints? |
| `quality` | Does the output meet the expected standard of work? |
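As a non-normative sketch, a gate conforming to the schema above might look like the following (all IDs, names, and values are illustrative assumptions):

```python
# Hypothetical verification gate conforming to the Spec 8 gate schema.
# Every identifier and value here is illustrative, not normative.
gate = {
    "gate_id": "gate-review-001",
    "name": "Draft review gate",
    "position": {
        "workflow_id": "wf-quarterly-report",
        "phase_id": "phase-draft",
        "placement": "phase_exit",
    },
    "intent_refs": ["intent-report-accuracy"],
    "evaluation_criteria": [
        {
            "dimension": "correctness",
            "scale": {"min": 0, "max": 5, "type": "integer"},
            "pass_threshold": 4,
            "weight": 1.0,
            "evidence_required": True,
        }
    ],
    "verifier_requirements": {"fresh_context": True, "role": "verifier"},
    "gate_behaviour": {"blocking": True, "on_fail": "reject", "max_attempts": 2},
}

def thresholds_in_scale(gate: dict) -> bool:
    """Sanity check: every pass_threshold lies within its dimension's scale."""
    return all(
        c["scale"]["min"] <= c["pass_threshold"] <= c["scale"]["max"]
        for c in gate["evaluation_criteria"]
    )
```

A runtime would typically validate such a gate against the schema at registration time; the `thresholds_in_scale` helper is a hypothetical extra check, since JSON Schema alone cannot relate `pass_threshold` to `scale`.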
3. Verification Context (Strict Isolation)
The verifier receives:
- Output artifacts from the phase
- Intent artifacts (the verification benchmark)
- Constraint intents applicable to the work
- Institutional knowledge entries relevant to the domain
The verifier MUST NOT receive:
- The execution trace
- The executor’s reasoning or intermediate state
- Tool call history
- Previous verification attempts (unless explicitly provided as re-verification context)
This isolation is non-negotiable. A verifier that can see how the worker arrived at its output is compromised as an independent evaluator.
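One way to make the isolation structural rather than procedural is to build the verifier context by whitelisting, so executor state can never leak in by accident. A minimal sketch (field names are assumptions, not defined by the spec):

```python
# Sketch of strict verifier-context assembly. Only whitelisted keys are
# copied; the execution trace, reasoning, and tool history never transfer.
ALLOWED = {"output_artifacts", "intent_artifacts", "constraint_intents", "knowledge_entries"}
FORBIDDEN = {"execution_trace", "executor_reasoning", "tool_call_history", "previous_attempts"}

def build_verifier_context(phase_record: dict) -> dict:
    context = {k: v for k, v in phase_record.items() if k in ALLOWED}
    # Belt-and-braces: fail loudly if executor state somehow slipped through.
    assert not (context.keys() & FORBIDDEN), "executor state leaked into verifier context"
    return context
```

Whitelisting inverts the failure mode: a newly added executor field is excluded by default instead of included by default.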
4. Findings
4.1 Finding Schema
```json
{
  "type": "object",
  "required": ["finding_id", "dimension", "classification", "description", "evidence"],
  "properties": {
    "finding_id": { "type": "string" },
    "dimension": { "type": "string" },
    "classification": { "type": "string", "enum": ["blocking", "warning", "advisory"] },
    "description": { "type": "string" },
    "evidence": {
      "type": "array",
      "minItems": 1,
      "items": {
        "type": "object",
        "properties": {
          "evidence_type": { "type": "string", "enum": ["artifact_reference", "line_reference", "comparison", "metric", "intent_reference"] },
          "ref": { "type": "string" },
          "detail": { "type": "string" }
        }
      }
    },
    "recommendation": { "type": "string" }
  }
}
```

4.2 Classifications
- blocking — The output cannot proceed. Triggers re-execution or escalation.
- warning — The output can proceed but has notable issues. Logged for review.
- advisory — Suggestions for improvement. No impact on progression.
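For illustration, a single finding under this schema might look like the following (IDs, refs, and wording are hypothetical, not normative):

```python
# Hypothetical finding conforming to the Spec 8 finding schema.
# All identifiers and referenced intents are illustrative assumptions.
finding = {
    "finding_id": "find-007",
    "dimension": "constraint_compliance",
    "classification": "blocking",
    "description": "Output cites a vendor outside the approved supplier list.",
    "evidence": [
        {
            "evidence_type": "intent_reference",
            "ref": "intent-supplier-policy",
            "detail": "The constraint intent restricts vendors to the approved list.",
        }
    ],
    "recommendation": "Replace the citation with an approved supplier.",
}
```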
5. Verdict Conditions
| Verdict | Condition |
|---|---|
| `pass` | All dimensions >= `pass_threshold` and no blocking findings. |
| `conditional_pass` | All dimensions pass, but warning findings exist. |
| `fail` | Any dimension < `pass_threshold`, or blocking findings exist. |
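The verdict conditions above reduce to a small decision function. A minimal sketch (function and parameter names are assumptions, not spec-defined):

```python
# Derive a verdict from per-dimension scores, per-dimension thresholds,
# and the list of findings, per the Spec 8 verdict table.
def verdict(scores: dict, thresholds: dict, findings: list) -> str:
    blocking = any(f["classification"] == "blocking" for f in findings)
    warnings = any(f["classification"] == "warning" for f in findings)
    all_pass = all(scores[d] >= thresholds[d] for d in thresholds)
    if blocking or not all_pass:
        return "fail"          # any failed dimension or blocking finding fails the gate
    return "conditional_pass" if warnings else "pass"
```

Note that blocking findings force `fail` even when every dimension clears its threshold; the two conditions are independent.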
6. Re-verification Cycle
When a verification gate fails:
1. Blocking findings are compiled into structured feedback.
2. Feedback is delivered to the executing worker.
3. The worker re-executes with the feedback as additional context.
4. The verifier re-evaluates with a fresh context (no memory of the previous attempt).
5. Steps 1–4 repeat until the gate passes or `max_attempts` is exhausted.
6. On exhaustion, the `on_fail` action from `gate_behaviour` determines the next step.
Re-verification scope can be targeted (only blocking dimensions) or full. The default is targeted.
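The cycle can be sketched as a bounded loop; `execute_fn` and `verify_fn` are hypothetical callables standing in for the runtime's worker and fresh-context verifier, and the result shape is an assumption:

```python
# Sketch of the re-verification cycle: bounded retries with structured
# feedback, falling back to the gate's on_fail action on exhaustion.
def run_gate(execute_fn, verify_fn, max_attempts: int = 2, on_fail: str = "reject"):
    feedback = None
    for _ in range(max_attempts):
        output = execute_fn(feedback)   # worker (re-)executes with feedback as context
        result = verify_fn(output)      # verifier evaluates in a fresh context
        if result["verdict"] in ("pass", "conditional_pass"):
            return result
        # compile blocking findings into structured feedback for the next attempt
        feedback = [f for f in result["findings"] if f["classification"] == "blocking"]
    return {"verdict": "fail", "action": on_fail}  # attempts exhausted
```

The verifier itself keeps no memory between iterations; only the worker sees the feedback, which preserves the isolation property of Section 3.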
7. Multi-Verifier Support
Verification gates MAY use multiple verifiers for higher confidence:
```json
{
  "multi_verifier": {
    "verifier_count": 3,
    "quorum_strategy": "majority",
    "min_agree": 2
  }
}
```

Quorum strategies: `majority`, `unanimous`, `weighted`, `any`.
When verifiers disagree, all findings are included in the result. The quorum determines the final verdict.
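A majority quorum can be sketched as follows; the aggregation rule for mixed `pass`/`conditional_pass` agreement is an assumption (the spec does not pin it down), shown here as "downgrade to the weaker verdict":

```python
# Sketch of majority-quorum aggregation over independent verifier verdicts.
# The conditional_pass downgrade rule is an illustrative assumption.
def quorum_verdict(verdicts: list, min_agree: int) -> str:
    passing = sum(1 for v in verdicts if v in ("pass", "conditional_pass"))
    if passing >= min_agree:
        # If any agreeing verifier only conditionally passed, keep the caveat.
        return "conditional_pass" if "conditional_pass" in verdicts else "pass"
    return "fail"
```

Regardless of the final verdict, all findings from all verifiers are retained in the result, so dissenting evidence is never discarded.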
8. Verification Calibration
Over time, verifiers should be evaluated for their own accuracy. DWS tracks calibration metrics:
| Metric | What it measures |
|---|---|
| `findings_per_run` | Average number of findings per verification run. |
| `blocking_findings_overturned` | How often blocking findings are overturned on re-verification. |
| `consistency_score` | How consistently the verifier scores work of the same quality. |
| `false_positive_rate` | Proportion of blocking findings that turn out to be incorrect. |
Calibration is RECOMMENDED but not REQUIRED. Runtimes that track these metrics can adjust verifier behaviour over time, improving the overall quality signal.
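As a sketch of what tracking might look like, two of the metrics can be computed from a verifier's run history; the record shape and the `incorrect` flag are assumptions, since the spec does not fix a storage format:

```python
# Sketch of computing two calibration metrics from stored verification runs.
# Each run record holds its findings; "incorrect" marks a blocking finding
# later judged wrong (field names are illustrative assumptions).
def calibration(runs: list) -> dict:
    total_findings = sum(len(r["findings"]) for r in runs)
    blocking = [f for r in runs for f in r["findings"] if f["classification"] == "blocking"]
    incorrect = [f for f in blocking if f.get("incorrect")]
    return {
        "findings_per_run": total_findings / len(runs),
        "false_positive_rate": len(incorrect) / len(blocking) if blocking else 0.0,
    }
```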
9. Cost Accumulation
Phase cost sums all attempts (retries included). Verification cost is tracked separately from execution cost but counts toward the global cost_ceiling (Spec 6). This prevents “verification loops” from silently consuming the entire budget.
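The accounting rule can be sketched as follows; attempt records and field names are assumptions for illustration:

```python
# Sketch of Spec 8 cost accumulation: execution and verification costs are
# tracked separately, but both count toward the global cost_ceiling (Spec 6).
def total_phase_cost(attempts: list) -> dict:
    execution = sum(a["execution_cost"] for a in attempts)       # all retries included
    verification = sum(a["verification_cost"] for a in attempts) # tracked separately
    return {"execution": execution, "verification": verification,
            "total": execution + verification}

def within_ceiling(attempts: list, cost_ceiling: float) -> bool:
    return total_phase_cost(attempts)["total"] <= cost_ceiling
```

Because verification cost is summed per attempt, a failing gate that triggers repeated re-execution drives the total toward the ceiling and eventually halts the loop, rather than spending silently.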
10. Key Design Decisions
| Decision | Resolution | Rationale |
|---|---|---|
| Verifier context isolation | Non-negotiable. Verifier sees output + intent only. | A verifier with execution context becomes a rubber stamp, not an independent evaluator. |
| Structured findings over pass/fail | Findings with evidence, classification, and recommendations. | Binary verdicts provide no actionable information for remediation. |
| Verification vs guardrails | Separate mechanisms. | Guardrails are fast inline checks. Verification is deep post-phase evaluation. Conflating them weakens both. |
| Calibration as optional | RECOMMENDED but not REQUIRED. | Not all deployments need calibration. But those that track it get measurably better verification over time. |
11. References
- Spec 1: Worker Identity — Guardrails (Section 2.5) are distinct from verification.
- Spec 4: Intent Artifacts — Verification evaluates output against intent success criteria.
- Spec 5: Outcome Artifacts — Pre-delivery verification evaluates against outcome success criteria.
- Spec 6: Workflow & Phases — Verification gates are placed at phase boundaries.
- Spec 11: Events & Telemetry — Verification events: started, finding_issued, verdict_rendered.