Agent Harness

The docs-as-code harness implementation lives under score_harness/. It is a maintainer-facing subsystem for evaluating harness candidates against machine-readable change scenarios.

Subsystem map

The current subsystem layout is:

score_harness/spec/: task specifications used as evaluation units
score_harness/harness/: harness candidates, one Python file per candidate
score_harness/outer_loop.py: deterministic evaluation runner
score_harness/validate_candidate.py: cheap pre-benchmark validation
score_harness/query_runs.py: summary-first query helpers for prior runs
score_harness/consistency_rules.yaml: public rule catalog used by tasks and candidates
score_harness/runs/: append-only execution history and distilled traces

Execution flow

The execution contract is intentionally narrow:

Validate the candidate cheaply against one runnable task specification.
Load the candidate and task corpus.
Run the deterministic Lane A traceability gate for each active task.
Distill task-level trace artifacts into small JSON outputs.
Append a run-level summary entry to evolution_summary.jsonl.

The outer loop is deterministic Python. No LLM is required in Lane A.
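The five steps above can be sketched roughly as follows. This is a minimal illustration and not the actual contents of outer_loop.py: the function names, the task and candidate shapes, and the summary fields are all assumptions made for the example.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Callable

# Hypothetical shapes, chosen for illustration only.
Task = dict                          # e.g. {"task_id": "t1", "active": True}
Candidate = Callable[[dict], dict]   # returns e.g. {"passed": True}


def validate_candidate(candidate: Candidate, task: Task) -> bool:
    """Cheap check: the candidate runs on one task and reports a 'passed' key."""
    try:
        result = candidate(task)
    except Exception:
        return False
    return isinstance(result, dict) and "passed" in result


def run_outer_loop(candidate: Candidate, name: str, tasks: list[Task],
                   runs_dir: Path, iteration: str) -> dict:
    # Step 1: validate cheaply against one runnable task before anything else.
    if not tasks or not validate_candidate(candidate, tasks[0]):
        raise ValueError(f"candidate {name!r} failed cheap validation")

    run_dir = runs_dir / iteration / name
    passed = total = 0
    # Steps 2-4: visit active tasks in a stable order, run the gate,
    # and write a small JSON trace per task.
    for task in sorted(tasks, key=lambda t: t["task_id"]):
        if not task.get("active", True):
            continue
        gate = candidate(task)
        trace_dir = run_dir / "traces" / task["task_id"]
        trace_dir.mkdir(parents=True, exist_ok=True)
        (trace_dir / "gate_output.json").write_text(
            json.dumps(gate, sort_keys=True), encoding="utf-8"
        )
        passed += int(bool(gate.get("passed")))
        total += 1

    # Step 5: append (never rewrite) one run-level summary line.
    summary = {"iteration": iteration, "candidate": name,
               "passed": passed, "total": total}
    runs_dir.mkdir(parents=True, exist_ok=True)
    with (runs_dir / "evolution_summary.jsonl").open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(summary, sort_keys=True) + "\n")
    return summary
```

Sorting tasks by identifier and serializing with sort_keys=True are one way to keep repeated runs byte-for-byte comparable; the real runner may achieve determinism differently.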

Artifacts

A successful run writes:

runs/<iteration>/<candidate>/score.json
runs/<iteration>/<candidate>/traces/<task_id>/gate_output.json
runs/<iteration>/<candidate>/traces/<task_id>/impacted_elements.json
runs/<iteration>/<candidate>/traces/<task_id>/score.json
runs/evolution_summary.jsonl
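The layout above can be expressed as a small path helper, for example to check that a run produced everything it should. The file names come from the list above; the helper names and the example identifiers are hypothetical.

```python
from __future__ import annotations

from pathlib import Path

# Per-task files, as listed above.
TRACE_FILES = ("gate_output.json", "impacted_elements.json", "score.json")


def expected_artifacts(runs_dir: Path, iteration: str, candidate: str,
                       task_ids: list[str]) -> list[Path]:
    """Return every path a successful run is expected to write."""
    run_dir = runs_dir / iteration / candidate
    paths = [run_dir / "score.json"]            # run-level score
    for task_id in task_ids:
        trace_dir = run_dir / "traces" / task_id
        paths.extend(trace_dir / name for name in TRACE_FILES)
    paths.append(runs_dir / "evolution_summary.jsonl")  # shared, append-only
    return paths


def missing_artifacts(paths: list[Path]) -> list[Path]:
    """Return the expected paths that do not exist as files."""
    return [p for p in paths if not p.is_file()]


if __name__ == "__main__":
    # "iter_0", "baseline" and "task_a" are made-up example identifiers.
    for path in expected_artifacts(Path("runs"), "iter_0", "baseline", ["task_a"]):
        print(path)
```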

If cheap validation fails, structured failure entries can also be appended to score_harness/validation_failures.jsonl so later iterations can avoid repeating the same mistakes.
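One way such a failure log could be written and consulted is sketched below. The entry fields (candidate, check, message) and both function names are assumptions for illustration; the source does not specify the schema of validation_failures.jsonl.

```python
from __future__ import annotations

import json
from pathlib import Path


def record_failure(log_path: Path, candidate: str, check: str, message: str) -> None:
    """Append one structured failure entry as a single JSON line."""
    entry = {"candidate": candidate, "check": check, "message": message}
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, sort_keys=True) + "\n")


def seen_failures(log_path: Path) -> set[tuple[str, str]]:
    """Return the (candidate, check) pairs already recorded in the log."""
    if not log_path.exists():
        return set()
    seen: set[tuple[str, str]] = set()
    with log_path.open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():  # tolerate blank lines
                entry = json.loads(line)
                seen.add((entry["candidate"], entry["check"]))
    return seen
```

A later iteration could call seen_failures() before proposing a candidate and skip any that would trip a check already on record.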

Manual and agent-assisted changes

The harness is not limited to agent-generated changes. The important split is not human versus agent, but deterministic versus optional.

Lane A, the deterministic traceability gate, applies equally to manual and agent-assisted changes.
Lane B is the optional agentic workflow for proposing and improving harness candidates.
Merge eligibility remains tied to deterministic checks, not to the proposer.

Current CI status

The harness is already covered indirectly by repository CI:

linting covers harness files through repository-wide pre-commit execution
Bazel test execution includes harness tests through bazel test //...

There is not yet a dedicated harness workflow job that runs the outer loop itself as a named CI check and uploads harness run artifacts.

That dedicated CI integration remains planned work.