Agent Harness
The docs-as-code harness implementation lives under score_harness/. It is a
maintainer-facing subsystem for evaluating harness candidates against
machine-readable change scenarios.
Subsystem map
The current subsystem layout is:
- score_harness/spec/: task specifications used as evaluation units
- score_harness/harness/: harness candidates, one Python file per candidate
- score_harness/outer_loop.py: deterministic evaluation runner
- score_harness/validate_candidate.py: cheap pre-benchmark validation
- score_harness/query_runs.py: summary-first query helpers for prior runs
- score_harness/consistency_rules.yaml: public rule catalog used by tasks and candidates
- score_harness/runs/: append-only execution history and distilled traces
Execution flow
The execution contract is intentionally narrow:
1. Validate the candidate cheaply against one runnable task specification.
2. Load the candidate and task corpus.
3. Run the deterministic Lane A traceability gate for each active task.
4. Distill task-level trace artifacts into small JSON outputs.
5. Append a run-level summary entry to evolution_summary.jsonl.
The outer loop is deterministic Python. No LLM is required in Lane A.
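The five steps above could be sketched roughly as follows. This is a minimal illustration, not the actual outer_loop.py: the function name run_outer_loop, the validate and gate callables, the task dictionary keys (task_id, active, passed), and the summary fields are all assumptions made for the example. Only the file layout under runs/ follows the paths named in this document.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def run_outer_loop(
    candidate: str,
    tasks: Iterable[dict],
    validate: Callable[[str, dict], bool],
    gate: Callable[[str, dict], dict],
    runs_dir: Path,
    iteration: str,
) -> dict:
    """Hypothetical sketch of the execution contract (not the real runner)."""
    # The task corpus is passed in already loaded; sorting by task_id keeps
    # the run order independent of how the caller enumerated the tasks.
    active = sorted(
        (t for t in tasks if t.get("active", True)),
        key=lambda t: t["task_id"],
    )

    # Cheap validation against a single runnable task before the full run.
    if not active or not validate(candidate, active[0]):
        raise ValueError(f"candidate {candidate!r} failed cheap validation")

    results = {}
    for task in active:
        # Deterministic gate: plain Python, no LLM involved.
        output = gate(candidate, task)

        # Distill the task-level trace into a small JSON file.
        trace_dir = runs_dir / iteration / candidate / "traces" / task["task_id"]
        trace_dir.mkdir(parents=True, exist_ok=True)
        (trace_dir / "gate_output.json").write_text(
            json.dumps(output, sort_keys=True), encoding="utf-8"
        )
        results[task["task_id"]] = bool(output.get("passed"))

    # Append (never rewrite) one run-level summary line.
    summary = {
        "iteration": iteration,
        "candidate": candidate,
        "passed": sum(results.values()),
        "total": len(results),
    }
    with (runs_dir / "evolution_summary.jsonl").open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(summary, sort_keys=True) + "\n")
    return summary
```

Passing validate and gate as plain callables keeps the sketch free of any model dependency, which mirrors the statement that Lane A needs no LLM.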
Artifacts
A successful run writes:
- runs/<iteration>/<candidate>/score.json
- runs/<iteration>/<candidate>/traces/<task_id>/gate_output.json
- runs/<iteration>/<candidate>/traces/<task_id>/impacted_elements.json
- runs/<iteration>/<candidate>/traces/<task_id>/score.json
- runs/evolution_summary.jsonl
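As an illustration of this layout, the expected paths for one run can be derived from the iteration, candidate, and task identifiers. The helper names artifact_paths and missing_artifacts are hypothetical and are not part of the harness. The sketch only encodes the path patterns listed above.

```python
from __future__ import annotations

from pathlib import Path


def artifact_paths(
    runs_dir: Path, iteration: str, candidate: str, task_ids: list[str]
) -> list[Path]:
    """Return the files a successful run is expected to write (hypothetical helper)."""
    base = runs_dir / iteration / candidate
    paths = [base / "score.json"]  # candidate-level score
    for task_id in task_ids:
        trace = base / "traces" / task_id
        paths += [
            trace / "gate_output.json",
            trace / "impacted_elements.json",
            trace / "score.json",  # task-level score
        ]
    paths.append(runs_dir / "evolution_summary.jsonl")  # shared across runs
    return paths


def missing_artifacts(
    runs_dir: Path, iteration: str, candidate: str, task_ids: list[str]
) -> list[Path]:
    """List expected artifacts that are not present on disk."""
    expected = artifact_paths(runs_dir, iteration, candidate, task_ids)
    return [p for p in expected if not p.is_file()]
```

A check like missing_artifacts could be used after a run to confirm that every expected file was written.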
If cheap validation fails, structured failure entries can also be appended to
score_harness/validation_failures.jsonl so later iterations can avoid
repeating the same mistakes.
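A minimal sketch of how such failure entries might be appended and read back is shown here. The entry fields (candidate, rule, message) and both function names are assumptions for illustration, since the real schema of validation_failures.jsonl is not described in this document.

```python
from __future__ import annotations

import json
from pathlib import Path


def record_validation_failure(
    log_path: Path, candidate: str, rule: str, message: str
) -> None:
    """Append one structured failure entry as a JSON line (hypothetical schema)."""
    entry = {"candidate": candidate, "rule": rule, "message": message}
    log_path.parent.mkdir(parents=True, exist_ok=True)
    # Mode "a" only ever adds lines, so earlier entries are never rewritten.
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, sort_keys=True) + "\n")


def known_failures(log_path: Path) -> set[tuple[str, str]]:
    """Return (candidate, rule) pairs already recorded, skipping blank lines."""
    if not log_path.exists():
        return set()
    seen = set()
    with log_path.open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                entry = json.loads(line)
                seen.add((entry["candidate"], entry["rule"]))
    return seen
```

A later iteration could consult known_failures before proposing a candidate and skip combinations that have already failed cheap validation.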
Manual and agent-assisted changes
The harness is not limited to agent-generated changes. The important split is not human versus agent, but deterministic versus optional.
Lane A applies equally to manual changes and agent-assisted changes.
Lane B is the optional agentic workflow for proposing and improving harness candidates.
Merge eligibility remains tied to deterministic checks, not to the proposer.
Current CI status
The harness is already covered indirectly by repository CI:
- linting covers harness files through repository-wide pre-commit execution
- Bazel test execution includes harness tests through bazel test //...
There is not yet a dedicated harness workflow job that runs the outer loop itself as a named CI check and uploads harness run artifacts.
That dedicated CI integration remains planned work.