
teknium1 1709776120 test(tool-search): add live A/B harness, drop checked-in transcripts

Brings in the tool_search live-test harness from the original PR but leaves
out the 11 checked-in scripts/out/*.json transcript files — those are
non-deterministic model output that goes stale the moment the model changes
and were the bulk of the diff. scripts/out/ is now gitignored so a harness
run never re-commits them.

Fixes on top:
- API-key loading goes through hermes_cli.env_loader.load_hermes_dotenv
  instead of hand-parsing ~/.hermes/.env and assigning the value to a local.
  The canonical loader never materializes the secret in a local variable in
  this module, which clears the four CodeQL high alerts
  (py/clear-text-storage / py/clear-text-logging-sensitive-data at the
  transcript write/print sites — they were tracing the key from the
  hand-rolled parser into the records) and removes a hand-rolled parser.
- encoding='utf-8' on every write_text/read_text in both harness scripts
  (Windows-footgun hygiene).

Co-authored-by: teknium1 <127238744+teknium1@users.noreply.github.com>

2026-05-29 02:04:12 -07:00


Tool Search live test harness

Runs five scenarios against a real model (Claude Haiku 4.5 via OpenRouter) to verify that the bridge tools (tool_search, tool_describe, tool_call) work end-to-end, and records a transcript of each run in scripts/out/.

Running

cd <repo root>
python3 scripts/tool_search_livetest.py        # runs all 5 scenarios x 2 modes
python3 scripts/analyze_livetest.py            # side-by-side report

Requires OPENROUTER_API_KEY set or present in ~/.hermes/.env.

What it verifies

Scenario	Tests
A obvious_single	BM25 retrieval on an obvious tool name (github_create_issue)
B vague_paraphrased	Retrieval when the model has to paraphrase ("schedule meeting" → evt_create)
C multi_tool_chain	Multi-step task chaining two deferred tools (GitHub + Slack)
D core_plus_deferred	Mixed: core tool (read_file) called directly, deferred tool (Slack) via bridge
E no_tool_needed	Pure-knowledge prompt; verify no spurious tool_search invocations
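
Scenarios A and B both depend on BM25 ranking. The following is a minimal sketch of that ranking, not the harness's actual retriever: the catalog descriptions, the tokenizer, and the k1/b defaults are assumptions made for illustration, and only the tool names github_create_issue and evt_create come from the table above.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Lowercase and split on non-alphanumerics, so "github_create_issue"
    # yields the terms github / create / issue.
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def bm25_rank(query: str, docs: dict[str, str],
              k1: float = 1.5, b: float = 0.75) -> list[tuple[str, float]]:
    """Rank tool names by Okapi BM25 score over name + description."""
    if not docs:
        return []
    tokens = {name: tokenize(f"{name} {desc}") for name, desc in docs.items()}
    n_docs = len(tokens)
    avg_len = sum(len(t) for t in tokens.values()) / n_docs
    # Document frequency: in how many tool entries each term appears.
    df = Counter(term for t in tokens.values() for term in set(t))
    query_terms = set(tokenize(query))

    scores = {}
    for name, toks in tokens.items():
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF variant that never goes negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[name] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical catalog: the descriptions are invented for this example.
CATALOG = {
    "github_create_issue": "Create a new issue in a GitHub repository",
    "slack_post_message": "Post a message to a Slack channel",
    "evt_create": "Schedule a calendar event or meeting",
}

obvious = bm25_rank("create a github issue", CATALOG)
paraphrased = bm25_rank("schedule meeting", CATALOG)
```

In this sketch the paraphrased query can only reach evt_create through its description, since the tool name shares no terms with "schedule meeting". That is the kind of gap scenario B is meant to probe.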

Each scenario runs with tool_search.enabled = on and again with off for an A/B baseline. The harness records:

- bridge_calls (the tool_search / tool_describe / tool_call sequence the model emitted)
- underlying_tool_calls (what actually ran through the registry dispatcher)
- final_response, iteration count, elapsed time, any errors
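
As a rough illustration of what one such record might look like on disk, here is a sketch under stated assumptions. The RunRecord class, the names iterations, elapsed_s and errors, and the shape of each call entry are hypothetical; only bridge_calls, underlying_tool_calls, final_response and the `<scenario>__enabled.json` / `<scenario>__disabled.json` naming come from this README. The values are placeholders, not real harness output.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class RunRecord:
    # Hypothetical container; the harness's real schema may differ.
    scenario: str
    tool_search_enabled: bool
    bridge_calls: list[dict] = field(default_factory=list)
    underlying_tool_calls: list[dict] = field(default_factory=list)
    final_response: str = ""
    iterations: int = 0
    elapsed_s: float = 0.0
    errors: list[str] = field(default_factory=list)


def write_record(record: RunRecord, out_dir: Path) -> Path:
    """Write one transcript as <scenario>__enabled.json or __disabled.json."""
    mode = "enabled" if record.tool_search_enabled else "disabled"
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{record.scenario}__{mode}.json"
    # Explicit UTF-8 so the file reads back identically on Windows.
    path.write_text(json.dumps(asdict(record), indent=2), encoding="utf-8")
    return path


# Placeholder record for illustration only.
record = RunRecord(
    scenario="obvious_single",
    tool_search_enabled=True,
    bridge_calls=[{"tool": "tool_search"}, {"tool": "tool_call"}],
    underlying_tool_calls=[{"tool": "github_create_issue"}],
    final_response="(placeholder)",
)

with tempfile.TemporaryDirectory() as tmp:
    path = write_record(record, Path(tmp))
    loaded = json.loads(path.read_text(encoding="utf-8"))
```

Keeping bridge_calls separate from underlying_tool_calls lets the A/B comparison ask two different questions: what the model did at the bridge layer, and which registry tools actually ran.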

Output structure

scripts/out/
  <scenario>__enabled.json    # tool_search ON
  <scenario>__disabled.json   # tool_search OFF
  _summary.json               # one-line summary across all runs

Transcripts are not checked in: they are non-deterministic model output that goes stale when the model changes, so scripts/out/ is gitignored. Each run produces slightly different transcripts, but the expected_underlying_tools assertions should remain satisfied.
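
One way an expected_underlying_tools check can stay stable across non-deterministic runs is to compare the set of tools that ran rather than the exact transcript. The sketch below assumes each underlying call is a dict with a "tool" key, and the Slack tool name is hypothetical; the harness's real assertion may be implemented differently.

```python
def missing_expected_tools(expected: list[str],
                           underlying_tool_calls: list[dict]) -> list[str]:
    """Return the expected tool names that never ran, ignoring order and extras."""
    called = {call.get("tool") for call in underlying_tool_calls}
    return [name for name in expected if name not in called]


# Scenario C style check: two deferred tools, called in either order.
# "slack_post_message" is an invented name for the example.
expected = ["github_create_issue", "slack_post_message"]
run_one = [{"tool": "github_create_issue"}, {"tool": "slack_post_message"}]
run_two = [{"tool": "slack_post_message"}, {"tool": "read_file"},
           {"tool": "github_create_issue"}]
incomplete = [{"tool": "github_create_issue"}]

ok_one = missing_expected_tools(expected, run_one)
ok_two = missing_expected_tools(expected, run_two)
gap = missing_expected_tools(expected, incomplete)
```

For scenario E (no tool needed), the same helper with an empty expected list passes trivially, so a separate check that bridge_calls is empty would be needed to catch spurious tool_search invocations.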
