[feat] Add support for azd ai agent eval + optimize#8306
Conversation
There was a problem hiding this comment.
Pull request overview
Adds first-class eval and optimize workflows to the azure.ai.agents azd extension, including new API clients/models, polling helpers, and CLI commands for running, listing, showing, updating, and deploying optimization/evaluation assets.
Changes:
- Introduces Optimize API client/models + polling loop, with corresponding unit tests.
- Introduces Eval API client/helpers (portal URLs, polling, generation request helpers, config parsing/writing) + unit tests.
- Adds new CLI command groups (
azd ai agent eval,azd ai agent optimize) and integrates optimization deployment reporting into post-deploy.
Reviewed changes
Copilot reviewed 62 out of 62 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| cli/azd/extensions/azure.ai.agents/internal/project/service_target_agent.go | Adds a deploy note pointing users to eval initialization. |
| cli/azd/extensions/azure.ai.agents/internal/pkg/agents/optimize_api/poller.go | New polling loop for optimization job status. |
| cli/azd/extensions/azure.ai.agents/internal/pkg/agents/optimize_api/poller_test.go | Tests optimization polling behavior. |
| cli/azd/extensions/azure.ai.agents/internal/pkg/agents/optimize_api/models.go | Adds optimization request/response models and status helpers. |
| cli/azd/extensions/azure.ai.agents/internal/pkg/agents/optimize_api/models_test.go | JSON round-trip tests for optimization models. |
| cli/azd/extensions/azure.ai.agents/internal/pkg/agents/optimize_api/client.go | Adds Optimize API HTTP client for submit/status/list/cancel/promote/candidate fetch. |
| cli/azd/extensions/azure.ai.agents/internal/pkg/agents/optimize_api/client_test.go | Tests Optimize API client behaviors and error handling. |
| cli/azd/extensions/azure.ai.agents/internal/pkg/agents/opteval/yaml_test.go | Adds YAML round-trip tests for shared opt/eval config structures. |
| cli/azd/extensions/azure.ai.agents/internal/pkg/agents/opteval/state.go | Persists eval init/run state across invocations via azd environment keys. |
| cli/azd/extensions/azure.ai.agents/internal/pkg/agents/eval_api/portal_urls.go | Constructs Foundry portal URLs for eval and optimization artifacts. |
| cli/azd/extensions/azure.ai.agents/internal/pkg/agents/eval_api/poller.go | Adds a generalized poller for eval generation jobs with typed status/errors. |
| cli/azd/extensions/azure.ai.agents/internal/pkg/agents/eval_api/poller_test.go | Tests eval poller status/timeout/cancel semantics. |
| cli/azd/extensions/azure.ai.agents/internal/pkg/agents/eval_api/operations.go | Adds Eval API HTTP operations with a generic typed request helper. |
| cli/azd/extensions/azure.ai.agents/internal/pkg/agents/eval_api/generation.go | Helper builders for generation sources, evaluator classification, dataset-name detection. |
| cli/azd/extensions/azure.ai.agents/internal/pkg/agents/eval_api/generation_test.go | Tests generation helper behavior. |
| cli/azd/extensions/azure.ai.agents/internal/pkg/agents/eval_api/eval_config.go | Adds eval.yaml config loading/writing/validation and request building. |
| cli/azd/extensions/azure.ai.agents/internal/pkg/agents/eval_api/eval_config_test.go | Tests eval config validation and request building. |
| cli/azd/extensions/azure.ai.agents/internal/pkg/agents/eval_api/artifacts.go | Handles local artifact paths, dataset downloads, evaluator artifact persistence, JSON helpers. |
| cli/azd/extensions/azure.ai.agents/internal/pkg/agents/dataset_api/operations_test.go | Expands dataset API tests (client behavior and download). |
| cli/azd/extensions/azure.ai.agents/internal/pkg/agents/dataset_api/models.go | Adds dataset models + versioning and JSONL helpers. |
| cli/azd/extensions/azure.ai.agents/internal/pkg/agents/dataset_api/models_test.go | Tests dataset model helpers. |
| cli/azd/extensions/azure.ai.agents/internal/cmd/root.go | Registers new top-level eval/optimize commands. |
| cli/azd/extensions/azure.ai.agents/internal/cmd/optimize_test.go | Command-shape tests for optimize command and helpers. |
| cli/azd/extensions/azure.ai.agents/internal/cmd/optimize_status.go | Implements optimize status (with optional last-job fallback and watch). |
| cli/azd/extensions/azure.ai.agents/internal/cmd/optimize_status_test.go | Tests optimize status arg/flag shape. |
| cli/azd/extensions/azure.ai.agents/internal/cmd/optimize_list.go | Implements optimize list rendering and status filter validation. |
| cli/azd/extensions/azure.ai.agents/internal/cmd/optimize_list_test.go | Tests optimize list flags and connection flag presence. |
| cli/azd/extensions/azure.ai.agents/internal/cmd/optimize_helpers.go | Adds shared optimize helpers (endpoint resolution, last-job persistence, postdeploy reporting). |
| cli/azd/extensions/azure.ai.agents/internal/cmd/optimize_deploy_test.go | Tests deploy helper functions and command shape. |
| cli/azd/extensions/azure.ai.agents/internal/cmd/optimize_config_test.go | Tests optimize config loading/validation/request building and skill parsing. |
| cli/azd/extensions/azure.ai.agents/internal/cmd/optimize_cancel.go | Implements optimize cancel. |
| cli/azd/extensions/azure.ai.agents/internal/cmd/optimize_cancel_test.go | Tests optimize cancel arg/flag shape. |
| cli/azd/extensions/azure.ai.agents/internal/cmd/listen.go | Hooks postdeploy to report optimization candidate deployments. |
| cli/azd/extensions/azure.ai.agents/internal/cmd/listen_test.go | Adds/extends tests for hosted-agent detection, env var resolution, and postdeploy no-op paths. |
| cli/azd/extensions/azure.ai.agents/internal/cmd/eval_update.go | Implements eval update to upload local dataset/evaluator edits and update config versions. |
| cli/azd/extensions/azure.ai.agents/internal/cmd/eval_show.go | Implements eval show for eval definitions, run history, and run details. |
| cli/azd/extensions/azure.ai.agents/internal/cmd/eval_show_test.go | Tests eval show/update command shapes and parent command wiring. |
| cli/azd/extensions/azure.ai.agents/internal/cmd/eval_run.go | Implements eval run including dataset source selection and polling for run completion. |
| cli/azd/extensions/azure.ai.agents/internal/cmd/eval_run_test.go | Tests helper routines used by eval run. |
| cli/azd/extensions/azure.ai.agents/internal/cmd/eval_progress.go | Adds an interactive progress/spinner helper for eval flows. |
| cli/azd/extensions/azure.ai.agents/internal/cmd/eval_progress_test.go | Tests duration formatting helper. |
| cli/azd/extensions/azure.ai.agents/internal/cmd/eval_list.go | Implements eval list with parallel run-summary fetching. |
| cli/azd/extensions/azure.ai.agents/internal/cmd/eval_list_test.go | Tests eval list command shape/flags. |
| cli/azd/extensions/azure.ai.agents/internal/cmd/eval_helpers.go | Adds shared helpers for portal URLs, baseline config writing, JSONL loading, and status formatting. |
| cli/azd/extensions/azure.ai.agents/internal/cmd/eval_helpers_test.go | Tests shared eval helper behavior. |
wbreza
left a comment
There was a problem hiding this comment.
Thanks for shipping this — eval and optimize are big, valuable additions to the agents extension, and the test scaffolding around the new API clients shows real care. A few blockers and a handful of structural concerns are worth addressing before merge. Findings are grouped by severity.
🔴 CRITICAL — blockers
1. Test suite is broken on a clean checkout (23 failures)
Verified locally on the PR branch (go test -short ./internal/... in cli/azd/extensions/azure.ai.agents/):
dataset_api/eval_apiHTTP tests — everyhttptest.NewServer-based test (TestCreateDataset_Success,TestGetDataset_Success,TestDoRequest_EmptyBody,TestCreateDataGenerationJob_Success,TestGetDatasetCredential_Success,TestCreate/GetEvaluatorGenerationJob_Success,TestCreate/Get/List OpenAIEval[Run]_Success, etc.) fails with:authenticated requests are not permitted for non TLS protected (http) endpoints. The bearer token policy in the real pipeline refuses plain HTTP.optimize_apisolved this withNewOptimizeClientFromPipeline— please do the same fordataset_api/eval_apiand have tests construct via that helper.cmd/eval_init_test.go—TestNewEvalConfig/uses_default_name(line 145) andTestNewEvalConfig/stores_instruction_file_when_file_provided(line 176) assertcfg.Agent.Instruction.Value/.FilebutnewEvalConfigineval_init_jobs.gonever populatesAgent.Instruction.TestBuildOpenAIEvalRequest(line 429) expects{{item.messages}}but the output is empty. Either wire instruction resolution intonewEvalConfig(likely the original intent) or fix the assertions.cmd/eval_list_test.go:23— asserts--limitdefault"20", buteval_list.go:41defaults to10.cmd/optimize_test.go:46TestOptimizeCommand_AcceptsConfigFlag— asserts--strategyand--endpointflags exist; neither is registered inoptimize.go.
2. Zip-slip / path traversal — eval_api/artifacts.go:76-88 (dataset download)
blobName comes from the Azure Blob XML listing and is fed into filepath.Join(destDir, filepath.FromSlash(blobName)) with no validation. A malicious or compromised storage endpoint returning "../../../home/<user>/.bashrc" will overwrite arbitrary files outside destDir. Fix: after filepath.Clean, reject paths that are absolute or ..-prefixed, and verify filepath.Rel(destDir, dest) does not escape.
3. Zip-slip / path traversal — optimize_apply.go:430-447 (candidate file download)
Same class of bug. f.Path is server-controlled and written via filepath.Join(destDir, …) without containment check. Apply the same Rel/Clean guard.
4. SAS tokens written to stderr unconditionally
internal/pkg/agents/dataset_api/operations.go (≈L210/255/295 and the generic request logger) calls log.Printf("[dataset_api] downloading blob: %s", u.String()) with the full ?sv=…&sig=… SAS query string. Stdlib log always writes to stderr — captured by CI logs, shell wrappers, support bundles. Sibling code (connection_credentials.go) correctly logs key names only. Strip the query string before logging (u2 := *u; u2.RawQuery = ""), and remove the raw request body / response body dumps in dataset_api/operations.go and eval_api/operations.go — they bypass the SDK''s redaction policy and will leak SAS URIs, agent instructions, and eval datasets when --debug is on. Also set policy.LogOptions{IncludeBody: false} on the three pipelines (or gate behind a debug env-var) — IncludeBody: true causes the same leakage when AZURE_SDK_GO_LOGGING=all.
5. Dataset finalize uses container URI, not blob URI — dataset_api/operations.go:79-113
Step 2 uploads to containerSASUri/<name>.jsonl, but step 3 calls FinalizeDatasetVersion(..., pending.ResolvedBlobURI(), …), where ResolvedBlobURI() returns the container URI without the blob suffix. The registered dataset will point at the container rather than the JSONL just uploaded. Build the finalize URI explicitly:
dataURI := strings.TrimSuffix(pending.ResolvedBlobURI(), "/") + "/" + blobNameand add a unit test asserting the finalize payload matches the upload location.
6. Naive XML substring parse for blob listing — dataset_api/operations.go:322-341
parseBlobNames literally scans for <Name> and will also match the same element inside <NextMarker>, <Prefix>, and any future API element. Use encoding/xml against the documented EnumerationResults > Blobs > Blob > Name schema. Wrong/extra blob names → downstream 404s.
7. optimize_api/poller.go:25-50 — no cap, panics on zero Interval
PollUntilDone loops until terminal status or ctx.Done(); there is no MaxAttempts (the eval poller has one). time.NewTicker(0) panics if a caller leaves Interval zero. Mirror eval_api.PollerOptions.MaxAttempts and validate Interval > 0 (default to a few seconds).
8. JSON tag casing mismatch — dataset_api/models.go:23-30
Dataset.DataURI is tagged json:"data_uri" but every other request type in the same file (and the service itself) uses camelCase (dataUri, blobUri, contentUri). GetDataset currently deserializes to an empty struct. Normalize the Dataset fields and add a fixture-based unmarshal test against a recorded service payload.
🟠 HIGH
API / architecture
- Duplicate client construction across
dataset_api/eval_api/optimize_api— extract a sharedaiapi.Clientto dedupe the ~30 lines of pipeline setup and thedoRequest/doRequestTypedhelpers.optimize_apiis the worst offender (re-implements the same shape 6+ times inline). optimize_api/client.go(L76, 120, 156, 194, 231, 270, 305, 341) — endpoints built viafmt.Sprintfinstead ofurl.Parse + Query().Set(...). Trailing-slash and unescaped query values (notablystatus=at L158) are buggy/unsafe.GetCandidateFilealready does this correctly — use that pattern everywhere.eval_api/artifacts.go:40-104DownloadDatasetArtifact— swallows credential errors, list failures, per-blob download/write failures, and empty blob lists; returns("", nil)for all of them. Callers cannot distinguish "no artifact configured" from "auth failed". Wrap anderrors.Joinper-blob errors.optimize_api/client.go:269GetCandidateConfigreturnsany— callers lose all schema discipline. Returnjson.RawMessageor a typed struct.optimize_api/models.go:117EvalModel— missing,omitempty; serializes"evalModel":""and services commonly reject empty fields.eval_api/portal_urls.go:66-70OptimizationURL— hardcodeseastus2euapcanary host and?flight=enable_faos_read_ui. Customers in other regions get a broken portal link. Pull host fromPortalPrefixand gate the flight flag.
CLI / UX
- Help text /
Usedrift (optimize.go:96-102,eval.go:80-88) —optimizementionsstatus / list / cancel / deployinExamplebut omits the registeredapplysubcommand.evalLongmentions onlyinitandrunwhileupdate,list, andshoware all registered. Users won''t discover them. optimize.gorepeatedly opensazdext.NewAzdClient()in 7+ places. The eval path threads a single client throughevalResolvedContext— do the same here. Today oneoptimizeinvocation churns the gRPC channel 7+ times.- Default
--eval-modeldrift (optimize.go:139vsoptimize_config.go:95) — flag default"gpt-4.1-mini", config default"gpt-4o". Because the flag default is always non-empty, the config default is dead code. Single source of truth in a constant. eval_list.go:72-89— unbounded fan-out — spawnsflags.limitgoroutines (user-controlled, no cap).--limit 1000fires 1000 concurrent HTTP requests. Bound with a semaphore (8–16).eval_progress.go:42-45— data race onp.spinning—Start()writes withoutmu; every other method holdsmu.go test -racewill catch this.eval_init.go:153-155—if flags.resetDefaults … { opteval.ClearEvalState(…) }discards the error. User thinks state was reset; it wasn''t.eval_init_prompts.go:50-51, 145—needsGeneration := true/needsEvalGen := trueare unconditional locals makingif needsGenerationalways true andif !needsGenerationalways false. Either wire to a real flag or delete the dead branching.eval_init.go:199-200— error message says--gen-instruction is requiredbut the condition tests four orthogonal inputs. Make the message reflect the actual options.
Lint / process
service_target_agent.go:1526—llllint violation — ~155 visual columns vs 125 max enforced bygolangci-lint. Will fail CI. Break the string concatenation.extension.yamlversion not bumped — adds new subcommands but ships at the same0.1.32-previewalready inregistry.jsononmain. Without bumping (and a registry update PR), users on the current published version do not get the new commands. Document the rollout path in the PR description.
🟡 MEDIUM (selected)
optimize_api/poller.go:30-33— single transient HTTP failure aborts the poller. Add a small bounded retry on 5xx/connection-reset.opteval/state.go:46-86—LoadEvalStateswallows all gRPC errors (silently restarts job creation → double billing);SaveEvalStateis non-atomic (partial failure mid-loop leaves mixed-vintage state). Log/wrap errors; consider batchedSetValuesifazdextsupports one.opteval/yaml.go:397-403—if o.MaxIterations <= 0 { o.MaxIterations = 4 }overrides explicitmax_iterations: 0. Distinguish unset from zero (*int) or document.eval_api/eval_config.go:65-82Validate— doesn''t checkcfg.Nameor thatcfg.Evaluatorsis non-empty; surface as actionable errors before the service rejects with an opaque one.eval_api/artifacts.go:140-145—DatasetArtifactPathdrops the version segment that the doc comment promises (datasets/<name>/<version>/). Versioned downloads will overwrite each other on disk.eval_api/artifacts.go:167-225—SaveEvaluatorResultandWriteEvalReviewArtifactsboth return void and discard all I/O errors. Return errors per repo convention.optimize.go:140--targetshort-s— collides with--session-idsemantics elsewhere in the extension.-tis free.eval_init.go:89--out-fileshort-Oandeval_init.go:84--gen-instruction-fileshort-G— uppercase short flags are non-idiomatic. Use lowercase or drop.eval_run.go:237— hardcodedinterval = 5*time.Second; maxAttempts = 360(a 30-min budget). Other commands expose--poll-interval. Extract to constants and/or a flag.optimize.go:140--target— accepts arbitrary strings; help text says "instruction, skill".--statusis correctly whitelisted (optimize_list.go:58-63); do the same here.eval_show.go:135— pagination message always reads "showing N of N" because both sides are the same value. Drop "of N" or fetch a total.- Error categorization — eval/optimize commands return plain
fmt.Errorffor user-facing terminal failures whileeval.go:178-183already usesexterrors.Dependency. Converteval run was canceled→exterrors.Cancelled, validation errors →exterrors.Validationwith codes, etc. optimize_deploy.go:261-268— brittle substring matching on server error wording to detect reserved env vars. Useazcore.ResponseError.StatusCode+ a stable API error code (add a TODO if the contract isn''t there yet).optimize.go:303-306andoptimize_prompts.go:278-298— commented-out blocks with// TODO: re-enable when tools optimization is supported. Either feature-flag or delete + file an issue.listen.gopostdeploy hook —reportOptimizationDeploymentsis fire-and-forget with a broadrecover(). Narrow the recover to the report call so genuine panics in surrounding deploy logic still surface.service_target_agent.go:1521-1527— unconditional "Set up an evaluation suite…" marketing tag pointing at another alpha command, appended to every hosted agent deploy with no opt-out, no telemetry/feature-flag, no test. Suggest gating on absence ofeval.yamlin the service dir.listen_test.go(+309 lines) — none of the new tests cover the +11-line behavioral change inlisten.go(reportOptimizationDeploymentscall frompostdeployHandler). The factory injection was added for this purpose — add at least onepostdeployHandler-level test that exercises it with a fakeOptimizeClient.- Test gaps — these production files have zero test coverage and contain non-trivial logic:
cmd/eval_init_jobs.go(444 LOC, retry/resume/poll),cmd/optimize_prompts.go(446 LOC),cmd/eval_init_prompts.go(367 LOC),cmd/eval_update.go(245 LOC),pkg/agents/eval_api/artifacts.go(download/write),pkg/agents/eval_api/portal_urls.go(pure string building — trivially testable),pkg/agents/opteval/state.go(state persistence). No end-to-end CLI test drivesrunOptimize/runEvalInitthrough with a fake client injected.
🔵 LOW (selected)
optimize_api/client.go:16— drop thenetURL "net/url"alias.eval_api/models.go:276,284—type X = stringaliases give no type safety; switch to non-aliastype X string.optimize_api/models.go:205— renameTaskScore.DurationtoDurationSecondsto match the JSON tag andJobProgress.ElapsedSeconds.optimize_api/models.go:165-185—JobError.UnmarshalJSONof explicitnullproducesJobError{Message:""}that looks like a real failure; short-circuitnull.dataset_api/models.go:142-167—NextVersion("1.5") → "2.5"is surprising. Document or switch to integer-major bump.eval_api/poller.go:158-163— first poll sleeps beforeGetJob; cheap to invert.eval_run.go:317-322—agentVersionPtrcan benew(version).optimize.go:123—MaximumNArgs(1)positional duplicates--agent; pick one mechanism.optimize.go:564-572—truncateStringmay duplicate an existing helper.eval_init.go:91—--reset-defaultsis non-idiomatic; consider--force/--overwrite.- cspell additions needed —
opteval,FAOS,frontmatter,Parseable,restype,cand. Add to the cspell dictionary so spellcheck stays green.
Process
- Commit history (24+ commits) —
more bugbash,more bug bash,more,fix,more fixes,check ui,review version, etc. Please squash-merge with a single conventional-commit message; do not land this history onmain. - PR description does not mention the
service_target_agent.godrive-by that injects aneval initsuggestion into every hosted agent deploy. That''s a UX policy decision worth calling out explicitly so reviewers don''t miss it. - Rollout — please add a short "How users get this" section: which
extension.yamlversion, which registry PR follows, whether users needazd x install --source ...in the interim.
What''s working well
- The
azd ai agent eval/optimizecommand shape is reasonable and feels cohesive with the existing extension. - Test scaffolding around the API clients is genuinely substantial — the structure is right; it just needs the bearer-policy bypass to actually run.
- Credential handling is clean —
DefaultAzureCredential/AzureDeveloperCLICredentialvia the SDK pipeline, noInsecureSkipVerify, no shell exec, no SSRF surface, file permissions correct (0600/0750). optevalpackage layering is clean (lowest level, no API-package imports).- Properly uses Go 1.26 patterns (
new(T),t.Context(),errors.Join) — no flags on those.
Happy to pair on any of the blockers — particularly the bearer-policy test bypass pattern, since optimize_api already has the right shape to copy.
|
Quick delta from my prior review — the 3 new commits since
None of the 8 🔴 CRITICAL or any 🟠 HIGH findings from my prior review have been addressed. Confirmed unchanged: New ask: please drop the |
|
Thanks — most critical/high issues addressed and tests are green on a clean checkout. Nice turnaround. Remaining items before approval:
Non-blocking nits worth a follow-up issue if not addressed here:
Status summary from my prior reviews: 7 of 8 🔴 CRITICAL fixed (only #8 remains), most 🟠 HIGH fixed, test suite now green (24 prior failures resolved). Big improvement. |
[feat] Add support for azd ai agent eval/optimize
Overview
Adds two new command families —
azd ai agent evalandazd ai agent optimize— to theazure.ai.agentsextension, enabling developers to evaluate and iteratively optimize their AI agents directly from the CLI.Eval Commands (
azd ai agent eval)End-to-end evaluation workflow for deployed agents:
eval init— Generates a local eval suite (eval.yaml) by auto-creating datasets and evaluators from agent traces and instructions. Resolves agent context, submits generation jobs, polls for completion, downloads review artifacts, and writes config to.agent_configs/baseline/.eval run— Executes an evaluation fromeval.yaml— creates or reuses an OpenAI eval, submits a run with the configured dataset, and polls for results.eval list— Lists recent evaluations with run counts and last status.eval show— Displays eval definitions, run history, and per-criteria score breakdowns. Supports JSON export via--out-file.eval update— Detects locally-edited evaluators/datasets (vialocal_uripointers in config), uploads them as new versions, and updates version references ineval.yaml.Optimize Commands (
azd ai agent optimize)Iterative agent optimization with baseline scoring and candidate generation:
optimize [agent]— Submits an optimization job targeting instruction, skills, or model. Builds request from config/flags, polls for progress, and reports best candidate score.optimize status— Displays job progress — strategy, iteration count, task count, best score, and elapsed time. Supports--watchmode.optimize list— Lists recent optimization runs with status, agent name, and score. Filterable by--status(pending/running/completed/failed/cancelled).optimize cancel— Cancels a running or pending optimization job.optimize apply— Downloads a candidate config and writes it to.agent_configs/{candidate-id}/(instruction, skills, tools). Shows diff vs baseline. Follow up withazd deploy.optimize deploy— Deploys a candidate directly to Azure AI Foundry as a new agent version, bypassing the local apply+deploy cycle.Supporting Packages
eval_api— Client for dataset/evaluator generation jobs, OpenAI eval CRUD, artifact download with SAS URI support, and Foundry portal URL generation.optimize_api— Client for optimization lifecycle: submit, poll status, cancel, list runs, fetch candidate manifests, and download candidate files.opteval— Shared YAML schema (eval.yamlconfig with agent refs, dataset refs, evaluator lists), runtime state persistence in azd environment variables, and.agent_configs/directory layout management.agent_yaml— Agent manifest schema supportinghostedandworkflowagent kinds, tool kinds (function, MCP, OpenAPI, code_interpreter, etc.), and container agent configuration.Key Design Decisions
azure.yamlprojects, persist state (operation IDs, eval IDs) in azd environment variables, and integrate withazd deployfor the apply workflow.--no-promptfor CI scenarios.--no-wait, with status commands to check later.eval init→ edit locally →eval update→eval runenables fast iteration without leaving the terminal.