[feat] Add support for azd ai agent eval + optimize by Zyysurely · Pull Request #8306 · Azure/azure-dev

Zyysurely · 2026-05-21T18:51:05Z

[feat] Add support for azd ai agent eval/optimize

Overview

Adds two new command families — azd ai agent eval and azd ai agent optimize — to the azure.ai.agents extension, enabling developers to evaluate and iteratively optimize their AI agents directly from the CLI.

Eval Commands (`azd ai agent eval`)

End-to-end evaluation workflow for deployed agents:

eval init — Generates a local eval suite (eval.yaml) by auto-creating datasets and evaluators from agent traces and instructions. Resolves agent context, submits generation jobs, polls for completion, downloads review artifacts, and writes config to .agent_configs/baseline/.
eval run — Executes an evaluation from eval.yaml — creates or reuses an OpenAI eval, submits a run with the configured dataset, and polls for results.
eval list — Lists recent evaluations with run counts and last status.
eval show — Displays eval definitions, run history, and per-criteria score breakdowns. Supports JSON export via --out-file.
eval update — Detects locally-edited evaluators/datasets (via local_uri pointers in config), uploads them as new versions, and updates version references in eval.yaml.

Optimize Commands (`azd ai agent optimize`)

Iterative agent optimization with baseline scoring and candidate generation:

optimize [agent] — Submits an optimization job targeting instruction, skills, or model. Builds request from config/flags, polls for progress, and reports best candidate score.
optimize status — Displays job progress — strategy, iteration count, task count, best score, and elapsed time. Supports --watch mode.
optimize list — Lists recent optimization runs with status, agent name, and score. Filterable by --status (pending/running/completed/failed/cancelled).
optimize cancel — Cancels a running or pending optimization job.
optimize apply — Downloads a candidate config and writes it to .agent_configs/{candidate-id}/ (instruction, skills, tools). Shows diff vs baseline. Follow up with azd deploy.
optimize deploy — Deploys a candidate directly to Azure AI Foundry as a new agent version, bypassing the local apply+deploy cycle.

Supporting Packages

eval_api — Client for dataset/evaluator generation jobs, OpenAI eval CRUD, artifact download with SAS URI support, and Foundry portal URL generation.
optimize_api — Client for optimization lifecycle: submit, poll status, cancel, list runs, fetch candidate manifests, and download candidate files.
opteval — Shared YAML schema (eval.yaml config with agent refs, dataset refs, evaluator lists), runtime state persistence in azd environment variables, and .agent_configs/ directory layout management.
agent_yaml — Agent manifest schema supporting hosted and workflow agent kinds, tool kinds (function, MCP, OpenAPI, code_interpreter, etc.), and container agent configuration.

Key Design Decisions

Azd-native integration: Commands resolve agent context from azure.yaml projects, persist state (operation IDs, eval IDs) in azd environment variables, and integrate with azd deploy for the apply workflow.
Interactive + non-interactive modes: All prompts support --no-prompt for CI scenarios.
No-wait support: Long-running generation/optimization jobs can run asynchronously with --no-wait, with status commands to check later.
Local-first iteration loop: eval init → edit locally → eval update → eval run enables fast iteration without leaving the terminal.

Copilot

Pull request overview

Adds first-class eval and optimize workflows to the azure.ai.agents azd extension, including new API clients/models, polling helpers, and CLI commands for running, listing, showing, updating, and deploying optimization/evaluation assets.

Changes:

Introduces Optimize API client/models + polling loop, with corresponding unit tests.
Introduces Eval API client/helpers (portal URLs, polling, generation request helpers, config parsing/writing) + unit tests.
Adds new CLI command groups (azd ai agent eval, azd ai agent optimize) and integrates optimization deployment reporting into post-deploy.

Reviewed changes

Copilot reviewed 62 out of 62 changed files in this pull request and generated 7 comments.

Show a summary per file

File	Description
cli/azd/extensions/azure.ai.agents/internal/project/service_target_agent.go	Adds a deploy note pointing users to eval initialization.
cli/azd/extensions/azure.ai.agents/internal/pkg/agents/optimize_api/poller.go	New polling loop for optimization job status.
cli/azd/extensions/azure.ai.agents/internal/pkg/agents/optimize_api/poller_test.go	Tests optimization polling behavior.
cli/azd/extensions/azure.ai.agents/internal/pkg/agents/optimize_api/models.go	Adds optimization request/response models and status helpers.
cli/azd/extensions/azure.ai.agents/internal/pkg/agents/optimize_api/models_test.go	JSON round-trip tests for optimization models.
cli/azd/extensions/azure.ai.agents/internal/pkg/agents/optimize_api/client.go	Adds Optimize API HTTP client for submit/status/list/cancel/promote/candidate fetch.
cli/azd/extensions/azure.ai.agents/internal/pkg/agents/optimize_api/client_test.go	Tests Optimize API client behaviors and error handling.
cli/azd/extensions/azure.ai.agents/internal/pkg/agents/opteval/yaml_test.go	Adds YAML round-trip tests for shared opt/eval config structures.
cli/azd/extensions/azure.ai.agents/internal/pkg/agents/opteval/state.go	Persists eval init/run state across invocations via azd environment keys.
cli/azd/extensions/azure.ai.agents/internal/pkg/agents/eval_api/portal_urls.go	Constructs Foundry portal URLs for eval and optimization artifacts.
cli/azd/extensions/azure.ai.agents/internal/pkg/agents/eval_api/poller.go	Adds a generalized poller for eval generation jobs with typed status/errors.
cli/azd/extensions/azure.ai.agents/internal/pkg/agents/eval_api/poller_test.go	Tests eval poller status/timeout/cancel semantics.
cli/azd/extensions/azure.ai.agents/internal/pkg/agents/eval_api/operations.go	Adds Eval API HTTP operations with a generic typed request helper.
cli/azd/extensions/azure.ai.agents/internal/pkg/agents/eval_api/generation.go	Helper builders for generation sources, evaluator classification, dataset-name detection.
cli/azd/extensions/azure.ai.agents/internal/pkg/agents/eval_api/generation_test.go	Tests generation helper behavior.
cli/azd/extensions/azure.ai.agents/internal/pkg/agents/eval_api/eval_config.go	Adds eval.yaml config loading/writing/validation and request building.
cli/azd/extensions/azure.ai.agents/internal/pkg/agents/eval_api/eval_config_test.go	Tests eval config validation and request building.
cli/azd/extensions/azure.ai.agents/internal/pkg/agents/eval_api/artifacts.go	Handles local artifact paths, dataset downloads, evaluator artifact persistence, JSON helpers.
cli/azd/extensions/azure.ai.agents/internal/pkg/agents/dataset_api/operations_test.go	Expands dataset API tests (client behavior and download).
cli/azd/extensions/azure.ai.agents/internal/pkg/agents/dataset_api/models.go	Adds dataset models + versioning and JSONL helpers.
cli/azd/extensions/azure.ai.agents/internal/pkg/agents/dataset_api/models_test.go	Tests dataset model helpers.
cli/azd/extensions/azure.ai.agents/internal/cmd/root.go	Registers new top-level eval/optimize commands.
cli/azd/extensions/azure.ai.agents/internal/cmd/optimize_test.go	Command-shape tests for optimize command and helpers.
cli/azd/extensions/azure.ai.agents/internal/cmd/optimize_status.go	Implements `optimize status` (with optional last-job fallback and watch).
cli/azd/extensions/azure.ai.agents/internal/cmd/optimize_status_test.go	Tests `optimize status` arg/flag shape.
cli/azd/extensions/azure.ai.agents/internal/cmd/optimize_list.go	Implements `optimize list` rendering and status filter validation.
cli/azd/extensions/azure.ai.agents/internal/cmd/optimize_list_test.go	Tests `optimize list` flags and connection flag presence.
cli/azd/extensions/azure.ai.agents/internal/cmd/optimize_helpers.go	Adds shared optimize helpers (endpoint resolution, last-job persistence, postdeploy reporting).
cli/azd/extensions/azure.ai.agents/internal/cmd/optimize_deploy_test.go	Tests deploy helper functions and command shape.
cli/azd/extensions/azure.ai.agents/internal/cmd/optimize_config_test.go	Tests optimize config loading/validation/request building and skill parsing.
cli/azd/extensions/azure.ai.agents/internal/cmd/optimize_cancel.go	Implements `optimize cancel`.
cli/azd/extensions/azure.ai.agents/internal/cmd/optimize_cancel_test.go	Tests `optimize cancel` arg/flag shape.
cli/azd/extensions/azure.ai.agents/internal/cmd/listen.go	Hooks postdeploy to report optimization candidate deployments.
cli/azd/extensions/azure.ai.agents/internal/cmd/listen_test.go	Adds/extends tests for hosted-agent detection, env var resolution, and postdeploy no-op paths.
cli/azd/extensions/azure.ai.agents/internal/cmd/eval_update.go	Implements `eval update` to upload local dataset/evaluator edits and update config versions.
cli/azd/extensions/azure.ai.agents/internal/cmd/eval_show.go	Implements `eval show` for eval definitions, run history, and run details.
cli/azd/extensions/azure.ai.agents/internal/cmd/eval_show_test.go	Tests eval show/update command shapes and parent command wiring.
cli/azd/extensions/azure.ai.agents/internal/cmd/eval_run.go	Implements `eval run` including dataset source selection and polling for run completion.
cli/azd/extensions/azure.ai.agents/internal/cmd/eval_run_test.go	Tests helper routines used by `eval run`.
cli/azd/extensions/azure.ai.agents/internal/cmd/eval_progress.go	Adds an interactive progress/spinner helper for eval flows.
cli/azd/extensions/azure.ai.agents/internal/cmd/eval_progress_test.go	Tests duration formatting helper.
cli/azd/extensions/azure.ai.agents/internal/cmd/eval_list.go	Implements `eval list` with parallel run-summary fetching.
cli/azd/extensions/azure.ai.agents/internal/cmd/eval_list_test.go	Tests eval list command shape/flags.
cli/azd/extensions/azure.ai.agents/internal/cmd/eval_helpers.go	Adds shared helpers for portal URLs, baseline config writing, JSONL loading, and status formatting.
cli/azd/extensions/azure.ai.agents/internal/cmd/eval_helpers_test.go	Tests shared eval helper behavior.

wbreza

Thanks for shipping this — eval and optimize are big, valuable additions to the agents extension, and the test scaffolding around the new API clients shows real care. A few blockers and a handful of structural concerns are worth addressing before merge. Findings are grouped by severity.

🔴 CRITICAL — blockers

1. Test suite is broken on a clean checkout (23 failures)

Verified locally on the PR branch (go test -short ./internal/... in cli/azd/extensions/azure.ai.agents/):

dataset_api / eval_api HTTP tests — every httptest.NewServer-based test (TestCreateDataset_Success, TestGetDataset_Success, TestDoRequest_EmptyBody, TestCreateDataGenerationJob_Success, TestGetDatasetCredential_Success, TestCreate/GetEvaluatorGenerationJob_Success, TestCreate/Get/List OpenAIEval[Run]_Success, etc.) fails with: authenticated requests are not permitted for non TLS protected (http) endpoints. The bearer token policy in the real pipeline refuses plain HTTP. optimize_api solved this with NewOptimizeClientFromPipeline — please do the same for dataset_api / eval_api and have tests construct via that helper.
cmd/eval_init_test.go — TestNewEvalConfig/uses_default_name (line 145) and TestNewEvalConfig/stores_instruction_file_when_file_provided (line 176) assert cfg.Agent.Instruction.Value / .File but newEvalConfig in eval_init_jobs.go never populates Agent.Instruction. TestBuildOpenAIEvalRequest (line 429) expects {{item.messages}} but the output is empty. Either wire instruction resolution into newEvalConfig (likely the original intent) or fix the assertions.
cmd/eval_list_test.go:23 — asserts --limit default "20", but eval_list.go:41 defaults to 10.
cmd/optimize_test.go:46 TestOptimizeCommand_AcceptsConfigFlag — asserts --strategy and --endpoint flags exist; neither is registered in optimize.go.

2. Zip-slip / path traversal — `eval_api/artifacts.go:76-88` (dataset download)

blobName comes from the Azure Blob XML listing and is fed into filepath.Join(destDir, filepath.FromSlash(blobName)) with no validation. A malicious or compromised storage endpoint returning "../../../home/<user>/.bashrc" will overwrite arbitrary files outside destDir. Fix: after filepath.Clean, reject paths that are absolute or ..-prefixed, and verify filepath.Rel(destDir, dest) does not escape.

3. Zip-slip / path traversal — `optimize_apply.go:430-447` (candidate file download)

Same class of bug. f.Path is server-controlled and written via filepath.Join(destDir, …) without containment check. Apply the same Rel/Clean guard.

4. SAS tokens written to stderr unconditionally

internal/pkg/agents/dataset_api/operations.go (≈L210/255/295 and the generic request logger) calls log.Printf("[dataset_api] downloading blob: %s", u.String()) with the full ?sv=…&sig=… SAS query string. Stdlib log always writes to stderr — captured by CI logs, shell wrappers, support bundles. Sibling code (connection_credentials.go) correctly logs key names only. Strip the query string before logging (u2 := *u; u2.RawQuery = ""), and remove the raw request body / response body dumps in dataset_api/operations.go and eval_api/operations.go — they bypass the SDK's redaction policy and will leak SAS URIs, agent instructions, and eval datasets when --debug is on. Also set policy.LogOptions{IncludeBody: false} on the three pipelines (or gate behind a debug env-var) — IncludeBody: true causes the same leakage when AZURE_SDK_GO_LOGGING=all.

5. Dataset finalize uses container URI, not blob URI — `dataset_api/operations.go:79-113`

Step 2 uploads to containerSASUri/<name>.jsonl, but step 3 calls FinalizeDatasetVersion(..., pending.ResolvedBlobURI(), …), where ResolvedBlobURI() returns the container URI without the blob suffix. The registered dataset will point at the container rather than the JSONL just uploaded. Build the finalize URI explicitly:

dataURI := strings.TrimSuffix(pending.ResolvedBlobURI(), "/") + "/" + blobName

and add a unit test asserting the finalize payload matches the upload location.

6. Naive XML substring parse for blob listing — `dataset_api/operations.go:322-341`

parseBlobNames literally scans for <Name> and will also match the same element inside <NextMarker>, <Prefix>, and any future API element. Use encoding/xml against the documented EnumerationResults > Blobs > Blob > Name schema. Wrong/extra blob names → downstream 404s.
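One way to do the structured parse with encoding/xml is sketched below. The struct covers only the EnumerationResults > Blobs > Blob > Name path named above, and the sample listing is invented for illustration, not a recorded service response.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// blobEntry and enumerationResults model only the fields this sketch needs.
type blobEntry struct {
	Name string `xml:"Name"`
}

type enumerationResults struct {
	XMLName    xml.Name    `xml:"EnumerationResults"`
	Blobs      []blobEntry `xml:"Blobs>Blob"`
	NextMarker string      `xml:"NextMarker"`
}

// sampleListing is a made-up listing; the BlobPrefix entry also contains a
// <Name> element, which a substring scan would wrongly pick up.
const sampleListing = `<?xml version="1.0" encoding="utf-8"?>
<EnumerationResults ContainerName="c">
  <Blobs>
    <Blob><Name>data/train.jsonl</Name></Blob>
    <BlobPrefix><Name>logs/</Name></BlobPrefix>
    <Blob><Name>data/test.jsonl</Name></Blob>
  </Blobs>
  <NextMarker />
</EnumerationResults>`

// parseBlobNames returns only names of <Blob> elements under <Blobs>.
func parseBlobNames(body []byte) ([]string, error) {
	var r enumerationResults
	if err := xml.Unmarshal(body, &r); err != nil {
		return nil, fmt.Errorf("parse blob listing: %w", err)
	}
	names := make([]string, 0, len(r.Blobs))
	for _, b := range r.Blobs {
		names = append(names, b.Name)
	}
	return names, nil
}

func main() {
	names, err := parseBlobNames([]byte(sampleListing))
	fmt.Println(names, err)
}
```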

7. `optimize_api/poller.go:25-50` — no cap, panics on zero `Interval`

PollUntilDone loops until terminal status or ctx.Done(); there is no MaxAttempts (the eval poller has one). time.NewTicker(0) panics if a caller leaves Interval zero. Mirror eval_api.PollerOptions.MaxAttempts and validate Interval > 0 (default to a few seconds).

8. JSON tag casing mismatch — `dataset_api/models.go:23-30`

Dataset.DataURI is tagged json:"data_uri" but every other request type in the same file (and the service itself) uses camelCase (dataUri, blobUri, contentUri). GetDataset currently deserializes to an empty struct. Normalize the Dataset fields and add a fixture-based unmarshal test against a recorded service payload.

🟠 HIGH

API / architecture

Duplicate client construction across dataset_api/eval_api/optimize_api — extract a shared aiapi.Client to dedupe the ~30 lines of pipeline setup and the doRequest / doRequestTyped helpers. optimize_api is the worst offender (re-implements the same shape 6+ times inline).
optimize_api/client.go (L76, 120, 156, 194, 231, 270, 305, 341) — endpoints built via fmt.Sprintf instead of url.Parse + Query().Set(...). Trailing-slash and unescaped query values (notably status= at L158) are buggy/unsafe. GetCandidateFile already does this correctly — use that pattern everywhere.
eval_api/artifacts.go:40-104 DownloadDatasetArtifact — swallows credential errors, list failures, per-blob download/write failures, and empty blob lists; returns ("", nil) for all of them. Callers cannot distinguish "no artifact configured" from "auth failed". Wrap and errors.Join per-blob errors.
optimize_api/client.go:269 GetCandidateConfig returns any — callers lose all schema discipline. Return json.RawMessage or a typed struct.
optimize_api/models.go:117 EvalModel — missing ,omitempty; serializes "evalModel":"" and services commonly reject empty fields.
eval_api/portal_urls.go:66-70 OptimizationURL — hardcodes eastus2euap canary host and ?flight=enable_faos_read_ui. Customers in other regions get a broken portal link. Pull host from PortalPrefix and gate the flight flag.
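For the endpoint-construction item above, the url.Parse + Query().Set pattern could be sketched as follows. The `optimizations` path segment and the parameter names are placeholders, not the real route.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// buildListURL is a hypothetical helper. It tolerates a trailing slash on the
// endpoint and escapes query values instead of interpolating them.
func buildListURL(endpoint, status string, limit int) (string, error) {
	u, err := url.Parse(endpoint)
	if err != nil {
		return "", fmt.Errorf("parse endpoint: %w", err)
	}
	u = u.JoinPath("optimizations") // placeholder path segment
	q := u.Query()
	if status != "" {
		q.Set("status", status)
	}
	q.Set("limit", strconv.Itoa(limit))
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	for _, ep := range []string{"https://example.test/api", "https://example.test/api/"} {
		s, err := buildListURL(ep, "a b&c", 10)
		fmt.Println(s, err)
	}
}
```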

CLI / UX

Help text / Use drift (optimize.go:96-102, eval.go:80-88) — optimize mentions status / list / cancel / deploy in Example but omits the registered apply subcommand. eval Long mentions only init and run while update, list, and show are all registered. Users won't discover them.
optimize.go repeatedly opens azdext.NewAzdClient() in 7+ places. The eval path threads a single client through evalResolvedContext — do the same here. Today one optimize invocation churns the gRPC channel 7+ times.
Default --eval-model drift (optimize.go:139 vs optimize_config.go:95) — flag default "gpt-4.1-mini", config default "gpt-4o". Because the flag default is always non-empty, the config default is dead code. Single source of truth in a constant.
eval_list.go:72-89 — unbounded fan-out — spawns flags.limit goroutines (user-controlled, no cap). --limit 1000 fires 1000 concurrent HTTP requests. Bound with a semaphore (8–16).
eval_progress.go:42-45 — data race on p.spinning — Start() writes without mu; every other method holds mu. go test -race will catch this.
eval_init.go:153-155 — if flags.resetDefaults … { opteval.ClearEvalState(…) } discards the error. User thinks state was reset; it wasn't.
eval_init_prompts.go:50-51, 145 — needsGeneration := true / needsEvalGen := true are unconditional locals making if needsGeneration always true and if !needsGeneration always false. Either wire to a real flag or delete the dead branching.
eval_init.go:199-200 — error message says --gen-instruction is required but the condition tests four orthogonal inputs. Make the message reflect the actual options.

Lint / process

service_target_agent.go:1526 — lll lint violation — ~155 visual columns vs 125 max enforced by golangci-lint. Will fail CI. Break the string concatenation.
extension.yaml version not bumped — adds new subcommands but ships at the same 0.1.32-preview already in registry.json on main. Without bumping (and a registry update PR), users on the current published version do not get the new commands. Document the rollout path in the PR description.

🟡 MEDIUM (selected)

optimize_api/poller.go:30-33 — single transient HTTP failure aborts the poller. Add a small bounded retry on 5xx/connection-reset.
opteval/state.go:46-86 — LoadEvalState swallows all gRPC errors (silently restarts job creation → double billing); SaveEvalState is non-atomic (partial failure mid-loop leaves mixed-vintage state). Log/wrap errors; consider batched SetValues if azdext supports one.
opteval/yaml.go:397-403 — if o.MaxIterations <= 0 { o.MaxIterations = 4 } overrides explicit max_iterations: 0. Distinguish unset from zero (*int) or document.
eval_api/eval_config.go:65-82 Validate — doesn't check cfg.Name or that cfg.Evaluators is non-empty; surface as actionable errors before the service rejects with an opaque one.
eval_api/artifacts.go:140-145 — DatasetArtifactPath drops the version segment that the doc comment promises (datasets/<name>/<version>/). Versioned downloads will overwrite each other on disk.
eval_api/artifacts.go:167-225 — SaveEvaluatorResult and WriteEvalReviewArtifacts both return void and discard all I/O errors. Return errors per repo convention.
optimize.go:140 --target short -s — collides with --session-id semantics elsewhere in the extension. -t is free.
eval_init.go:89 --out-file short -O and eval_init.go:84 --gen-instruction-file short -G — uppercase short flags are non-idiomatic. Use lowercase or drop.
eval_run.go:237 — hardcoded interval = 5*time.Second; maxAttempts = 360 (a 30-min budget). Other commands expose --poll-interval. Extract to constants and/or a flag.
optimize.go:140 --target — accepts arbitrary strings; help text says "instruction, skill". --status is correctly whitelisted (optimize_list.go:58-63); do the same here.
eval_show.go:135 — pagination message always reads "showing N of N" because both sides are the same value. Drop "of N" or fetch a total.
Error categorization — eval/optimize commands return plain fmt.Errorf for user-facing terminal failures while eval.go:178-183 already uses exterrors.Dependency. Convert eval run was canceled → exterrors.Cancelled, validation errors → exterrors.Validation with codes, etc.
optimize_deploy.go:261-268 — brittle substring matching on server error wording to detect reserved env vars. Use azcore.ResponseError.StatusCode + a stable API error code (add a TODO if the contract isn't there yet).
optimize.go:303-306 and optimize_prompts.go:278-298 — commented-out blocks with // TODO: re-enable when tools optimization is supported. Either feature-flag or delete + file an issue.
listen.go postdeploy hook — reportOptimizationDeployments is fire-and-forget with a broad recover(). Narrow the recover to the report call so genuine panics in surrounding deploy logic still surface.
service_target_agent.go:1521-1527 — unconditional "Set up an evaluation suite…" marketing tag pointing at another alpha command, appended to every hosted agent deploy with no opt-out, no telemetry/feature-flag, no test. Suggest gating on absence of eval.yaml in the service dir.
listen_test.go (+309 lines) — none of the new tests cover the +11-line behavioral change in listen.go (reportOptimizationDeployments call from postdeployHandler). The factory injection was added for this purpose — add at least one postdeployHandler-level test that exercises it with a fake OptimizeClient.
Test gaps — these production files have zero test coverage and contain non-trivial logic: cmd/eval_init_jobs.go (444 LOC, retry/resume/poll), cmd/optimize_prompts.go (446 LOC), cmd/eval_init_prompts.go (367 LOC), cmd/eval_update.go (245 LOC), pkg/agents/eval_api/artifacts.go (download/write), pkg/agents/eval_api/portal_urls.go (pure string building — trivially testable), pkg/agents/opteval/state.go (state persistence). No end-to-end CLI test drives runOptimize / runEvalInit through with a fake client injected.

🔵 LOW (selected)

optimize_api/client.go:16 — drop the netURL "net/url" alias.
eval_api/models.go:276,284 — type X = string aliases give no type safety; switch to non-alias type X string.
optimize_api/models.go:205 — rename TaskScore.Duration to DurationSeconds to match the JSON tag and JobProgress.ElapsedSeconds.
optimize_api/models.go:165-185 — JobError.UnmarshalJSON of explicit null produces JobError{Message:""} that looks like a real failure; short-circuit null.
dataset_api/models.go:142-167 — NextVersion("1.5") → "2.5" is surprising. Document or switch to integer-major bump.
eval_api/poller.go:158-163 — first poll sleeps before GetJob; cheap to invert.
eval_run.go:317-322 — agentVersionPtr can be new(version).
optimize.go:123 — MaximumNArgs(1) positional duplicates --agent; pick one mechanism.
optimize.go:564-572 — truncateString may duplicate an existing helper.
eval_init.go:91 — --reset-defaults is non-idiomatic; consider --force / --overwrite.
cspell additions needed — opteval, FAOS, frontmatter, Parseable, restype, cand. Add to the cspell dictionary so spellcheck stays green.

Process

Commit history (24+ commits) — more bugbash, more bug bash, more, fix, more fixes, check ui, review version, etc. Please squash-merge with a single conventional-commit message; do not land this history on main.
PR description does not mention the service_target_agent.go drive-by that injects an eval init suggestion into every hosted agent deploy. That's a UX policy decision worth calling out explicitly so reviewers don't miss it.
Rollout — please add a short "How users get this" section: which extension.yaml version, which registry PR follows, whether users need azd x install --source ... in the interim.

What's working well

The azd ai agent eval/optimize command shape is reasonable and feels cohesive with the existing extension.
Test scaffolding around the API clients is genuinely substantial — the structure is right; it just needs the bearer-policy bypass to actually run.
Credential handling is clean — DefaultAzureCredential / AzureDeveloperCLICredential via the SDK pipeline, no InsecureSkipVerify, no shell exec, no SSRF surface, file permissions correct (0600/0750).
opteval package layering is clean (lowest level, no API-package imports).
Properly uses Go 1.26 patterns (new(T), t.Context(), errors.Join) — no flags on those.

Happy to pair on any of the blockers — particularly the bearer-policy test bypass pattern, since optimize_api already has the right shape to copy.

wbreza · 2026-05-21T21:04:51Z

Quick delta from my prior review — the 3 new commits since 38ff8c23 are:

A merge from main (Prepare azure.ai.agents 0.1.32-preview patch release metadata #8227 brought in extension.yaml version bump, unrelated to this PR's eval/optimize surface)
cmd/root.go conflict resolution (correct — preserves both the new project/connection commands from main and the new eval/optimize commands from this PR)
A cli/azd/extensions/registry.json edit (schemaVersion field, Zyysurely/azure-dev → Azure/azure-dev URL fix, checksum updates for azure.ai.agents 0.1.31-preview)

None of the 8 🔴 CRITICAL or any 🟠 HIGH findings from my prior review have been addressed. Confirmed unchanged: eval_api/artifacts.go, optimize_apply.go, dataset_api/operations.go, dataset_api/models.go, optimize_api/poller.go, eval_progress.go, service_target_agent.go. go test -short ./internal/... still fails — 24 tests across dataset_api, eval_api, and internal/cmd (same pattern as before).

New ask: please drop the registry.json diff from this PR. Per cli/azd/extensions/azure.ai.agents/AGENTS.md, registry updates ship in a separate PR after the GitHub release is published. The diff in commit 14bf9aab looks like an accidentally-included publish artifact from your fork.

wbreza · 2026-05-21T22:35:06Z

Thanks — most critical/high issues addressed and tests are green on a clean checkout. Nice turnaround. Remaining items before approval:

Dataset.DataURI JSON tag still data_uri (snake) at dataset_api/models.go:28 while sibling DatasetCredential.DataURI uses dataUri. Please confirm the service contract or align casing — GetDataset will silently deserialize to empty if the service emits camelCase.
eval_api/poller.go transient-retry busy-loop: continue after isTransientError skips the interval sleep — will hammer the service on persistent 5xx until MaxAttempts. Move the continue after a time.Sleep(p.Options.Interval) or restructure with a select on the ticker.
SafePath on Windows (opteval/yaml.go): prefer filepath.Rel(baseDir, p) + reject results starting with .. over strings.HasPrefix — avoids drive-letter casing and short-name edge cases.
Extract isTransientError — duplicated verbatim across eval_api/poller.go and optimize_api/poller.go.
Default --eval-model drift still present: optimize.go:139 defaults to gpt-4.1-mini while optimize_config.go:95 defaults to gpt-4o. Pick one (extract to a constant) or document why they differ.

Non-blocking nits worth a follow-up issue if not addressed here:

optimize_api/client.go:268 GetCandidateConfig still returns any — consider json.RawMessage or a typed struct
eval_list.go fan-out is now better (default limit dropped to 10) but still unbounded — a semaphore (≈8–16) would prevent --limit 1000 from firing 1000 concurrent requests
eval.go Long text still mentions only init/run; subcommands update/list/show aren't discoverable from azd ai agent eval --help
eval_init.go:199 error message names only --gen-instruction while the condition checks 4 inputs

Status summary from my prior reviews: 7 of 8 🔴 CRITICAL fixed (only #8 remains), most 🟠 HIGH fixed, test suite now green (24 prior failures resolved). Big improvement.

Zyysurely added 24 commits May 14, 2026 01:18

[feat] support azd optimize and eval

9f071bc

more bugbash

0efde01

more bug bash

ab3eaa8

add options

b10de46

fix more

9c60cec

check ui

bea8f92

fix

f24e4b3

add azd ai agent eval init command

bbd23ec

server change of targetAttributes

1e26a02

fix for dataset generation

a9dbafe

more fixes

8582bce

more

ad7c5b9

more system prompt

79eb54b

remove fixed data

da028c1

candidate id update

95ad858

Update extension registry

2f0e462

Update extension registry

5f3e3bc

align the demo output format + eval update

55fa397

minor version

67a887b

new version

92ba4d9

reorganize

6ab3512

bug bash

7a4f134

deployment logic

96d268e

review version

38ff8c2

Copilot AI review requested due to automatic review settings May 21, 2026 18:51

Zyysurely requested review from JeffreyCA, hemarina, vhvb1989, wbreza and weikanglim as code owners May 21, 2026 18:51

Zyysurely requested review from rajeshkamal5050, tg-msft, therealjohn, trangevi and trrwilson as code owners May 21, 2026 18:51

Copilot started reviewing on behalf of Zyysurely May 21, 2026 18:51

microsoft-github-policy-service Bot assigned Zyysurely May 21, 2026

change back

14bf9aa

Copilot AI reviewed May 21, 2026

Merge branch 'main' into zyying/opteval_merge

83582a5

Zyysurely changed the title ~~[feat] Add support for azd ai agent eval/optimize~~ [feat] Add support for azd ai agent eval + optimize May 21, 2026

fix conflict

bf3292d

github-actions Bot added the ext-agents azure.ai.{agents,connections,inspector,projects,routines,skills,toolboxes} extensions label May 21, 2026

wbreza reviewed May 21, 2026

therealjohn approved these changes May 21, 2026

address comments

9609c29

yingyingzhaozyying added 5 commits May 21, 2026 17:43

Address more comments

928e96c

fix more

251f1d5

to align the latest option contract

eca2c29

more tune for tool optimization

49ed57d

fix spelling

c5ea305

Zyysurely wants to merge 33 commits into Azure:main from Zyysurely:zyying/opteval_merge
