IBM TechXchange 2025
Orlando, FL, October 6-9
Session 2764

Evaluating Agents at Scale from Development to Optimization
Evan Rivera, Senior Product Manager, IBM

IBM TechXchange | 2025 IBM Corporation

What you will learn in this session
01 Why static tests fall short
02 A lifecycle approach to evaluation
03 How to run structured experiments
04 Value of monitoring agents in production
05 Steps to scale with confidence

The Problem with Static Evaluation
- Agents face dynamic, open-ended tasks
- Static benchmarks miss real-world behavior
- Black-box decisions erode trust
- Missing feedback loops cause drift and failure

Traditional ML
- Fixed datasets
- Static tests
- Accuracy is the end metric

Agents
- Dynamic environments
- Evolving tasks
- Missing feedback

Lifecycle Evaluation Framework
[Diagram. Stages: Development, Deployment, Monitoring, Optimization. Labels: Experiment & Iteration, Tracing & Insights, Evaluation & Insights, Continuous Improvement.]

Demo 1: Agent Experimentation with watsonx.governance

From Dev to Prod: The Missing Link
- Success in development doesn't guarantee success in production
- Real-world environments introduce shifts, anomalies, and unexpected behavior
- Without continuous monitoring, issues go unseen until it's too late
[Diagram. Stages: Development, Deployment, Monitoring.]

Demo 2: Agent Monitoring with watsonx.governance

Agentic AI feature highlights (watsonx.governance)
Providing the tools and capabilities to develop, deploy, manage, evaluate, and govern AI agents
- Agentic tool catalog
- Agent evaluation studio
- Agent evaluators
- Agent production monitoring
Please note: roadmap items are subject to change

1 Make evaluation continuous, not one-time
2 Capture rich traces of agent decisions
3 Measure across multiple dimensions: accuracy