© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.

AIM393
AI Evaluation: From model testing to production monitoring

Jesse Manders (He/Him), Sr. Product Manager - Technology, AWS
Sandeep Singh (He/Him), Sr. GenAI Data Scientist, AWS

Agenda
  What is evaluation: Why & challenges
  Amazon Bedrock Evaluations
  Demo: LLM-as-a-Judge
  RAG Evaluation
  Demo: RAG Evaluation
  Wrap up

What is evaluation?

Why is evaluation important?
  Make quality, cost, and latency tradeoffs
  Align to your company's style and brand voice
  Evaluate for your specific use cases
  Evaluate with your company's data
  Monitor biases, safety, and trust

Evaluation lifecycle and challenges
  Model hub
  Metrics and algos
  Find datasets
  Spin up infrastructure
  Human judgment
  Record results, synthesize insights
  Weigh tradeoffs
  Can take weeks
  Repeat for new apps and models

Amazon Bedrock Evaluations

Bedrock model and RAG evaluations
  Model evaluation: Evaluate, compare, and select the best foundation model for your use case
  Use curated datasets or bring your own for tailored results
  Use automatic (algos or LLMs) or human evaluation methods
  Evaluate any model/RAG system/app hosted anywhere
  Evaluate models or RAG retrieval alone or retrieval + generation
  Compare across multiple evaluation jobs
  Built-in metrics for quality and responsible AI
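
As a rough illustration of comparing results across evaluation jobs while weighing quality, cost, and latency, the sketch below ranks candidates by a weighted score. It is not the Bedrock Evaluations API: the JobResult type, the rank_jobs helper, the default weights, and every number are hypothetical placeholders chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobResult:
    """Summary of one evaluation job (hypothetical structure, not a Bedrock type)."""
    name: str
    quality: float     # metric or judge score in [0, 1], higher is better
    cost_usd: float    # cost per 1,000 requests, lower is better
    latency_ms: float  # median latency, lower is better


def _normalize(value, lo, hi, higher_is_better):
    """Min-max scale to [0, 1]; returns 1.0 when all candidates tie."""
    if hi == lo:
        return 1.0
    scaled = (value - lo) / (hi - lo)
    return scaled if higher_is_better else 1.0 - scaled


def rank_jobs(results, w_quality=0.6, w_cost=0.2, w_latency=0.2):
    """Return (name, score) pairs sorted from best to worst weighted score."""
    if not results:
        return []
    q = [r.quality for r in results]
    c = [r.cost_usd for r in results]
    t = [r.latency_ms for r in results]
    scored = []
    for r in results:
        score = (
            w_quality * _normalize(r.quality, min(q), max(q), True)
            + w_cost * _normalize(r.cost_usd, min(c), max(c), False)
            + w_latency * _normalize(r.latency_ms, min(t), max(t), False)
        )
        scored.append((r.name, round(score, 3)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Placeholder numbers for illustration only, not measured results.
jobs = [
    JobResult("model-a", quality=0.90, cost_usd=12.0, latency_ms=900.0),
    JobResult("model-b", quality=0.80, cost_usd=3.0, latency_ms=400.0),
]
ranking = rank_jobs(jobs)
```

The weights encode how much each dimension matters for a given use case. Shifting weight from quality toward cost and latency can reverse the ranking, which is why the tradeoff is worth making explicit.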