Mirage of Benchmarks: Foundations of LLM Evaluation Reliability.pdf



2025-12-21

PDF, 37 pages, 568.64 KB


INV513
Mirage of Benchmarks: Foundations of LLM Evaluation Reliability
Morteza Ziyadi (he/him), Applied Science Manager
Swastik Roy (he/him), Sr. Applied Scientist, AGI

Agenda
- The Evaluation Landscape
- The Challenges of LLM Evaluation
- Interactive Session / Q&A
- Ingredients of Stronger Evaluations Part 1
- Interactive Session / Q&A
- Ingredients of Stronger Evaluations Part 2
- Interactive Session / Q&A

SESSION ACTIVITY
Submit your questions throughout the talk. Scan the QR code to submit questions.

Evaluation Landscape
Because knowing where we stand reveals where we must go.

ACT 1: LLM Evaluation Landscape
Timeline: 2012-2020, 2021, 2022, 2023, 2024, 2025, 2026 & Beyond

Evaluation categories:
- Old NLP (task-specific): Traditional NLP capabilities like language understanding
- Static Knowledge/Skills: Fixed datasets testing factual knowledge and reasoning abilities
- Static (Generation/Grading): Standardized tests evaluating text generation quality and logical reasoning
- Dynamic Benchmarks: Continuously updated evaluations that adapt to prevent overfitting
- Agentic Tasks: Interactive evaluations testing autonomous decision-making
- Real-World Tasks: Practical evaluations using tasks from real applications
- Rubric-Grounded Judgement: Structured evaluation using explicit criteria
- Arena Leaderboards: Competitive platforms for head-to-head comparisons

Benchmarks plotted on the timeline: GLUE, GDPVal, HELM, SuperGLUE, MMLU, LMArena, Yupp, IMO, 30hr coding, Tau Bench, CS-QA, LAMBADA, AIME, APEX, Humanity's Last Exam, ?, GPQA, MATH, GSM8K, HumanEval, AlpacaEval, Arena Hard, MT-Bench, DyVal, DyVal 2, SCAN, DyCodeEval, MMLU Pro, GAIA, SWE-Bench, AgentBench, SimpleQA, StrongREJECT, LiveBench, BIG-bench, EQ-Bench, SQuAD, Hellaswag, Piqa, Winograd, DROP

A Typical Evaluation Story
I evaluated our model on a coding benchmark I created (say 500 prompts).
Accuracy: 85.2%
PREMISE / What I was hoping to say / What I ended up with / Manager:

The Challenges of LLM Evaluation
Because understanding is the first step to
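The evaluation story above reports a single figure: 85.2% accuracy on 500 prompts. As a hedged illustration of why one number can overstate certainty, here is a minimal sketch that attaches a 95% Wilson score interval to that result. The Wilson interval is our choice for illustration and is not stated in the slides; the helper name `wilson_interval` is hypothetical, and the sketch assumes each prompt is an independent pass/fail trial.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# 85.2% of 500 prompts corresponds to 426 correct answers.
# Assumption: prompts are independent and scored as pass/fail.
low, high = wilson_interval(426, 500)
print(f"accuracy = {426 / 500:.1%}, 95% interval = [{low:.1%}, {high:.1%}]")
```

Under these assumptions the interval spans roughly 82% to 88%, so two models a couple of points apart on a 500-prompt benchmark may not be distinguishable.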

