Update from the LLM scaling laws frontier.pptx

Update from the LLM scaling laws frontier
Jason Clinton, CISO
April 2025
Leading intelligence increases and cybersecurity implications

Our perspective: Research lab, Think tank, Startup

Benchmarks double-click
Benchmarks: Graduate-level reasoning (GPQA Diamond), Agentic coding (SWE-bench Verified), Agentic tool use (TAU-bench), Multilingual Q&A (MMMLU), Visual reasoning (MMMU, validation)
OpenAI o3 (high), Gemini 2.5 Pro: 83.3%, 82.9%, 81.7%, 63.8%, 69.1%, 84.0%, Retail 70.4%, Airline 52%
Claude 3.7 Sonnet (64K extended thinking), Claude 3.7 Sonnet (no extended thinking): 68.0%, 83.2%, 86.1%, 75%, 71.8%, Retail 81.2%, Airline 58.4%, 78.2%/84.8%, 62.3%/70.3%

Claude models lead on honesty
Anthropic models secure all top positions on the MASK leaderboard*, a benchmark designed to measure AI honesty when pressured to make false statements. Anthropic's models demonstrate superior alignment with facts under pressure, setting the standard for trustworthy AI.
*MASK (Model Alignment between Statements and Knowledge) evaluates models' resistance to providing false information, even when prompted to do so.

MASK Leaderboard: measures model honesty under pressure to lie
Claude 3.7 Sonnet with thinking: 82.13 ± 1.25
Claude 3 Opus: 79 ± 1.31
Claude 3.5 Sonnet: 72.33 ± 2.45
o1-Pro: 61.60 ± 0.86
GPT-4o: 60.0 ± 2.07
GPT-4.5 Preview: 56.93 ± 4.02
DeepSeek R1: 57.32 ± 2.58
Gemini 2.5 Pro Experimental: 55.93 ± 3.49

What is Claude's role in our lives?
2024: Claude assists. Claude helps individuals do their current work better, making each person the best version of themselves.
2025: Claude collaborates. Claude does hours of independent work for you, on par with experts, expanding what every person or team is capable of.
2027: Claude pioneers. Claude finds breakthrough solutions to challenging problems that would have taken teams years to achieve.

Agents are AI syste
