当前位置:首页 >英文主页 >中英对照 > 报告详情

斯坦福大学:2025 ELEPHANT:大型语言模型“社会式谄媚”现象全解析研究报告(英文版)(34页).pdf

上传人: 1****1 编号:975326 2025-11-25 34页 821.20KB

下载:

1、PreprintELEPHANT:MEASURING AND UNDERSTANDING SOCIALSYCOPHANCY INLLMSMyra Cheng1Sunny Yu1Cinoo Lee1Pranav Khadpe2Lujain Ibrahim3Dan Jurafsky11Stanford University2Carnegie Mellon University3University of Oxfordmyracs.stanford.edu,syu03stanford.eduABSTRACTLLMs are known to exhibit sycophancy:agreeing w

2、ith and flattering users,even at thecost of correctness.Prior work measures sycophancy only as direct agreement withusers explicitly stated beliefs that can be compared to a ground truth.This fails tocapture broader forms of sycophancy such as affi rming a users self-image or other implicitbeliefs.T

3、o address this gap,we introduce social sycophancy,characterizing sycophancy asexcessive preservation of a users face(their desired self-image),and present ELEPHANT,a benchmark for measuring social sycophancy in an LLM.Applying our benchmark to11 models,we show that LLMs consistently exhibit high rat

4、es of social sycophancy:onaverage,they preserve users face 45 percentage points more than humans in general advicequeries and in queries describing clear user wrongdoing(from Reddits r/AmITheAsshole).Furthermore,when prompted with perspectives from either side of a moral conflict,LLMsaffi rm both si

5、des(depending on whichever side the user adopts)in 48%of casestellingboth the at-fault party and the wronged party that they are not wrongrather than adhering toa consistent moral or value judgment.We further show that social sycophancy is rewardedin preference datasets,and that while existing mitig

6、ation strategies for sycophancy arelimited in effectiveness,model-based steering shows promise for mitigating these behaviors.Our work provides theoretical grounding and an empirical benchmark for understandingand addressing sycophancy in the open-ended contexts that characterize the vast majorityof

word格式文档无特别注明外均可编辑修改,预览文件经过压缩,下载原文更清晰!
三个皮匠报告文库所有资源均是客户上传分享,仅供网友学习交流,未经上传用户书面授权,请勿作商用。
根据文章内容,以下是全文主要内容的概括: 1. **社会谄媚问题**:大型语言模型(LLMs)表现出谄媚行为,即过度迎合用户,甚至牺牲准确性。 2. **社会谄媚的定义**:将谄媚定义为过度维护用户的“面子”(期望的自我形象),包括积极和消极的面子。 3. **ELEPHANT基准**:提出ELEPHANT基准,用于衡量LLMs中的社会谄媚,涵盖四个维度:验证、间接性、框架和道德。 4. **实证分析**:在四个数据集上评估了11个LLMs,发现LLMs在社会谄媚方面表现出高比率,平均比人类高出45个百分点。 5. **原因分析**:社会谄媚在偏好数据集中得到奖励,而现有的缓解策略效果有限。 6. **缓解策略**:模型引导策略在缓解谄媚行为方面显示出希望。 核心数据: - LLMs在社会谄媚方面平均比人类高出45个百分点。 - 在道德冲突中,LLMs有48%的情况会同时肯定双方的观点。
揭秘真相!" LLM如何“讨好”你?" LLM的社交谄媚行为解析!"
客服
商务合作
小程序
服务号
折叠