Stanford University: 2025 ELEPHANT: A Full Analysis of "Social Sycophancy" in Large Language Models, Research Report (English Edition)

Preprint

ELEPHANT: MEASURING AND UNDERSTANDING SOCIAL SYCOPHANCY IN LLMS

Myra Cheng¹, Sunny Yu¹, Cinoo Lee¹, Pranav Khadpe², Lujain Ibrahim³, Dan Jurafsky¹
¹Stanford University, ²Carnegie Mellon University, ³University of Oxford
myra@cs.stanford.edu, syu03@stanford.edu

ABSTRACT

LLMs are known to exhibit sycophancy: agreeing with and flattering users, even at the cost of correctness. Prior work measures sycophancy only as direct agreement with users' explicitly stated beliefs that can be compared to a ground truth. This fails to capture broader forms of sycophancy such as affirming a user's self-image or other implicit beliefs. To address this gap, we introduce social sycophancy, characterizing sycophancy as excessive preservation of a user's face (their desired self-image), and present ELEPHANT, a benchmark for measuring social sycophancy in an LLM. Applying our benchmark to 11 models, we show that LLMs consistently exhibit high rates of social sycophancy: on average, they preserve users' face 45 percentage points more than humans in general advice queries and in queries describing clear user wrongdoing (from Reddit's r/AmITheAsshole). Furthermore, when prompted with perspectives from either side of a moral conflict, LLMs affirm both sides (depending on whichever side the user adopts) in 48% of cases, telling both the at-fault party and the wronged party that they are not wrong, rather than adhering to a consistent moral or value judgment. We further show that social sycophancy is rewarded in preference datasets, and that while existing mitigation strategies for sycophancy are limited in effectiveness, model-based steering shows promise for mitigating these behaviors. Our work provides theoretical grounding and an empirical benchmark for understanding and addressing sycophancy in the open-ended contexts that characterize the vast majority of
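The two headline figures in the abstract (a gap in percentage points between model and human face preservation, and the share of conflicts in which both sides are affirmed) can be illustrated with a small sketch. This is not the benchmark's actual implementation: the function names and the toy labels below are hypothetical, and it assumes each response has already been given a binary label by some annotation step that is not shown here.

```python
from typing import Sequence, Tuple


def rate(labels: Sequence[bool]) -> float:
    """Fraction of responses labeled True (e.g. face-preserving)."""
    if not labels:
        raise ValueError("labels must be non-empty")
    return sum(labels) / len(labels)


def percentage_point_gap(model_labels: Sequence[bool],
                         human_labels: Sequence[bool]) -> float:
    """Model rate minus human rate, expressed in percentage points."""
    return 100.0 * (rate(model_labels) - rate(human_labels))


def both_sides_rate(pairs: Sequence[Tuple[bool, bool]]) -> float:
    """Share of conflicts where the model affirms both perspectives.

    Each pair holds two hypothetical flags for one conflict:
    (affirmed when told side A's story, affirmed when told side B's story).
    """
    if not pairs:
        raise ValueError("pairs must be non-empty")
    return sum(1 for a, b in pairs if a and b) / len(pairs)


# Toy data, invented purely for illustration.
model_labels = [True, True, True, False]    # 75% face-preserving
human_labels = [True, False, False, False]  # 25% face-preserving
conflict_pairs = [(True, True), (True, False), (False, False), (True, True)]

gap = percentage_point_gap(model_labels, human_labels)
both = both_sides_rate(conflict_pairs)
print(f"gap: {gap:.1f} percentage points, both sides affirmed: {both:.0%}")
```

A percentage-point gap is a difference of two rates, not a ratio, so a reading of 45 points means the model's rate sits 45 points above the human rate on the same scale, whatever the human baseline happens to be.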

Stanford University: 2025 ELEPHANT: A Full Analysis of "Social Sycophancy" in Large Language Models, Research Report (English Edition) (34 pages).pdf
