
OpenAI: Weight-sparse transformers have interpretable circuits, 2025 research report (English edition, 31 pages).pdf



Weight-sparse transformers have interpretable circuits

Leo Gao, Achyuta Rajaram, Jacob Coxon, Soham V. Govande, Bowen Baker, Dan Mossing

Abstract

Finding human-understandable circuits in language models is a central goal of the field of mechanistic interpretability. We train models to have more understandable circuits by constraining most of their weights to be zeros, so that each neuron only has a few connections. To recover fine-grained circuits underlying each of several hand-crafted tasks, we prune the models to isolate the part responsible for the task. These circuits often contain neurons and residual channels that correspond to natural concepts, with a small number of straightforwardly interpretable connections between them. We study how these models scale and find that making weights sparser trades off capability for interpretability, and scaling model size improves the capability-interpretability frontier. However, scaling sparse models beyond tens of millions of nonzero parameters while preserving interpretability remains a challenge. In addition to training weight-sparse models de novo, we show preliminary results suggesting our method can also be adapted to explain existing dense models. Our work produces circuits that achieve an unprecedented level of human understandability and validates them with considerable rigor.

1. Introduction

While neural networks, such as large language models, have rapidly increased in capability in recent years, we still understand very little about how they work. Mechanistic interpretability seeks to reverse engineer neural networks and fully understand the algorithms they implement internally. A major difficulty for interpreting transformers is that the activations and weights are not directly comprehensible; for example, neurons activate in unpredictable patterns that don…

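Based on the article, the main content of the full text can be summarized as follows:

1. **Research goal**: Find human-understandable computational circuits in language models and improve model interpretability.
2. **Method**: Train weight-sparse Transformer models, constraining most weights to zero so that each neuron has only a few connections.
3. **Results**:
   - The task-specific circuits learned by sparse models are simpler than those of dense models, with circuits 16 times smaller on average.
   - Neuron activations in sparse models usually correspond to simple concepts, and the weights encode intuitive connections between concepts.
   - A bridging technique can connect a sparse model to a dense model in order to explain the behavior of existing dense models.
4. **Challenges**: Sparse models are less efficient to train and deploy, and they struggle to reach the capability level of dense models.
5. **Future work**: Explore scalable methods for creating interpretable models, and study the use of sparse circuits in automated interpretability.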
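The weight constraint described in the method can be illustrated with a minimal sketch. It assumes a per-neuron top-k rule (keep the k largest-magnitude incoming weights in each row and zero the rest); the source only states that most weights are constrained to zero, so the rule, the helper names `sparsify_rows` and `nonzero_fraction`, and the toy matrix `W` are all hypothetical choices for illustration.

```python
def sparsify_rows(weights, k):
    """Keep the k largest-magnitude entries in each row and zero the rest.

    Assumption: each row holds the incoming weights of one neuron, so a
    neuron ends up with at most k nonzero connections.
    """
    sparse = []
    for row in weights:
        # Indices of the k entries with the largest absolute value.
        order = sorted(range(len(row)), key=lambda i: abs(row[i]), reverse=True)
        keep = set(order[:k])
        sparse.append([w if i in keep else 0.0 for i, w in enumerate(row)])
    return sparse


def nonzero_fraction(weights):
    """Fraction of entries that are nonzero across the whole matrix."""
    total = sum(len(row) for row in weights)
    nonzero = sum(1 for row in weights for w in row if w != 0.0)
    return nonzero / total


# Toy 3x4 weight matrix (illustrative values only).
W = [
    [0.9, -0.05, 0.02, -1.2],
    [0.01, 0.3, -0.7, 0.04],
    [-0.6, 0.08, 0.5, -0.03],
]

W_sparse = sparsify_rows(W, k=2)
for row in W_sparse:
    print(row)
print("nonzero fraction:", nonzero_fraction(W_sparse))
```

In a training setting, a projection like this would typically be reapplied after each optimizer step so the weights stay sparse throughout; whether the report does exactly that is not stated in the excerpt above.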