Large Language Models for Code: Exploring the Current State, Opportunities, and Challenges (PDF, 57 pages)
Large Language Models for Code
Loubna Ben Allal, Machine Learning Engineer, Science team

About me
- ML Engineer at Hugging Face
- Graduated from ENS Paris Saclay & Ecole des Mines de Nancy
- Working on LLMs for code & synthetic data: "The Stack, StarCoder, Cosmopedia."
LoubnaBenAllal1
https://loubnabnl.github.io/

How it started: GitHub Copilot in 2021
ML + Code = Productivity https:/
- Engines + ML lead to a 6% reduction in code iterations
- 3% of code generated by the model
But:
- API:
- Model: X
- Data: X
- Code: X

How it's going: over 1.7k open models trained on code

How did we get here?
Strong instruction-tuned and base models

How are code LLMs trained?

What you need to train (code) LLMs from scratch
- Transformer Model

Training Generative AI Models
Untrained Model -> Pretrained "Base" Model -> Supervised Finetuned (SFT) Model -> RLHF -> Chat LLM (e.g. GPT-4)

Training Code LLMs
Same stages, with an instruction dataset for code: "write a function", "solve a bug".

The Landscape of code LLMs
- The Stack dataset
- StarCoder
- StarCoder2: 3B, 7B, 15B sizes
- StarChat2 (with H4 team)
- DeepSeek-Coder: 1B, 7B, 33B
- DeepSeek-Coder-Instruct
- CodeLlama: 7B, 13B, 70B
- CodeLlama-Instruct
- Others: StableCode from StabilityAI, CodeGen from Salesforce & LLMs like Mixtral, DBRX, Qwen & Yi

BigCode: open-scientific collaboration
We are building LLMs for code in a collaborative way:
- Full data transparency
- Open source processing and training code
- Model weights released with a commercially friendly license
1100+ researchers, engineers, lawyers, and policy makers

Closed Source
- Training data & sources not disclosed
- Model weights not public
- Sending data to external APIs
- Not reproducible

Open Source
- Public data with inspection and opt-out tools
- Model weights