DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation

Seongmin Hong (1), Seungjae Moon (1), Junsoo Kim (1), Sungjae Lee (2), Minsub Kim (2), Dongsoo Lee (2), and Joo-Young Kim (1)
(1) CastLab, School of EE, KAIST; (2) NAVER CLOVA
HOTCHIPS22 Poster Session

Abstract
- DFX is a multi-FPGA appliance that accelerates transformer-based text generation
- DFX adopts model parallelism to efficiently process the large-scale language model
- Xilinx Alveo U280 data center accelerator card provides high performance with low cost
- FPGA-to-FPGA communication is enabled by QSFP cable at 100 Gb/s

Motivation

Transformer-based Text Generation
- Text generation
  - Automatic generation of human-readable text by a computer
  - Example: dialogue system, topic-to-essay generation, and code generation
- Generative Pre-trained Transformer (GPT)
  - State-of-the-art model in natural language processing that scales up to 175B parameters
  - High-quality text generation and remarkable inference accuracy for benchmarks (e.g., 86.4% for LAMBADA)

[Figure: GPT text generation example. Input tokens "Hello, my name is" enter the language model (GPT, a stack of decoder layers) in the summarization stage; output tokens "James Smith and ..." come out of the language model in the generation stage.]

Challenges of Transformer-based Text Generation
1) System bottleneck in the generation stage due to its sequential characteristic
2) Massive model parameters and computational requirements
3) Lack of deployable hardware with end-to-end capability for GPT inference in datacenters

Every operation matters for acceleration!

DFX Architecture

DFX Appliance Architecture
- Multi-FPGA appliance for the acceleration of text generation
- Intra-