The Microarchitecture of Tesla's Exa-Scale Computer
Emil Talpes, Douglas Williams, Debjit Das Sarma

What is DOJO?
- Tesla's in-house supercomputer for Machine Learning
- Highly scalable and fully flexible distributed system
  - Optimized for Neural Network training workloads
  - General-purpose system capable of adapting to new algorithms and applications
- Built from the ground up with large systems in mind
  - Not evolved from existing small systems

Anatomy of a distributed system
- Distributed systems are built as hierarchies of nesting boxes
  - CPU - Die - Module - Board - Rack - Cabinet - System
  - Integration gets looser as we move outward: lower bandwidth, higher latencies
- System is described by three models
  - Compute: architecture of the inner box
  - Communication: how data moves between boxes
  - Synchronization: how events get ordered across the entire system
- This talk describes our way of filling these boxes

High throughput, general purpose CPU
- DOJO nodes are full-fledged computers
  - Dedicated CPU, local memory, communication interface
- Superscalar, multi-threaded organization
  - Optimized for high-throughput math applications rather than control-heavy code
- Custom ISA optimized for ML kernels

Microarchitecture of the DOJO node
Processing pipeline
- 32B fetch window holding up to 8 instructions
- 8-wide decode handling 2 threads per cycle
- 4-wide scalar scheduler, 4-way SMT
  - 2 integer ALUs
  - 2 address units
  - Register file replicated per thread
- 2-wide vector scheduler, 4-way SMT
  - 64B wide SIMD unit
  - 8x8x4 matrix multiplication units
- SMT support focuses on single-threaded applications
  - No virtual memory, limited protection mechanisms, SW-managed sharing of resources
  - Typical application uses 1 or 2 compute threads and 1-2 communication threads
- 1.25MB SRAM per node
  - 400 GBps load, 270 GBps store
  - Gather engine, 8B and 16B granularity
- Load, store, load+execute from local memory
- Ex