Super Cluster Builder: A Tool for Topology Design and Visualization, Complexity Analysis, TCO Estimation, and Performance Assessment for xPU Superclusters

Siamak Tavallaei, Sr. Principal Engineer, Samsung Semiconductor
Ardavan Sherafat, AI/ML Researcher, Cal Poly, Pomona

SERVER: AI HW/SW CO-DESIGN/NIC/HPC

Motivation

AI workloads are scaling to hundreds of thousands or even millions of GPUs.
Designing infrastructure at this scale is a significant challenge requiring careful planning.
Interconnect networks become as critical as xPUs for performance and scalability.
Traditional clusters can no longer meet the demands of cutting-edge AI models.
Poorly designed systems suffer from latency, bandwidth inefficiencies, and physical complexity.
Structured evaluation tools are essential to analyze performance and cost before deployment.
Super Cluster Builder addresses these challenges with a systematic design approach.

What is Super Cluster Builder?

A first-order analysis tool and methodology for design-space exploration (DSE) by AI cluster architects.
Models large-scale interconnect topologies and their complexity.
Helps visualize the complexity trade-offs to boost productivity ten-fold!
Links design choices directly to performance and cost outcomes.
Focuses on informing design decisions, not on cycle-accurate simulation or workload execution.
Enables comparison of multiple infrastructure designs.
Machine description, datapaths, profiling, bottlenecks, optimizations, test-matrix evaluation.
Helps optimize for throughput, latency, fault tolerance, and cost (VoT: value gained for required TCO).

Super Cluster

Vision: 1M xPUs
Practical request from programmers: 1000 xPUs
Stretch goal for now: 4K xPUs
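
To make the idea of first-order complexity analysis concrete, here is a minimal sketch of the kind of estimate such a tool might produce. It is not taken from Super Cluster Builder itself. It assumes a non-blocking folded-Clos (fat-tree) fabric built from identical switches, and counts tiers, switches, and links for a given number of xPU endpoints. The function name `estimate_folded_clos`, the `FabricEstimate` record, and the switch radix of 64 are all hypothetical choices made for illustration, and 4K is read as 4096.

```python
from dataclasses import dataclass


@dataclass
class FabricEstimate:
    endpoints: int           # number of xPU-facing ports required
    radix: int               # ports per switch (assumed identical switches)
    tiers: int               # switch tiers in the folded Clos
    switches: int            # total switch count, first-order
    endpoint_links: int      # xPU-to-leaf links
    inter_switch_links: int  # links between adjacent switch tiers


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return (a + b - 1) // b


def estimate_folded_clos(endpoints: int, radix: int) -> FabricEstimate:
    """First-order size of a non-blocking folded Clos (hypothetical helper)."""
    if endpoints < 1 or radix < 4 or radix % 2:
        raise ValueError("need endpoints >= 1 and an even radix >= 4")
    half = radix // 2

    # An L-tier non-blocking folded Clos reaches at most 2 * (radix/2)**L
    # endpoints; pick the smallest L that fits.
    tiers = 1
    while 2 * half ** tiers < endpoints:
        tiers += 1

    # Every tier below the top splits its ports half down, half up;
    # the top tier uses all of its ports facing down.
    lower_switches = (tiers - 1) * ceil_div(endpoints, half)
    top_switches = ceil_div(endpoints, radix)

    return FabricEstimate(
        endpoints=endpoints,
        radix=radix,
        tiers=tiers,
        switches=lower_switches + top_switches,
        endpoint_links=endpoints,
        inter_switch_links=(tiers - 1) * endpoints,
    )


if __name__ == "__main__":
    # Cluster sizes from the slide; radix 64 is an assumed value.
    for n in (1000, 4096, 1_000_000):
        print(estimate_folded_clos(n, 64))
```

Under these assumptions, 1000 endpoints fit in two tiers, while 1M endpoints need four, with three inter-switch links for every endpoint. A real design study would add oversubscription, rail structure, cabling reach, and per-unit cost inputs, none of which this sketch models.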