Public

Unleashing Llama's Potential: CPU-Based Fine-Tuning
QCon SFO 2024, by Rema Hariharan and Anil Rajput

*All third-party product, company names and logos are trademarks or registered trademarks and remain the property of their respective holders. Use of them does not imply any affiliation with or endorsement by them.

[Figure: QCon SFO Topic. Labels: 2018, 2019, 2024; CPU, GPU; Java, LLM; CPU]

Survey
Background in CPU architecture

Optimal performance is a tango between software and the HW platform
HW Platform
Software
Synchronization
Focus of this talk
01 Hardware focused platform features
02 Software: Llama, Workloads, Models, Metrics, Characterization, Deployments etc.
03 Synchronization: Optimization, Tunings, Deployment Recommendations for optimal performance

Hardware Platform features
NOT the focus of this talk: GPU based platform, CPU+GPU based platform
Focus: CPU Based Inference

Hardware Platform features
CPUs
Cores
SMT (Simultaneous MultiThreading)
Caches
AMD EPYC Chiplet Architecture vs. Unified L3
Memory Capacity and Bandwidth

Hardware Platform features
[Figure: single-socket and dual-socket systems. Each CPU is drawn with its cores, an L3 cache (labeled 4MB, 512 MB), and memory, I/O, NIC controllers etc., attached to DDR memory. Each core is drawn with SMT 0 and SMT 1, L1 I 32KB, L1 Data 32-64KB, and L2 I+D 512KB-2MB.]

L3 Cache: Unified vs Chiplet

Dual Socket System: 12 memory channels
Socket 0 (8 CCDs), xGMI
Example: Memory bandwidth 400 Gbps (Total), 40-60 Gbps (each CCD)
CCD 1, CCD 2, CCD 3, CCD 4, CCD 5, CCD 6, CCD 7, CCD 8

Dual Socket System: NPS4
Socket 0 (