Gautam Nayak, Soumya Padmanabha

Manufacturing of AI Systems: Comprehensive Testing Approach

Agenda
- Background and challenges with AI manufacturing testing
- Methodology and key test metrics
- Experimental results
- Conclusions and future work

Introduction: Importance of AI Hardware Testing in Manufacturing
- Testing is vital: a testing methodology ensures reliability and performance.
- Testing often targets individual components, ignoring the end-product goals and the holistic approach to testing needed to meet those goals.
- The key to AI hardware testing is exercising the entire hardware solution under complex AI workloads, taking into account full-stack verification of data integrity under workloads, performance variations, and system robustness in handling errors.

Traditional vs AI System Testing
Traditional systems
- Standardized hardware
- Predictable workloads
AI systems
- Heterogeneous components (GPUs, ASICs, custom accelerators)
- Dynamic workloads after deployment; computation- and memory-intensive workloads
- Need for deep software-hardware co-validation
Anomalies go unnoticed in traditional AI system testing.

Overview of the Scalable Test Methodology for Quality
- A comprehensive testing strategy must consider the hardware solution as a whole.
- It should specifically take into account the hardware's ability to support large AI workloads.
- Hierarchical, multi-level testing approach:
  - Component level
  - Server level
  - Rack level
  - Multi-rack level
- Continuous monitoring and feedback integration

[Figure: Hardware at different levels. Source: Engineering at Meta, 2024]

Testing Overview (Source: S. Padmanabha et al., 2025)

- Standalone testing of individual accelerators (GPU, ASIC)
- Evaluate individual performance
- Compare performance variations between all the parts under test
- Establish a baseline for future performance comparisons
Key actions
- Functional validation with synthetic and real workloads
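
The component-level idea above (compare performance variation across all parts under test and establish a baseline) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the slides: the metric (a throughput number per part), the part names, the use of the median as the baseline, and the 5% tolerance.

```python
from statistics import median


def flag_outliers(throughput_by_part, tolerance=0.05):
    """Return (baseline, outliers) for a set of per-part measurements.

    throughput_by_part: dict mapping a part ID to a measured throughput.
    tolerance: allowed relative deviation from the baseline (hypothetical 5%).
    The baseline is the median across parts, which is an illustrative choice.
    """
    if not throughput_by_part:
        raise ValueError("no measurements supplied")

    baseline = median(throughput_by_part.values())
    if baseline <= 0:
        raise ValueError("baseline must be positive")

    # Keep only the parts whose relative deviation exceeds the tolerance.
    outliers = {
        part: value
        for part, value in throughput_by_part.items()
        if abs(value - baseline) / baseline > tolerance
    }
    return baseline, outliers


# Hypothetical measurements for four accelerators under the same workload.
measurements = {"gpu0": 100.0, "gpu1": 101.0, "gpu2": 99.0, "gpu3": 88.0}
baseline, outliers = flag_outliers(measurements)
print(baseline, outliers)
```

The stored baseline could then be reused to compare later runs of the same parts; a production flow would likely use more robust statistics and per-workload thresholds.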