Fight fire with fire: AI-assisted test/debug flow and log analysis for AI GPU systems
Tommy Yan, GPU Project Lead, Microsoft Azure
Anna Mary Mathew, Director, Microsoft Azure
TEST & VALIDATION

Problem statement
AI infrastructure scaling and the introduction of new technologies create a unique test validation framework in which massive amounts of validation data are generated for post-processing.
Validation data uses heterogeneous formats.
Debugging with massive data is becoming even more complex.
A few of the key areas of debug are:
o Rack-level connectivity issues
o Worst-case power envelope scenarios
o Performance variation at cluster level

AI-assisted System Test/Debug Flow and Log Analysis
Interested Logs: File patterns to scan (e.g., *BMCSELListDetail*.csv).
Error Match Type
ERROR: Flag a log line if it contains any error_text keyword, excluding whitelist_text.
Match Text: Keywords that indicate a problem (error, fail, critical, ...).
Whitelist Text: Known safe or irrelevant phrases to ignore (non-critical, Correctable error, ...).
PASS: All pass_text keywords must be present in each log entry.
Stop-on-Fail Flag: Halt the test flow on detection if true.
Define Error Signature File
For interested logs:
Pre-Search Treatment
07 00 ca 24 c2 96 68 37 01 00 02 02 10 00 ff ff
RecordName: DramTest Error OEM Event
EvtD 1: 1st Error ID (DimmDtrResult): DTR_STATUS_NO_FAILURE
EvtD 2: 2nd Error ID (Fail count): 255
Log Search
Known Good Log Compare
Result Categorization
o Group by message patterns
o Separate matched vs missing signatures
Result De-duplication
o Fuzzy matching (80% threshold)
o Merge similar error signatures
Matched Result Post-Search Treatment
Result Analysis
Stop-on-fail as behavior
Triggered as ne
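
The signature rules and the de-duplication step listed above can be sketched in Python roughly as follows. This is a minimal illustration, not the presenters' implementation: the slides do not show the signature file format or name the fuzzy-matching algorithm, so the `SIGNATURE` dictionary layout, the function names `is_interested`, `is_error` and `dedupe`, the choice to strip whitelisted phrases before keyword matching, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for the example.

```python
import re
from difflib import SequenceMatcher
from fnmatch import fnmatch

# Hypothetical signature entry. Field names follow the slide's table,
# but the real file format is not shown in the source.
SIGNATURE = {
    "interested_logs": ["*BMCSELListDetail*.csv"],
    "error_text": ["error", "fail", "critical"],
    "whitelist_text": ["non-critical", "Correctable error"],
    "stop_on_fail": True,
}


def is_interested(filename, signature):
    """Return True if the file name matches any 'Interested Logs' pattern."""
    return any(fnmatch(filename, pat) for pat in signature["interested_logs"])


def is_error(line, signature):
    """ERROR rule: flag a line that contains any error_text keyword,
    excluding whitelisted phrases."""
    text = line
    # Assumption: whitelisted phrases are removed as whole phrases before
    # keyword matching, so "Uncorrectable error" is still flagged.
    for phrase in signature["whitelist_text"]:
        pattern = r"\b" + re.escape(phrase) + r"\b"
        text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    text = text.lower()
    return any(keyword.lower() in text for keyword in signature["error_text"])


def dedupe(messages, threshold=0.8):
    """Merge messages whose similarity to a group's first message is at
    least `threshold` (0.8 mirrors the 80% threshold on the slide)."""
    groups = []
    for message in messages:
        for group in groups:
            ratio = SequenceMatcher(None, group["representative"], message).ratio()
            if ratio >= threshold:
                group["count"] += 1
                break
        else:
            # No existing group is similar enough, so start a new one.
            groups.append({"representative": message, "count": 1})
    return groups
```

In a flow like the one outlined, `is_interested` would select the files to scan, `is_error` would run per log line, and `dedupe` would collapse near-identical hits before result analysis. A caller could check `SIGNATURE["stop_on_fail"]` after the first flagged line to decide whether to halt.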