Rapid Hitful S/W Upgrades
Akarsh Gupta (Google), Jason Bos (Cisco)

Agenda
- Introduction
- Silicon One Express-Boot
- RSR Types
- SONiC Integration

Problem Statement
- AI/ML workloads are highly sensitive to packet loss in the fabric network.
- Software upgrades in the fabric need to prevent (or minimize) packet loss.
- Non-Stop Forwarding (NSF) is the preferred software upgrade mechanism on fabric switches.
  - Zero dataplane traffic loss
  - Maximizes fabric availability
- NSF has its own challenges:
  - Large engineering effort to ensure backward compatibility and zero packet loss.
  - Cannot handle all software upgrade use-cases.
- Cold reboot: fallback for NSF upgrades.
  - Pause training jobs, drain racks
  - Exponential increase in training time.
- Need a fallback software upgrade mechanism that minimizes impact on AI/ML workloads.

What is RSR?
- Rapid Switch Reboot: near-NSF reboot with sub-second dataplane downtime.
- Express-boot and fast-fast-boot.
- Dataplane continues to forward traffic while the CPU is rebooted.
- Reprogrammed intent is cached in the vendor SDK.
- COMMIT operation: the SDK cache is written into the ASIC and the pipeline is restarted.
- Sub-second traffic loss occurs only during the COMMIT operation.

Figure (RSR timeline, labels only): Preparation Phase, Reboot Phase, Restore Phase; Disconnect ASIC, CPU Reboot, Connect to ASIC, Reprogram Intent to Cache; External Disconnect, Reset all tables (except ports), DMA cache into ASIC, Stop Pipeline, Start Pipeline; Traffic Loss; COMMIT.

NSF/RSR comparison
Figure (labels only):
- Warmboot/NSF: SDK v1, SDK v2, Serialize, Deserialize, Persistent storage
- Express boot/RSR: SDK v1, SDK v2, Memory DMA, Fresh config

Silicon One Express-Boot
- Stateless
- Any-to-Any SDK upgrade or downgrade
- 50 ms traffic interruption
- Ports & Protocols stay up
- Allow NOS configuration or logic changes
  - E.g. ACL table redefinition
  - New P4 program may be loaded

Figure (labels only): Express boot/RSR: SDK v1, SDK v2, Memory DMA, Fresh config

Express Boot Sequence
Figure (labels only): V1 Datapath, V2 Datapath, Critical Section, Resu