Shifting Left for Better Engineering Efficiency - Migration stories along the way
Ying Dai
Principal Software Engineer, Roblox

Introduction
- A little more about myself
- What I've done at Roblox
  - Migrations: A trend to shift left
  - Metrics: reliability and productivity
- Testing
- Production

Agenda
1. Background
2. The first migration
3. The second migration
4. Learnings

Background
- I started at the Telemetry team at Roblox
- Fast user growth! Exciting!
- Oncall, oncall & oncall
- The worst part: Every Sev call, people would question Telemetry first
- Bad production reliability - Low Engineering Productivity
- Our own DC. 2K microservices. Billions of Active Time Series.

Take a closer look at the Telemetry problem
- In-house Telemetry tool: from metrics collection & processing, to storage, to visualization
- Expensive. Inflexible. Slow.
- Inconsistent aggregation results with other standard tools
- Some teams have their own Telemetry setup with inconsistent metrics
- Less than 99.8% availability
- We need a better Telemetry solution!
- Diagram labels: Raw Data, Collector, Temp Storage, Processing, K,V, K,V store

Plan
- Step 1: New solution
  - Buy vs Build
  - Grafana Enterprise, VictoriaMetrics
  - Productionize
- Step 2: Transition
  - Dual-write the underlying data
  - Data consistency
  - Regenerate alerts & dashboards
- Step 3: Deprecation
  - Remove the old pipeline
  - Claim victory
- One quarter is enough!

Reality
- Migration of basic tools is very hard.
- Technically: A lot of customizations were made to the in-house tool
  - Annotations, Player Globe View, latency differences
- Engineers are used/attached to the old tool
  - It was used for almost 10 years
  - Even if we included a new link to every existing chart, engineers chose to stay with the old tool.
  - If we force engineers to move, it would harm their productivity

What happened then
- Instead of one quarter, it took three quarters
- 100% availability for multiple quarters
- Reliable
  - Multi region, multi AZ
  - Error isol