4137 - Panel Discussion: Hardware Acceleration for the Open Data Lakehouse (22 pages)
Velox and the Accelerated Age
Orri Erling

Strategy
- Down to the metal - Accelerator-first DC
- Front end, query optimization - New workloads need new intelligence
- Accelerator upside for analytical processing generally recognized
- First large opportunities in already GPU adjacent workloads
- Large scale adoption depends on new data centers
- This in turn depends on accelerator maturity and dominating performance across the board
- The white box era of data for AI is not just query execution, it needs QO as well
- Axiom: Velox based end to end solutions: Scale-up, Scale-out, big compute jobs

Part I - Into the Metal
- Velox must keep leading in compute
- An accelerator strategy is necessary for continued relevance
- Velox as compute ABI?
- Works out so far: Neuroblade, NVIDIA, Data Pelago, Voltron, Velox Wave
Welcome everybody.

Physics
- If Velox is not the first, then it must be the smartest.
- Understand the platform: How is GPU different?
- Device internal BW 20+x more than host-device BW
- Host-device round trip 10us. Costs the same as processing a column of 1M scalars
- CPU has 32+K L1 cache per thread. GPU has 100b. GPU memory access is not for free.
- GPU best throughput takes 500K runnable threads all the time (e.g. 1K threads * 100 SMs resident and 4x more ready to pick up)
- CPU gets utilization from every thread being independent, own data, own control, no sync.
- GPU gets utilization from a thread per row x 1M rows, all with the same control.
- GPU has 5+x memory throughput. But less memory. So fewer queries/files at a time.
- Extract all intra-query, inter-column parallelism.

Design for Heterogeneous Hardware
- Do not make round trips
- Extract all latent parallelism - Always 500K+ threads runnable
- Carefully preplan host-device transfers
- Careful about random vs sequential access, registers o