Scale Out Batch Inference with Ray
Cody Yu, Staff Software Engineer

Hello
Cody Yu
Tech Lead, LLM Performance, Anyscale
vLLM/SGLang/Apache TVM committer
ex-Founding Engineer, BosonAI
ex-Senior Applied Scientist, AWS AI
PhD, Computer Science, UCLA 19

We are in a GenAI era
[Images generated by OpenAI GPT-4o]

Batch inference is in high demand
Multi-modality data sources: Camera, Mic, PDF, Sensor, Tabular, Text, Audio, Image, Video
Structured and unstructured multi-modality data
Embedding Models, Large Language Models
Vector DB, Model Training, Classification, Knowledge Retrieval
Read (CPU), Pre-process (CPU), Model (GPU), Post-process (CPU), Cloud Storage

Challenges with Batch Inference
Scale: Large data scale (100s of GBs, TBs, or more)
Reliability: Spot + on-demand instances
Compute: Multi-stage + heterogeneous compute
Flexibility: Bring any OSS model and customize
SLAs: Focus on high throughput and low cost rather than low latency

Multi Layer Approach
Ray Core: A scalable AI compute engine
Ray Data: An efficient and scalable data processing pipeline on Ray
LLM inference engine powered by open source vLLM: The most popular open source LLM inference framework

Ray
Scalable AI Compute Engine

Ray Overview
Ray Libraries (distributed):
  Ray Tune/Train: Training
  RLlib: Reinforcement learning
  Ray Serve: Online inference
  Ray Data: Data processing
Ray Core: A general-purpose distributed execution layer
  Remote functions (tasks) and classes (actors)

Pushing Scalability to 1000s of Nodes
[Diagram: head node (driver, workers, raylet, Global Control Service (GCS), dashboard server) and two worker nodes (workers, raylet)]
Global Control Service (GCS), used for:
  Actor scheduling
  Placement group scheduling
  Node resource views
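
The Read (CPU), Pre-process (CPU), Model (GPU), Post-process (CPU) stages from the batch inference slide can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Ray's API and not the speaker's implementation: every function name here is hypothetical, the read stage is stubbed, the "model" is a fake that only counts whitespace-separated tokens, and a thread pool stands in for the CPU workers that Ray would schedule across a cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable, Iterator, List


def read(paths: Iterable[str]) -> Iterator[str]:
    """Read stage (CPU): yield one raw record per input path (stubbed, no I/O)."""
    for path in paths:
        yield f"raw text from {path}"


def batched(records: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group a stream of records into batches; the last batch may be shorter."""
    batch: List[str] = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def preprocess(batch: List[str]) -> List[str]:
    """Pre-process stage (CPU): normalize each record."""
    return [record.strip().lower() for record in batch]


def model(batch: List[str]) -> List[int]:
    """Model stage (GPU in the talk): hypothetical stand-in that counts tokens."""
    return [len(prompt.split()) for prompt in batch]


def postprocess(batch: List[int]) -> List[Dict[str, int]]:
    """Post-process stage (CPU): wrap raw model outputs into result rows."""
    return [{"num_tokens": n} for n in batch]


def run_pipeline(paths: Iterable[str], batch_size: int = 2,
                 cpu_workers: int = 2) -> List[Dict[str, int]]:
    """Run all four stages and collect the result rows in input order."""
    results: List[Dict[str, int]] = []
    with ThreadPoolExecutor(max_workers=cpu_workers) as cpu_pool:
        # Executor.map submits every batch up front and yields results
        # in input order, so pre-processing overlaps with the model loop.
        for clean_batch in cpu_pool.map(preprocess, batched(read(paths), batch_size)):
            results.extend(postprocess(model(clean_batch)))
    return results


if __name__ == "__main__":
    print(run_pipeline(["a.txt", "b.txt", "c.txt"]))
```

The sketch keeps the model stage on a single consumer loop while pre-processing fans out to several workers, which mirrors the heterogeneous CPU/GPU split listed among the challenges. Scaling each stage independently across nodes is the part this sketch leaves out.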