3741 - Improving Pinterest's Data Query Performance with Apache Gluten (20 pages)
IBM TechXchange, October 2025

Enhancing Pinterest's Data Platform
A 2025 Update on Apache Gluten Integration & our Spark Platform

2025 Pinterest

About
- Felix, Software Engineer
- Zaheen, Engineering Manager
- Big Data Query Platform Team (SparkSQL, Trino)

Agenda
1. Our Platform
2. A History of Spark
3. Platform Integration
4. Performance
5. Challenges & Learnings
6. Future Plans
7. Spark Task Retry

Our Platform: SparkSQL
- 60k+ daily scheduled SparkSQL queries
- 15k+ worker instances
- K8s: migrating from YARN
- 8k+ daily ad hoc SparkSQL queries
- 500+ daily ad hoc users
- Celeborn shuffle service
Source: Pinterest internal data; Global analysis; Q3 2025

A History of Spark - Timeline
- Moved from Hive to Spark 2.4 in late 2021
- Spark provided significant performance gains over Hive
- In late 2022 we moved to Spark 3.2
- Spark costs were exploding
- Compute became memory bound
- Various projects introduced to reduce memory
- Little we can do with vanilla Spark to improve performance

A History of Spark - Architecture

A History of Spark - Always Improving
- The query platform team is always looking for ways to improve query performance
- Even a 10% improvement would result in significant savings for the business
- So we began looking at products that could help us

A History of Spark - Our Requirements
- Produce speedups of at least 10% to make the ROI worth it
- Migration impact on users and difficulty of migration
- Reduction in memory usage
- A frictionless experience
- Users should get speedups without doing anything
- We shouldn't have to re-architect our entire system

A History of Spark - The Market
- There are many solutions on the market
- Photon, NVIDIA RAPIDS, DataFusion, StarRocks, ClickHouse, Comet, Velox

Why Gluten + Velox
- Increasing job requirements from customers
- ML jobs
- Clusters are often memory bound
- Larg