Real-time Lakehouse at Shopee
Luo Li
Shopee Data Infra

01 Scenario
02 Practice
03 Optimization
04 Future Plan

01 The Stream Scenario of Shopee

Building Data Warehouse Based on CDC Data
- Classic Lambda architecture: wastes resources and makes it difficult to ensure data consistency between the stream pipeline and the batch pipeline.
- Requires external asynchronous merge tasks for MOR tables: high data latency and high resource consumption.
Consistency Challenge
How to build a data warehouse based on the business database and provide a unified data layer for batch processing and stream processing is the first problem we are facing now.

Incremental Computation Scenario
In the stream pipeline, the downstream task only needs the data that has been changed by the upstream task. The problems of the traditional process are as follows:
- Maintaining a big Flink state as a full view hurts the stability of Flink tasks.
- Reusing the Flink state is not easy.
- It is difficult to trace the data process.
Need a changelog-like mechanism to trace the state of records and data.

Near-real-time Dashboard Scenario
To support near-real-time computing (latency less than 10 min) without a suitable data storage, we tend to adopt a stovepipe data processing model.
- No data reuse, redundant processing.
- All the computing logic is piled up in a single task, which makes the task complex and difficult to maintain.
- Needs an extra storage such as ClickHouse.
Need a storage layer that keeps the data updated and supports near-real-time analysis.

Need a storage layer which can support:
- changelog and delta & streaming update
- easy to manage
- high performance (update & analysis)
- cost efficient
- work cooperatively with Flink

02 Typical Practice Based on Flink and Paimon

Flink + Paimon Data Ingestion
We provide users with data integration services from the database to the data warehouse. We have replaced Hudi with Paimon as the default dat