Flink Batch SQL Improvements on Lakehouse
Liu Dalong (刘大龙), R&D Engineer, Alibaba Cloud
Streaming Lakehouse Meetup

Contents
01 Challenges of Flink Batch on Paimon
02 Core Flink Batch Optimizations
03 Future Plans

01 Challenges of Flink Batch on Paimon

Streaming Warehouse: Flink + Paimon
[Diagram: Logs and RDBMS (binlog) are ingested through Flink CDC into Paimon (Flink Table Store) tables organized as ODS, DWD, DWS, and ADS layers. Flink SQL (Streaming & Batch) jobs move data between the layers, and the results feed Data Serving Systems and Flink SQL Queries.]
Simple architecture
Unified semantics
Consistent data
Low cost
Transparent and open

Flink Batch Challenges
Schema changes
Row-level updates and deletes
Snapshot management
Time travel queries
Efficient ETL & Ad-hoc

02 Core Flink Batch Optimizations

Year Recap of Apache Flink Batch
Flink 1.16 (2022.10)
    SQL Gateway
    Automatic Collection of Statistics
    Dynamic Partition Pruning
    Join Hint
    Adaptive Hash Join
    Speculative Execution
Flink 1.17 (2023.03)
    Update & Delete
    DPP Strategy Optimize
    Bushy Join Reorder
    Adaptive Local HashAgg
    Adaptive Batch Scheduler
Flink 1.18 (2023.09)
    Lakehouse APIs
    Flink JDBC Driver
    Runtime Filter
    Operator Fusion Codegen

Part 1: Lakehouse API Enhance
Data Management API
    ALTER TABLE (FLINK-21634, FLINK-27237)
    CREATE/REPLACE TABLE AS SELECT (FLIP-218, FLIP-305, FLIP-303)
    CALL Procedure (FLIP-311)
    Time Travel (FLIP-308)
    UPDATE/DELETE (FLIP-282)
    TRUNCATE TABLE (FLIP-302)

Part 2: Join Optimization

Statistics Enhance
Analyze Table (FLIP-240)
    Triggered manually; results are persisted to the Catalog
    Rich statistics:
        rowCount
        nullCount, ndv
        min, max
        avgLen, maxLen
SupportReportStatistics (FLIP-231)
    Collected automatically, not persisted, more up to date
    Already supported by the Flink CSV, Parquet, and ORC formats
    Already supported by Paimon
The Planner first tries to read statistics from the Catalog; if none are found, it obtains them at planning time through SupportReportStatistics.

Join Hint
Broadcast Hash Join
    Broadcast the small table and build a hash table from it
    Only supports equi-join
Shuffle Hash Join
    Shuffle both sides by join key and build a hash table from the smaller table
    Only supports equi-join
Sort Merge Join
    Shuffle both sides by join key and sort
    Only supports equi-join
Nested Loop Join
    Broadcast the small table; spill to disk if it is too large
    Supports both equi-join and non-equi-join

Join Hint
Without statistics, the Planner's