
Master Schema Translations in the Era of Open Data Lake.pdf

Uploaded by: Fl****zo | ID: 718869 | 2025-06-22 | PDF | 12 pages | 670.55 KB

Part of the collection: 2025 Data + AI Summit (data+ai summit2025) presentation slides


Master Schema Translations
In the Era of Open Data Lake
Eric Sun (Coinbase)
2025-0611

Slide 2. Schema Translation is the next frontier to tame

After the Table Format is settled for the Open Data Lake, we enjoy centralized catalogs like Unity Catalog and Gravitino to inventory or enumerate schemas from Data Lake/Warehouse, OLTP, ML/AI Model, and Stream, yet we still face tedious caveats when mapping or translating a schema from one system to another.

- Data types can be incompatible
- NoSQL and schemaless systems may be too flexible
- Nested structure and OneOf/Union | Flatten/Shredded | Overloaded
- Index, Unique Constraint, Partition, Hash, Sort

Slide 3. Data Types: Different type systems

IDL and programming languages have different/fewer types than SQL.

- Timestamp, UUID, JSON/JSONB/BJSON/Variant
- High precision (38) or Decimal256
- Unsigned Numerics (uint32, uint64)
- OneOf (Union) and Enum are expressive in IDL, but difficult in SQL
- The same field name in a schemaless DB can be overloaded to several types
- Frequently-used predicates can't be nested in a Map or Array or Variant

Slide 4. Data Types: How expressive/precise can it be? Does the use case care?

[Comparison table. Systems: Iceberg, Parquet, Spark SQL, BigQuery, Arrow, ClickHouse. Types: Unsigned INT, Float16, Timestamp(9), Decimal256, BSON, ENUM, Dictionary, IPv4/v6, Varchar(n). Surviving cell values: Partially, Logical, Partially, Partially.]

Slide 5. Index, Partition, Hash, Sort: Things can get lost

- Some data systems support index and PK, but others don't
- Such info is lost when transporting data from one system to another
- Sorting (Z-Order) is crucial for query performance; it can mimic an index
- Range/List/Hash partitioning helps distribute data at a coarse grain
- Shard/Hash/Cluster/Bucket is a key design pattern for some DBs
- Extra metadata is required to manage geo-partitioned DBs, wide tables, and federated tables across in-memory, hot, and cold storage tiers

Slide 6. Point-to-Point (spaghetti) vs Hub-n-Spoke: a standard is needed

A metadata model
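The point-to-point versus hub-and-spoke contrast, together with the unsigned-integer problem from the data type slides, can be illustrated with a small sketch. Everything here is an assumption made for illustration: `LogicalInt`, `SOURCE_TO_HUB`, `hub_to_signed_sql`, and the generic "signed-only SQL" target are hypothetical names, not part of the talk or of any real catalog API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogicalInt:
    """Hypothetical canonical ("hub") integer type."""
    bits: int
    signed: bool


# Spoke 1: source type names -> hub type (illustrative subset).
SOURCE_TO_HUB = {
    "int32": LogicalInt(32, signed=True),
    "int64": LogicalInt(64, signed=True),
    "uint32": LogicalInt(32, signed=False),
    "uint64": LogicalInt(64, signed=False),
}


def hub_to_signed_sql(t: LogicalInt) -> str:
    """Spoke 2: render a hub type for a hypothetical signed-only SQL target."""
    if t.bits > 64:
        raise ValueError(f"unsupported width: {t.bits}")
    # An unsigned N-bit value needs N+1 bits in a signed type.
    needed = t.bits if t.signed else t.bits + 1
    if needed <= 32:
        return "INT"
    if needed <= 64:
        return "BIGINT"
    # Max uint64 is 18446744073709551615, which has 20 decimal digits.
    return "DECIMAL(20, 0)"


def translate(source_type: str) -> str:
    """Source type -> hub -> target type."""
    return hub_to_signed_sql(SOURCE_TO_HUB[source_type])


def translator_count(n_systems: int, hub: bool) -> int:
    """Directed translators needed: 2n with a hub, n*(n-1) point-to-point."""
    return 2 * n_systems if hub else n_systems * (n_systems - 1)
```

With the six systems named in the comparison table, point-to-point translation would need 30 directed translators, while a hub needs 12 (one in and one out per system). The widening of `uint64` to a decimal keeps the value range but changes the type family, which is the kind of trade-off the "Does the use case care?" question points at.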

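This talk discusses the challenges of translating schemas between different systems in the era of the open data lake. Key points:

1. **The schema translation problem**: Even after the table format is unified, data lakes/warehouses, OLTP, ML/AI models, and streams still face incompatible data types, overly flexible NoSQL systems, and nested structures that are hard to handle.
2. **Data type differences**: Systems differ in how expressive and precise their data types are, for example high-precision numerics, timestamps, and UUIDs.
3. **Loss of index and partition information**: System-specific design elements such as indexes and partitions may be lost when data is moved between systems.
4. **Metadata and schema management**: The talk proposes a metadata model with rich types that carries connections, traits/capabilities, primary/unique keys, indexes, partitions, and similar information.
5. **Automation and APIs**: The talk introduces the "Schemaster" tool, which provides REST and gRPC APIs for automating data ingestion flows and supports code generation for scripting languages.
6. **Standardizing schema translation**: Standardized logical types are needed to keep data quality and semantics consistent across systems.
7. **Reverse ETL and applications**: Schema translation matters for data publishing and cross-system sharing, and for data governance and AI generation use cases.

The talk stresses that automation and standardization make cross-system schema translation more efficient, and it seeks partners to jointly maintain the standard.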
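Points 3 and 4 above can be sketched as a metadata record that carries keys, indexes, partitioning, and sort order alongside the columns, plus a check for what a given target would silently drop. This is a minimal sketch under stated assumptions: `TableMeta`, `TargetCapabilities`, and `dropped_features` are hypothetical names, and the capability flags describe an imaginary target rather than any real system.

```python
from dataclasses import dataclass, field


@dataclass
class TableMeta:
    """Hypothetical hub metadata for one table."""
    name: str
    columns: dict[str, str]  # column name -> logical type name
    primary_key: list[str] = field(default_factory=list)
    indexes: list[list[str]] = field(default_factory=list)
    partition_by: list[str] = field(default_factory=list)
    sort_by: list[str] = field(default_factory=list)


@dataclass(frozen=True)
class TargetCapabilities:
    """Which physical-design features a hypothetical target can express."""
    primary_key: bool
    indexes: bool
    partition_by: bool
    sort_by: bool


def dropped_features(meta: TableMeta, caps: TargetCapabilities) -> list[str]:
    """List the features set on the table that the target cannot represent."""
    features = ("primary_key", "indexes", "partition_by", "sort_by")
    return [f for f in features if getattr(meta, f) and not getattr(caps, f)]


# Example: an OLTP-style table moved to a target with partitioning and
# sorting but no primary keys or indexes (capabilities are assumed).
orders = TableMeta(
    name="orders",
    columns={"order_id": "uuid", "user_id": "int64", "created_at": "timestamp"},
    primary_key=["order_id"],
    indexes=[["user_id"]],
    sort_by=["created_at"],
)
lake_like = TargetCapabilities(
    primary_key=False, indexes=False, partition_by=True, sort_by=True
)
lost = dropped_features(orders, lake_like)
```

Keeping the dropped features in the hub record, rather than discarding them at translation time, would let a later translation back to a system that supports them restore the original design.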

