Master Schema Translations in the Era of Open Data Lake
Eric Sun (Coinbase)
2025-06

Schema Translation is the next frontier to tame

Now that the table format is settled for the Open Data Lake, we enjoy centralized catalogs such as Unity Catalog and Gravitino to inventory or enumerate schemas from Data Lake/Warehouse, OLTP, ML/AI Model, and Stream systems, yet we still face tedious caveats when mapping or translating a schema from one system to another.
- Data types can be incompatible
- NoSQL and schemaless systems may be too flexible
- Nested structure and OneOf/Union | Flatten/Shredded | Overloaded
- Index, Unique Constraint, Partition, Hash, Sort

Data Types: different type systems

IDLs and programming languages have different (or fewer) types than SQL.
- Timestamp, UUID, JSON/JSONB/BSON/Variant
- High precision (38) or Decimal256
- Unsigned numerics (uint32, uint64)
- OneOf (Union) and Enum are expressive in IDL, but difficult in SQL
- The same field name in a schemaless DB can be overloaded to several types
- Frequently used predicates can't be nested in a Map, Array, or Variant

Data Types: how expressive/precise can it be? Does the use case care?

[Table: support for Unsigned INT, Float16, Timestamp(9), Decimal256, BSON, ENUM, Dictionary, IPv4/v6, and Varchar(n) across Iceberg, Parquet, Spark SQL, BigQuery, Arrow, and ClickHouse; cell labels include "Partially" and "Logical".]

Things can get lost: Index, Partition, Hash, Sort

- Some data systems support indexes and primary keys (PK), but others don't. Such information is lost when transporting data from one system to another.
- Sorting (Z-Order) is crucial for query performance; it can mimic an index.
- Range/List/Hash partitioning helps distribute data at a coarse grain.
- Shard/Hash/Cluster/Bucket is a key design pattern for some DBs.
- Extra metadata is required to manage geo-partitioned DBs, wide tables, and tables federated across in-memory, hot, and cold storage tiers.

Point-to-Point (spaghetti) vs Hub-n-Spoke: a standard is needed

A metadata model