Yann Collet, Data Compression and AI

OpenZL: Data Compression for the Age of AI
Yann Collet

Artificial Intelligence (AI)

Reaching asymptotic limits
  - Small gains for large energy cost
  - New LZ77 format?
      - Some small improvement
      - But ecosystem cost: confusion
      - Conjecture: a new entrant must offer significant improvements
  - Other variants (LZ78, ROLZ, Grammar, Re-Pair, etc.)
      - Converge towards the same limit
      - Fundamental assumption: data is a bunch of (undifferentiated) bytes
  - High compression algorithms (PPM, BWT, CM, NN, etc.)
      - Too slow for datacenters

Format-specific Compression: Beyond Zstandard

A Trivial Example
  - Conjecture: understanding the data opens new ways to interpret and then better compress the data.
  - LZ alone can't compress this data (no repeated byte)
  - Trivially compressible after delta

SAO
  - Smithsonian Astrophysical Observatory
  - Catalog of stars
  - Part of the Silesia compression corpus: 7,251,944 bytes
  - 258,997 stars
  - Binary format

SAO format description
  - Header + array of records
  - Star record: 28 bytes
      Real*8       SRA0
      Real*8       SDEC0
      Character*2  IS
      Integer*2    MAG V
      Real*4       XRPM
      Real*4       XDPM
  - Field groups: coordinates, movements, attributes

SAO Compression comparison

                         zstd -3      lzma -9      cmix          SAO-specific
  Compressed Size        5,551,154    4,416,774    3,726,762     3,516,303
  Compression Factor     1.31         1.64         1.94          2.06
  Compression Speed      100 MB/s     2.9 MB/s     0.001 MB/s    215 MB/s
  Decompression Speed    750 MB/s     45 MB/s      0.001 MB/s    800 MB/s

  Test setup: Skylake core 3.6 GHz, Ubuntu 24.04, clang-19

  Conclusion: exploiting the format specification leads to a better compression ratio and better speed.

The double edge of format-centric compression
  - Time to design
      - Time to learn fundamentals
      - Time and risks to discover a good solution
      - Rebuild the same fundamental units
  - Time to optimize
  - Time to safeguard (intrusions, fuzzing)
  - Tricky deployment and evolutions
      - Decoders must be deployed first across all receivers; only then can the new encoder be employed.
      - Data changes all the time in Dat