Microsoft Research Asia

Towards Interactive World Simulator
Tianyu He
Microsoft Research Asia
4/9/2025

World Model
Policy Model: p(a | s)
World Model: p(s' | s, a)
Interactive!
David Ha, Jürgen Schmidhuber. World Models. arXiv:1803.10122.

Outline
1. How to represent the visual world?
   We introduce VidTok, a cutting-edge family of video tokenizers.
2. How to enable interactive visual world modeling?
   Interact with action: autoregressive world model on Minecraft.
   Interact with latent action: human-to-robot cross-embodiment generalization.
   Interact with video demonstration: zero-shot video imitation in the real world.
   Interact with camera viewpoint: explicit world model with underlying 3D structure.

VidTok
A cutting-edge family of video tokenizers that excels in both continuous and discrete tokenizations.
Efficient Architecture: Separate spatial and temporal sampling reduces computational complexity without sacrificing quality.
Advanced Quantization: Finite Scalar Quantization (FSQ) addresses training instability and codebook collapse in discrete tokenization.
Enhanced Training: A two-stage strategy (pre-training on low-resolution videos and fine-tuning on high-resolution videos) boosts efficiency. Reduced frame rates improve motion dynamics representation.
Tang et al. VidTok: A Versatile and Open-Source Video Tokenizer. arXiv:2412.13061.

VidTok: Leading Reconstruction Performance
VidTok, trained on a large-scale video dataset, outperforms previous models across all metrics, including PSNR, SSIM, LPIPS, and FVD.
Tang et al. VidTok: A Versatile and Open-Source Video Tokenizer. arXiv:2412.13061. https:/

VidTok: Leading Reconstruction Performance
VidTok exhibits a distinct advantage in detail reconstruction fidelity and subjective viewing experience.
Tang et al. VidTok: A Versatile a