John Nguyen
Member of Technical Staff
AMI Labs
Biography

John Nguyen is a Member of Technical Staff at AMI Labs, where he focuses on building visually intelligent systems and advancing research toward unified world models. His work sits at the intersection of multimodal generation, perception, and understanding, with a particular emphasis on open-world intelligence and large-scale system design.

John has contributed to a range of influential research projects exploring the future of AI architectures, including concurrent mixed-modal generation (OneFlow), temporally expansive video generation (Flowception), scaling laws for multimodal pretraining (Beyond Language Modeling), and tokenization-free language modeling (Byte Latent Transformer), which received an ACL Outstanding Paper award.

Alongside his research, he has built widely adopted systems infrastructure. He is the creator of Opacus, a library for differentially private training with over four million downloads, and Papaya, a large-scale asynchronous federated learning platform presented at MLSys 2022 and deployed to millions of users.

Previously a Research Engineer at Meta’s FAIR lab, John worked on large multimodal models and distributed learning systems. He graduated cum laude from the University of California, Davis, earning degrees in Computer Science and Statistics, as well as a Master’s in Computer Science.

Career History
2020–2026
Meta
Research Engineer
Key Papers
Beyond Language Modeling
The visual world offers a critical axis for advancing foundation models beyond language. Despite growing interest in this direction, the design space for native multimodal models remains opaque. We provide empirical clarity through controlled, from-scratch pretraining experiments, isolating the factors that govern multimodal pretraining without interference from language pretraining. We adopt the Transfusion framework, using next-token prediction for language and diffusion for vision, to train on diverse data including text, video, image-text pairs, and even action-conditioned video. Our experiments yield four key insights: (i) Representation Autoencoder (RAE) provides an optimal unified visual representation by excelling at both visual understanding and generation; (ii) visual and language data are complementary and yield synergy for downstream capabilities; (iii) unified multimodal pretraining leads naturally to world modeling, with capabilities emerging from general training; and (iv) Mixture-of-Experts (MoE) enables efficient and effective multimodal scaling while naturally inducing modality specialization. Through IsoFLOP analysis, we compute scaling laws for both modalities and uncover a scaling asymmetry: vision is significantly more data-hungry than language. We demonstrate that the MoE architecture harmonizes this scaling asymmetry by providing the high model capacity required by language while accommodating the data-intensive nature of vision, paving the way for truly unified multimodal models.
2026 · arXiv
4 citations
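
For readers curious what the IsoFLOP-style scaling-law fit described in the abstract above looks like in practice, here is a minimal illustrative sketch. It uses synthetic, made-up loss measurements and a generic saturating power law; it is not the paper's data, code, or exact fitting procedure, and the function and parameter names (power_law, d_billion, the chosen exponents) are hypothetical.

# Minimal sketch of fitting per-modality data scaling laws from IsoFLOP-style
# measurements. All numbers below are synthetic placeholders, not paper results.
import numpy as np
from scipy.optimize import curve_fit

def power_law(d, a, alpha, c):
    # Saturating power law: loss decays with data d toward an irreducible floor c.
    return a * d ** (-alpha) + c

rng = np.random.default_rng(0)
d_billion = np.logspace(0, 3, 12)  # hypothetical token counts, 1B to 1T, in billions

# Synthetic "measurements": language is given a larger decay exponent than vision,
# mimicking the asymmetry the abstract describes (vision being more data-hungry).
language_loss = power_law(d_billion, a=2.5, alpha=0.30, c=1.8) * (1 + 0.01 * rng.standard_normal(12))
vision_loss = power_law(d_billion, a=2.5, alpha=0.15, c=0.9) * (1 + 0.01 * rng.standard_normal(12))

for name, loss in [("language", language_loss), ("vision", vision_loss)]:
    (a, alpha, c), _ = curve_fit(power_law, d_billion, loss, p0=(1.0, 0.2, 1.0), maxfev=20000)
    print(f"{name}: fitted exponent alpha ~= {alpha:.2f}")

# A smaller fitted exponent means loss improves more slowly per additional token,
# i.e. that modality needs more data to reach a given loss level.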