Quentin Duval
Member of Technical Staff
AMI Labs, Montréal
Biography

Quentin Duval is an accomplished Artificial Intelligence Research Engineer specializing in large-scale machine learning systems, with a strong focus on self-supervised learning for images and video. Currently a Member of Technical Staff at AMI and formerly at Meta's Facebook AI Research lab in Montréal, he has been a core contributor to state-of-the-art video generation models such as Emu Video and Movie Gen. His work emphasizes optimizing large-scale training architectures, including advanced parallelism techniques, as well as building robust data pipelines and evaluation systems for high-performance AI models.

Prior to his research career, Quentin spent over eight years at Murex in Paris, where he rose to Principal Software Engineer and led a team of developers working on financial trade repository systems. There, he drove major architectural transformations toward scalable, distributed services and modern asynchronous systems.

Quentin combines deep technical expertise in Python, PyTorch, and distributed systems with a strong engineering background in C++ and Java. He is also an active contributor to open-source projects, a conference speaker, and the author of a technical blog. Educated at Telecom ParisTech and the University of Stuttgart, he brings a rare blend of research excellence and industrial leadership.

Career History
2023-2026
Meta
AI Research Engineer on Video Generation
2020-2023
Meta
AI Research Engineer on Image Representation Learning
2013-2019
Murex
Principal Back Office Software Engineer
Key Papers
We present Emu Video, a text-to-video generation model that factorizes the generation into two steps: first generating an image conditioned on the text, and then generating a video conditioned on the text and the generated image. We identify critical design decisions, namely adjusted noise schedules for diffusion and multi-stage training, that enable us to directly generate high-quality, high-resolution videos without requiring a deep cascade of models as in prior work. In human evaluations, our generated videos are strongly preferred in quality compared to all prior work: 81% vs. Google's Imagen Video, 90% vs. Nvidia's PYOCO, and 96% vs. Meta's Make-A-Video. Our model outperforms commercial solutions such as RunwayML's Gen2 and Pika Labs. Finally, our factorizing approach naturally lends itself to animating images based on a user's text prompt, where our generations are preferred 96% over prior work.
2023 · arXiv
152 citations
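
The two-step factorization described in this abstract can be sketched as a simple pipeline. The sketch below is illustrative only: the class names, tensor shapes, and interfaces (TextToImage, ImageToVideo, generate) are placeholders, not the actual Emu Video implementation, and the placeholder forward passes return random tensors where a real model would run a diffusion denoising loop.

```python
# Minimal sketch of factorized text-to-video generation:
# step 1: text -> image, step 2: (text, image) -> video.
import torch
import torch.nn as nn


class TextToImage(nn.Module):
    """Stage 1 placeholder: a text-conditioned image generation model."""

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        # A real model would run an iterative denoising loop here.
        return torch.randn(text_emb.shape[0], 3, 512, 512)


class ImageToVideo(nn.Module):
    """Stage 2 placeholder: video generation conditioned on text and the stage-1 image."""

    def forward(self, text_emb: torch.Tensor, image: torch.Tensor,
                num_frames: int = 16) -> torch.Tensor:
        # A real model would denoise a video latent, using the stage-1 image
        # as a strong conditioning signal (e.g. the first frame).
        b, c, h, w = image.shape
        return torch.randn(b, num_frames, c, h, w)


def generate(text_emb: torch.Tensor, t2i: TextToImage, i2v: ImageToVideo) -> torch.Tensor:
    """Factorized text-to-video: generate an image, then animate it."""
    image = t2i(text_emb)            # step 1: text -> image
    return i2v(text_emb, image)      # step 2: (text, image) -> video


video = generate(torch.randn(2, 768), TextToImage(), ImageToVideo())
print(video.shape)  # torch.Size([2, 16, 3, 512, 512])
```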
This paper demonstrates an approach for learning highly semantic image representations without relying on hand-crafted data-augmentations. We introduce the Image-based Joint-Embedding Predictive Architecture (I-JEPA), a non-generative approach for self-supervised learning from images. The idea behind I-JEPA is simple: from a single context block, predict the representations of various target blocks in the same image. A core design choice to guide I-JEPA towards producing semantic representations is the masking strategy; specifically, it is crucial to (a) sample target blocks with sufficiently large scale (semantic), and to (b) use a sufficiently informative (spatially distributed) context block. Empirically, when combined with Vision Transformers, we find I-JEPA to be highly scalable. For instance, we train a ViT-Huge/14 on ImageNet using 16 A100 GPUs in under 72 hours to achieve strong downstream performance across a wide range of tasks, from linear classification to object counting and depth prediction.
2023 · Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
1,008 citations
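
The core objective described in this abstract, predicting the representations of target blocks from a single context block, can be sketched in a few lines. The sketch below is a simplification under stated assumptions: the linear "encoders", the block indices, and the pooled prediction are stand-ins for the paper's ViT-based context encoder, EMA target encoder, and positional-token predictor.

```python
# Minimal sketch of the I-JEPA objective: predict target-block representations
# from a context block, without any generative reconstruction of pixels.
import torch
import torch.nn as nn
import torch.nn.functional as F

embed_dim, num_patches = 256, 196                   # e.g. a 14x14 grid of patch tokens

context_encoder = nn.Linear(embed_dim, embed_dim)   # stand-in for the context ViT
target_encoder = nn.Linear(embed_dim, embed_dim)    # EMA copy of the encoder in the paper
predictor = nn.Linear(embed_dim, embed_dim)         # narrow ViT predictor in the paper

patches = torch.randn(8, num_patches, embed_dim)    # fake patch embeddings for a batch

# Sample one large, spatially distributed context block and one semantic-scale
# target block (the index ranges below are purely illustrative).
context_idx = torch.arange(0, 120)
target_idx = torch.arange(140, 180)

context_repr = context_encoder(patches[:, context_idx])
with torch.no_grad():                               # targets are not backpropagated
    target_repr = target_encoder(patches[:, target_idx])

# Predict the target-block representations from the pooled context and regress
# onto them in representation space (the paper predicts per token, with
# positional cues for each target block).
pred = predictor(context_repr.mean(dim=1, keepdim=True)).expand_as(target_repr)
loss = F.mse_loss(pred, target_repr)
loss.backward()
print(float(loss))
```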