Jihan Yang, a rising researcher in machine learning and computer vision, is currently a postdoctoral associate at New York University's Courant Institute of Mathematical Sciences, where he works with Professor Saining Xie on next-generation multimodal AI systems. His research sits at the frontier of Multimodal Large Language Models (MLLMs), with a focus on spatial intelligence, long-video understanding, and agent-based reasoning. Yang has quickly gained recognition in the field through a series of high-impact publications at leading conferences, including CVPR, NeurIPS, ICML, and ICLR. His work explores how AI systems can better perceive, reason about, and interact with complex real-world environments.
Prior to joining NYU, Yang completed his PhD at the University of Hong Kong under the supervision of Professor Xiaojuan Qi, where he built a strong foundation in 3D vision, domain adaptation, and open-world scene understanding. His early work advanced 3D object detection and semantic segmentation, including influential papers on self-training methods and sim-to-real transfer. Over time, his research has evolved toward a more ambitious goal: developing unified models that bridge perception and reasoning. This trajectory is reflected in projects such as "Thinking in Space" and "Cambrian," which investigate how multimodal models can construct and manipulate spatial representations across time, laying the groundwork for more general-purpose AI systems.
Yang's academic journey began at Sun Yat-sen University, where he earned his undergraduate degree in software engineering, graduating with top honors. Alongside his academic work, he gained industry research experience through internships at leading AI labs, including Tencent, SenseTime, and YITU Technology, where he worked on applied computer vision problems. Known for combining theoretical depth with practical impact, Yang has also contributed to widely used open-source tools such as OpenPCDet. His work continues to push the boundaries of how AI systems integrate vision, language, and real-world grounding, positioning him as part of a new generation of researchers shaping the evolution of intelligent, multimodal systems.