Xingyi Zhou
Scientist
AMI Labs · New York
Biography

Xingyi Zhou is a machine learning scientist whose work spans computer vision, object detection, and deep learning. He earned his Ph.D. from the University of Texas at Austin, where his research focused on visual recognition systems.

Before his doctoral studies, Zhou excelled in competitive programming, earning a Gold Medal at the ACM International Collegiate Programming Contest (ICPC) Asia Regional, an early sign of the algorithmic and mathematical rigor that would later characterize his research.

Before joining AMI Labs, Zhou held research scientist positions at two of the world's leading AI organizations: Google Research and Meta AI Research. During this period, he made significant contributions to the field of computer vision, most notably through his work on object detection and representation learning.

He is widely recognized as the lead author of CenterNet, an influential keypoint-based object detection framework, and as a contributor to CenterPoint, which extends center-based detection to 3D. His research has been published in top-tier venues, including CVPR, ECCV, and NeurIPS, is widely cited, and has influenced both academic and applied computer vision pipelines.

Career History
2021–2024
Meta AI (FAIR)
Research Scientist
2020–2021
Google Research
Research Scientist
2016–2021
UT Austin
Ph.D., Computer Science
Key Papers
Objects as Points (CenterNet)
Detection identifies objects as axis-aligned boxes in an image. Most successful object detectors enumerate a nearly exhaustive list of potential object locations and classify each. This is wasteful, inefficient, and requires additional post-processing. In this paper, we take a different approach. We model an object as a single point: the center point of its bounding box. Our detector uses keypoint estimation to find center points and regresses to all other object properties, such as size, 3D location, orientation, and even pose. Our center point based approach, CenterNet, is end-to-end differentiable, simpler, faster, and more accurate than corresponding bounding box based detectors. CenterNet achieves the best speed-accuracy trade-off on the MS COCO dataset, with 28.1% AP at 142 FPS, 37.4% AP at 52 FPS, and 45.1% AP with multi-scale testing at 1.4 FPS. We use the same approach to estimate 3D bounding boxes on the KITTI benchmark and human pose on the COCO keypoint dataset. Our method performs competitively with sophisticated multi-stage methods and runs in real-time.
2019 · arXiv
5,804 citations
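
To make the center-point formulation concrete, here is a minimal sketch of how such a decoder could turn network outputs into detections. It is an illustration only, assuming a per-class center heatmap plus size (wh) and sub-pixel offset heads; the function name, tensor shapes, and the k=100 cutoff are assumptions for this example, not the paper's exact interface.

```python
import torch
import torch.nn.functional as F

def decode_centers(heatmap, wh, offset, k=100):
    """heatmap: (C, H, W) per-class center scores in [0, 1];
    wh, offset: (2, H, W) size and sub-pixel-offset regressions."""
    # A peak is a cell that survives 3x3 max-pooling unchanged; masking
    # everything else stands in for IoU-based non-maximum suppression.
    pooled = F.max_pool2d(heatmap[None], kernel_size=3, stride=1, padding=1)[0]
    heatmap = heatmap * (pooled == heatmap)

    C, H, W = heatmap.shape
    scores, idx = heatmap.reshape(-1).topk(k)   # top-k peaks over all classes
    classes = idx // (H * W)
    ys = (idx % (H * W)) // W
    xs = idx % W

    # Refine each peak with its regressed offset, then expand to a box.
    cx = xs.float() + offset[0, ys, xs]
    cy = ys.float() + offset[1, ys, xs]
    w, h = wh[0, ys, xs], wh[1, ys, xs]
    boxes = torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=1)
    return boxes, scores, classes

# Toy call: an 80-class head on a 128x128 output feature map.
boxes, scores, classes = decode_centers(
    torch.rand(80, 128, 128), torch.rand(2, 128, 128), torch.rand(2, 128, 128))
```

The 3x3 max-pool equality test is what lets a center-based detector skip box-based NMS: only local maxima of the heatmap survive as candidate detections.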
Visual Lexicon (ViLex)
We present Visual Lexicon, a novel visual language that encodes rich image information into the text space of vocabulary tokens while retaining intricate visual details that are often challenging to convey in natural language. Unlike traditional methods that prioritize either high-level semantics (e.g., CLIP) or pixel-level reconstruction (e.g., VAE), ViLex simultaneously captures rich semantic content and fine visual details, enabling high-quality image generation and comprehensive visual scene understanding. Through a self-supervised learning pipeline, ViLex generates tokens optimized for reconstructing input images using a frozen text-to-image (T2I) diffusion model, preserving the detailed information necessary for high-fidelity semantic-level reconstruction. As an image embedding in the language space, ViLex tokens leverage the compositionality of natural language, allowing them to be used independently as "text tokens" or combined with natural language tokens to prompt pretrained T2I models with both visual and textual inputs, mirroring how we interact with vision-language models (VLMs). Experiments demonstrate that ViLex achieves higher fidelity in image reconstruction compared to text embeddings, even with a single ViLex token. Moreover, ViLex successfully performs various DreamBooth tasks in a zero-shot, unsupervised manner without fine-tuning T2I models. Additionally, ViLex serves as a powerful vision encoder, consistently improving vision-language model performance across 15 benchmarks relative to a strong SigLIP baseline.
2024 · arXiv
10 citations
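
The training recipe the abstract describes (tokens optimized so a frozen T2I diffusion model can reconstruct the input image) can be sketched schematically. Everything below is a hypothetical stand-in: Tokenizer, FrozenDenoiser, the toy noise schedule, and all sizes are illustrative assumptions, not the paper's architecture or API.

```python
import torch
import torch.nn as nn

B, dim, n_tok = 4, 64, 8  # toy sizes for illustration

class Tokenizer(nn.Module):
    """Trainable image encoder: pools patch features into n_tok 'ViLex
    tokens' meant to live in the T2I model's text-embedding space."""
    def __init__(self):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.queries = nn.Parameter(torch.randn(n_tok, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, img):                                    # (B, 3, 64, 64)
        feats = self.patchify(img).flatten(2).transpose(1, 2)  # (B, 16, dim)
        q = self.queries.expand(img.size(0), -1, -1)
        tokens, _ = self.attn(q, feats, feats)                 # (B, n_tok, dim)
        return tokens

class FrozenDenoiser(nn.Module):
    """Stand-in for a pretrained T2I denoiser eps(x_t, t, cond); its weights
    are frozen, but gradients still flow through `cond` to the tokenizer."""
    def __init__(self):
        super().__init__()
        self.film = nn.Linear(dim, 3)          # conditioning -> channel scales
        self.net = nn.Conv2d(3, 3, 3, padding=1)
        for p in self.parameters():
            p.requires_grad_(False)

    def forward(self, x_t, t, cond):
        scale = self.film(cond.mean(dim=1))[:, :, None, None]
        return self.net(x_t * (1 + scale) + t[:, None, None, None])

tokenizer, denoiser = Tokenizer(), FrozenDenoiser()
opt = torch.optim.Adam(tokenizer.parameters(), lr=1e-4)

img = torch.rand(B, 3, 64, 64)
t = torch.rand(B)                              # toy timestep in [0, 1)
noise = torch.randn_like(img)
x_t = (1 - t)[:, None, None, None] * img + t[:, None, None, None] * noise

# Denoising loss, but only the tokenizer can improve it: the tokens must
# carry enough of the image for the frozen model to reconstruct it.
opt.zero_grad()
loss = nn.functional.mse_loss(denoiser(x_t, t, tokenizer(img)), noise)
loss.backward()
opt.step()
```

The design point the sketch preserves is that gradients pass through the frozen denoiser into the conditioning tokens, so the only way to lower the reconstruction loss is to pack more image information into the tokens themselves.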