
Method Teaches AI Models to Locate Personalized Objects

CAMBRIDGE, Mass., Oct. 31, 2025 — Vision-language models such as GPT-5 often excel at recognizing general objects, like a dog, but they perform poorly at locating personalized objects, such as a specific dog. Researchers from MIT, the MIT-IBM Watson AI Lab, the Weizmann Institute of Science, and other institutions have introduced a training method that teaches vision-language models to localize personalized objects in a scene.

The method uses specially prepared video-tracking data in which the same object is tracked across multiple frames. The researchers designed the dataset so the model must focus on contextual clues to identify the personalized object, rather than relying on knowledge it previously memorized.

When given a few example images showing a personalized object, the retrained model is better able to identify the location of that same object in a new image. Models retrained with the method outperformed state-of-the-art systems at this task. Importantly, the technique leaves the rest of the model’s general abilities intact. The approach could help AI systems track objects across time and aid in the development of AI-driven assistive technologies that help visually impaired users find certain items in a room.

Large language models (LLMs) can excel at learning from context: if fed a few examples of a task, such as addition problems, an LLM can learn to answer new addition problems based on the context provided. Because a vision-language model (VLM) is essentially an LLM with a visual component connected to it, the researchers expected VLMs to inherit the LLM’s in-context learning capabilities. But this was not the case.
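The addition example above can be sketched as a few-shot prompt. This is a generic illustration of in-context learning, not the researchers' setup: the model's weights are never updated, and the task is conveyed entirely by example pairs in the prompt. The prompt format and example count here are assumptions for illustration.

```python
def build_few_shot_prompt(examples, query):
    """Assemble a few-shot addition prompt from (operands, answer) pairs.

    The LLM is expected to infer the task (addition) from the examples
    alone and complete the final, unanswered question.
    """
    lines = [f"Q: {a} + {b} = ?\nA: {ans}" for (a, b), ans in examples]
    lines.append(f"Q: {query[0]} + {query[1]} = ?\nA:")  # left for the model
    return "\n\n".join(lines)

examples = [((2, 3), 5), ((7, 4), 11), ((10, 6), 16)]
prompt = build_few_shot_prompt(examples, (8, 5))
print(prompt)
```

The same pattern applied to a VLM would interleave example images with text, which is exactly the in-context localization ability the researchers found was missing.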

“The research community has not been able to find a black-and-white answer to this particular problem yet. The bottleneck could arise from the fact that some visual information is lost in the process of merging the two components together, but we just don’t know,” said Jehanzeb Mirza, an MIT postdoc and senior author of a paper on this technique.

A new training method teaches vision-language generative AI models to localize a personalized object, like a cat named Snoofkin, in a new scene. Courtesy of MIT.
The researchers set out to improve VLMs’ ability to perform in-context localization, which involves finding a specific object in a new image. They focused on the data used to retrain existing VLMs for a new task, a process called fine-tuning. Typical fine-tuning data are gathered from random sources and depict collections of everyday objects, giving the model little opportunity to learn to locate the same specific object across different scenes.


To fix this problem, the researchers developed a new dataset by curating samples from existing video-tracking data. They cut frames from these videos and structured the dataset so each input would consist of multiple images showing the same object in different contexts, with example questions and answers about its location.

At first, however, the VLMs identified the object using knowledge gained during pretraining, rather than answering based on context clues. To force the models to rely on context, the researchers replaced actual object category names with pseudonyms in the dataset.
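One fine-tuning sample, as described above, might be structured as follows. This is a hypothetical sketch based on the article's description, not the authors' actual schema: the field names, bounding-box format, and the pseudonym "blicket" are all assumptions. Each sample packs several tracked frames of the same object as in-context examples, with the last frame held out as the query.

```python
def make_sample(frame_paths, track_boxes, pseudonym):
    """Build one in-context localization sample from tracked video frames.

    frame_paths: image paths, all showing the same tracked object
    track_boxes: per-frame bounding boxes (x, y, w, h) from the tracker
    pseudonym:   made-up name replacing the real category label, so the
                 model cannot fall back on pretrained category knowledge
    """
    context_frames = frame_paths[:-1]  # examples with known locations
    return {
        "context": [
            {"image": img,
             "question": f"Where is the {pseudonym}?",
             "answer": {"bbox": box}}
            for img, box in zip(context_frames, track_boxes[:-1])
        ],
        "query": {"image": frame_paths[-1],  # model must localize here
                  "question": f"Where is the {pseudonym}?"},
        "target": {"bbox": track_boxes[-1]},  # supervision for fine-tuning
    }

sample = make_sample(
    ["clip_f001.jpg", "clip_f040.jpg", "clip_f090.jpg"],
    [(12, 30, 64, 48), (80, 22, 60, 50), (45, 60, 58, 47)],
    pseudonym="blicket",
)
print(len(sample["context"]))
```

Because the pseudonym carries no pretrained meaning, the only way to answer the query is to match the object's appearance across the context frames, which is precisely the behavior the fine-tuning is meant to instill.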

The researchers also faced challenges in finding the best way to prepare the data: if the frames were too close together, for instance, the background would not change enough to provide data diversity. In the end, fine-tuning VLMs with the new dataset improved accuracy at personalized localization by about 12% on average; when the dataset used pseudonyms, the gains reached 21%. As model size increases, the technique leads to greater performance gains.

In the future, the researchers want to study possible reasons VLMs don’t inherit in-context learning capabilities from their base LLMs. In addition, they plan to explore additional mechanisms to improve the performance of a VLM without the need to retrain it with new data.

The research will be presented at the International Conference on Computer Vision. The paper can be viewed on arXiv (doi.org/10.48550/arXiv.2411.13317).

Published: October 2025
