MIT researchers have developed a framework that enables machines to see the world more like humans do. The artificial-intelligence system for analyzing scenes learns to perceive real-world objects from just a few images, and it perceives scenes in terms of those learned objects.

The team built its framework using probabilistic programming, an AI approach that lets the system cross-check detected objects against input data to determine whether the images recorded by a camera are a likely match to any candidate scene. Probabilistic inference allows the system to infer whether mismatches are likely due to noise or to errors in the scene interpretation that need to be corrected by further processing.

To develop the system, called 3D Scene Perception via Probabilistic Programming (3DP3), the researchers drew on a concept from the early days of AI research: Computer vision can be thought of as the inverse of computer graphics. Where computer graphics generates images from a representation of a scene, computer vision inverts that process to recover the scene from images.

[Figure: 3DP3 (bottom row) infers more accurate pose estimates of objects from input images (top row) than deep-learning systems (middle row). Courtesy of MIT.]

Lead author Nishad Gothoskar and his collaborators made this technique more learnable and scalable by incorporating it into a framework built using probabilistic programming. "Probabilistic programming allows us to write down our knowledge about some aspects of the world in a way a computer can interpret, but at the same time, it allows us to express what we don't know, the uncertainty. So, the system is able to automatically learn from data and also automatically detect when the rules don't hold," said Marco Cusumano-Towner, a co-author of the paper.

To analyze a scene, 3DP3 first learns about the objects in that scene.
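The cross-checking idea described above, scoring how well a candidate scene explains the camera data under an explicit noise model, can be sketched in a toy form. This is not the paper's implementation (3DP3's actual code is not shown here); the one-dimensional "renderer," the Gaussian noise model, and all names are illustrative assumptions.

```python
import math
import random

# Toy setup: a "scene" renders to a 1-D depth profile, and the observed
# data is that rendering plus sensor noise. A candidate scene is scored
# by the Gaussian log-likelihood of the observation, so a large mismatch
# signals an interpretation error rather than mere noise.

NOISE_SIGMA = 0.05  # assumed per-pixel depth noise (meters)

def render(scene):
    """Toy renderer: each object = (start pixel, end pixel, depth)."""
    depth = [2.0] * 16  # background depth
    for (start, end, d) in scene:
        for i in range(start, end):
            depth[i] = min(depth[i], d)  # nearer surfaces occlude farther ones
    return depth

def log_likelihood(observed, scene):
    """Gaussian log-likelihood of observed depths given a candidate scene."""
    rendered = render(scene)
    ll = 0.0
    for o, r in zip(observed, rendered):
        ll += -0.5 * ((o - r) / NOISE_SIGMA) ** 2 \
              - math.log(NOISE_SIGMA * math.sqrt(2 * math.pi))
    return ll

# Simulate an observation from the "true" scene, then compare candidates.
true_scene = [(4, 8, 1.0)]
random.seed(0)
observed = [d + random.gauss(0, NOISE_SIGMA) for d in render(true_scene)]

good = log_likelihood(observed, [(4, 8, 1.0)])   # correct interpretation
bad  = log_likelihood(observed, [(9, 13, 1.0)])  # object misplaced
print(good > bad)  # prints True
```

Small per-pixel deviations cost little under the noise model, while a misplaced object incurs a large likelihood penalty, which is the signal that the scene interpretation, not the sensor, is at fault.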
After being shown only five images of an object, each taken from a different angle, 3DP3 learns the object's shape and estimates the volume it would occupy in space. "If I show you an object from five different perspectives, you can build a pretty good representation of that object. You'd understand its color, its shape, and you'd be able to recognize that object in many different scenes," Gothoskar said.

"This is way less data than deep-learning approaches," added Vikash K. Mansinghka, principal research scientist and leader of the Probabilistic Computing Project. "For example, the DenseFusion neural object detection system requires thousands of training examples for each object type. In contrast, 3DP3 only requires a few images per object, and reports uncertainty about the parts of each object's shape that it doesn't know."

The system generates a graph to represent the scene, in which each object is a node and the lines connecting the nodes indicate which objects are in contact. This enables 3DP3 to produce a more accurate estimate of how the objects are arranged. Typical deep-learning approaches rely on depth images to estimate object poses, but these methods do not produce a graph structure of contact relationships, and the estimates they produce are less accurate.

The team compared 3DP3 with several deep-learning systems, tasking each with estimating the poses of 3D objects in a scene. In nearly all cases, 3DP3 generated more accurate poses than the other models and performed far better when some objects partially obstructed others. And while 3DP3 only needed to see five images of each object, each of the baseline models it outperformed needed thousands of training examples. Used in conjunction with another model, 3DP3 was able to further improve its accuracy.
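The kind of contact graph described above, with objects as nodes and contact relationships as edges, might be represented as follows. The data structures, field names, and the height-only pose are illustrative assumptions, not the paper's actual representation.

```python
# Hypothetical contact graph: each node is an object with a simplified
# pose (here just a base height z and an extent), and each edge records
# that one object rests on another.

scene_graph = {
    "nodes": {
        "table": {"z": 0.00, "height": 0.75},
        "bowl":  {"z": 0.75, "height": 0.10},
        "mug":   {"z": 0.75, "height": 0.12},
    },
    "contacts": [("bowl", "table"), ("mug", "table")],  # (child, support)
}

def supported(graph, obj):
    """True if the object's base sits on the top surface of its support."""
    for child, parent in graph["contacts"]:
        if child == obj:
            p = graph["nodes"][parent]
            top = p["z"] + p["height"]  # height of the supporting surface
            return abs(graph["nodes"][child]["z"] - top) < 1e-6
    return False  # no recorded contact for this object

print(supported(scene_graph, "bowl"))  # prints True
```

Encoding contacts explicitly is what lets a system reason that poses violating them (a bowl hovering above its table) are physically implausible.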
For instance, a deep-learning model might predict that a bowl is floating slightly above a table, but because 3DP3 has knowledge of the contact relationships and can see that this is an unlikely configuration, it is able to make a correction by aligning the bowl with the table.

"I found it surprising to see how large the errors from deep learning could sometimes be — producing scene representations where objects really didn't match with what people would perceive. I also found it surprising that only a little bit of model-based inference in our causal probabilistic program was enough to detect and fix these errors," Mansinghka said. "Of course, there is still a long way to go to make it fast and robust enough for challenging real-time vision systems — but for the first time, we're seeing probabilistic programming and structured causal models improving robustness over deep learning on hard 3D vision benchmarks."

The research was presented at the Conference on Neural Information Processing Systems (https://arxiv.org/abs/2111.00312).
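The bowl-on-the-table correction described above can be sketched as a simple contact-based snap. This is a hedged illustration of the idea, not 3DP3's method: the threshold, function name, and height-only pose are invented for the example.

```python
# Illustrative correction: if a predicted pose leaves an object floating
# a small, implausible distance above its support surface, snap it down
# onto that surface; a large gap may reflect a genuinely different scene.

SNAP_THRESHOLD = 0.05  # assumed: gaps under 5 cm are treated as pose error

def snap_to_support(object_z, support_top_z):
    """Return a corrected base height that enforces the contact relation."""
    gap = object_z - support_top_z
    if 0 < gap < SNAP_THRESHOLD:
        return support_top_z  # align the object with the surface below it
    return object_z           # leave plausible configurations untouched

tabletop = 0.75
predicted_bowl_z = 0.77  # deep-learning estimate: bowl floats 2 cm up
print(snap_to_support(predicted_bowl_z, tabletop))  # prints 0.75
```

The same check also serves as the error detector: only configurations that are near, but not at, contact get corrected, which matches the article's distinction between sensor noise and interpretation errors.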