Probabilistically Programmed Systems See More Like Humans Do

CAMBRIDGE, Mass., Dec. 24, 2021 — MIT researchers have developed a framework to enable machines to see the world more like humans do. The artificial intelligence system for analyzing scenes learns to perceive real-world objects from just a few images. It also perceives scenes in terms of these learned objects.

The team built its framework using probabilistic programming, an AI approach that enables the system to cross-check detected objects against input data to see if the images recorded from a camera are a likely match to any candidate scene. Probabilistic inference allows the system to infer whether mismatches are likely due to noise or to errors in the scene interpretation that need to be corrected by further processing.

To develop the system, which is called 3D Scene Perception via Probabilistic Programming (3DP3), the researchers drew on a concept from the early stages of AI research: Computer vision can be thought of as the inverse of computer graphics. Computer graphics focus on generating images based on the representation of a scene. Computer vision can be seen as the opposite of this process.
This image shows how 3DP3 (bottom row) infers more accurate pose estimates of objects from input images (top row) than deep learning systems (middle row). Courtesy of Massachusetts Institute of Technology.

This image shows how 3DP3 (bottom row) infers more accurate pose estimates of objects from input images (top row) than deep learning systems (middle row). Courtesy of Massachusetts Institute of Technology.

A system called 3DP3 (bottom row) infers more accurate pose estimates of objects from input images (top row) than deep learning systems (middle row). Courtesy of MIT.

Lead author Nishad Gothoskar and his collaborators made this technique more learnable and scalable by incorporating it into a framework built using probabilistic programming.

“Probabilistic programming allows us to write down our knowledge about some aspects of the world in a way a computer can interpret, but at the same time, it allows us to express what we don’t know, the uncertainty. So, the system is able to automatically learn from data and also automatically detect when the rules don’t hold,” said Marco Cusumano-Towner, a co-author of the paper.

To analyze a scene, 3DP3 first learns about the objects in that scene. After being shown only five images of an object, each taken from a different angle, 3DP3 learns the object’s shape and estimates the volume it would occupy in space.

“If I show you an object from five different perspectives, you can build a pretty good representation of that object. You’d understand its color, its shape, and you’d be able to recognize that object in many different scenes,” Gothoskar said.

Edmund Optics - Manufacturing Services 8/24 MR

“This is way less data than deep-learning approaches,” added Vikash K. Mansinghka, principal research scientist and leader of the Probabilistic Computing Project. “For example, the Dense Fusion neural object detection system requires thousands of training examples for each object type. In contrast, 3DP3 only requires a few images per object, and reports uncertainty about the parts of each object’s shape that it doesn’t know.”

The system generates a graph to represent the scene, in which each object is a node and the lines connecting the nodes indicate which objects are in contact. This enables 3DP3 to produce a more accurate estimation of how the objects are arranged. Typical deep-learning approaches rely on depth images to estimate object poses, though these methods do not produce a graph structure of contact relationships. The estimations that they produce are less accurate.

The team compared 3DP3 with several deep-learning systems. It tasked each with estimating the poses of 3D objects in scene. In nearly all cases, 3DP3 generated more accurate poses than other models and performed far better when some objects were partially obstructing others.

Additionally, 3DP3 only needed to see five images of each object, while each of the baseline models that it outperformed needed thousands of objects for training.

Used in conjunction with another model, 3DP3 was able to further improve its accuracy. For instance, a deep-learning model might predict that a bowl is floating slightly above a table, but because 3DP3 has knowledge of the contact relationships and can see that this is an unlikely configuration, it is able to make a correction by aligning the bowl with the table.

“I found it surprising to see how large the errors from deep learning could sometimes be — producing scene representations where objects really didn’t match with what people would perceive. I also found it surprising that only a little bit of model-based inference in our causal probabilistic program was enough to detect and fix these errors,” Mansinghka said. “Of course, there is still a long way to go to make it fast and robust enough for challenging real-time vision systems — but for the first time, we’re seeing probabilistic programming and structured causal models improving robustness over deep learning on hard 3D vision benchmarks.”

The research was presented at the Conference on Neural Information Processing Systems (https://arxiv.org/abs/2111.00312).

Published: December 2021

Glossary

computer vision: Computer vision enables computers to interpret and make decisions based on visual data, such as images and videos. It involves the development of algorithms, techniques, and systems that enable machines to gain an understanding of the visual world, similar to how humans perceive and interpret visual information. Key aspects and tasks within computer vision include: Image recognition: Identifying and categorizing objects, scenes, or patterns within images. This involves training...
machine vision: Machine vision, also known as computer vision or computer sight, refers to the technology that enables machines, typically computers, to interpret and understand visual information from the world, much like the human visual system. It involves the development and application of algorithms and systems that allow machines to acquire, process, analyze, and make decisions based on visual data. Key aspects of machine vision include: Image acquisition: Machine vision systems use various...
neural network: A computing paradigm that attempts to process information in a manner similar to that of the brain; it differs from artificial intelligence in that it relies not on pre-programming but on the acquisition and evolution of interconnections between nodes. These computational models have shown extensive usage in applications that involve pattern recognition as well as machine learning as the interconnections between nodes continue to compute updated values from previous inputs.
computational imaging: Computational imaging refers to the use of computational techniques, algorithms, and hardware to enhance or enable imaging capabilities beyond what traditional optical systems can achieve. It involves the integration of digital processing with imaging systems to improve image quality, extract additional information from captured data, or enable novel imaging functionalities. Principles: Computational imaging combines optics, digital signal processing, and algorithms to manipulate and...

Browse Cameras & Imaging, Lasers, Optical Components, Test & Measurement, and more.