Neural Network Training Method Boosts Vision Capabilities

In the current AI landscape, sequence models have gained considerable popularity for their ability to analyze data and predict what to do next. Platforms like ChatGPT use next-token prediction to anticipate each word (or token) in a sequence to form answers to user questions. Full-sequence diffusion models like Sora can convert words into realistic visuals by successively “denoising” an entire video sequence.

When applied to computer vision and robotics, next-token and full-sequence diffusion models have trade-offs in capability. Next-token models can deliver sequences of varied length, but they generate each step without awareness of desirable states in the future, such as steering the sequence toward a goal 10 tokens away, and therefore require additional mechanisms for long-horizon (long-term) planning. Diffusion models can perform such future-conditioned sampling, but they lack the next-token models’ ability to generate variable-length sequences.
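
As a rough illustration of that difference, the sketch below contrasts an autoregressive loop, which can stop at any length but picks each step without a future target, with a fixed-length denoising loop that refines an entire sequence at once. It is written in Python with PyTorch; the tiny networks, shapes, and step counts are placeholders for illustration, not anything from the MIT work.

```python
import torch

# Hypothetical stand-ins for trained networks; names and shapes are illustrative.
vocab_size, seq_len, dim = 128, 16, 32
next_token_net = torch.nn.Linear(dim, vocab_size)   # maps a token embedding to next-token logits
denoise_net = torch.nn.Linear(dim, dim)             # maps a noisy frame to a (slightly) cleaner one
embed = torch.nn.Embedding(vocab_size, dim)

# Next-token (autoregressive) generation: variable length, but each step is
# chosen without looking ahead to a future goal.
tokens = [torch.tensor(0)]                          # start token
while len(tokens) < seq_len:
    logits = next_token_net(embed(tokens[-1]))
    tokens.append(torch.argmax(logits))             # greedy next step

# Full-sequence diffusion: the whole fixed-length sequence is denoised at once,
# so sampling can be steered toward a desired future state, but the length
# must be chosen up front.
x = torch.randn(seq_len, dim)                       # pure noise for every frame
for _ in range(10):                                 # iterative denoising passes
    x = denoise_net(x)
```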

At MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), researchers have proposed a simple change to the diffusion training scheme that makes this sequence denoising considerably more flexible by combining the strengths of both models. The developed model is called “diffusion forcing.” The name stems from “teacher forcing,” the conventional training scheme that breaks down full sequence generation into the smaller, easier steps of next-token generation.
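
For context, teacher forcing can be sketched roughly as follows; the small GRU model, shapes, and dummy data here are assumptions made for illustration, not the authors’ code. The point is simply that the network always conditions on the ground-truth history and is graded one token at a time.

```python
import torch
import torch.nn.functional as F

# Minimal teacher-forcing sketch: at every position the model sees the
# ground-truth previous tokens and only has to predict the very next one.
vocab_size, dim = 128, 32
model = torch.nn.GRU(dim, dim, batch_first=True)     # any causal sequence model works here
embed = torch.nn.Embedding(vocab_size, dim)
head = torch.nn.Linear(dim, vocab_size)

sequence = torch.randint(0, vocab_size, (1, 17))     # dummy training sequence
inputs, targets = sequence[:, :-1], sequence[:, 1:]  # shift by one position

hidden, _ = model(embed(inputs))                     # ground truth fed in, not the model's own guesses
loss = F.cross_entropy(head(hidden).reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                      # full-sequence generation reduced to next-token steps
```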

The “diffusion forcing” method can sort through noisy data and reliably predict the next steps in a task, helping a robot complete manipulation tasks, for example. In one experiment, it helped a robotic arm rearrange toy fruits into target spots on circular mats despite starting from random positions and visual distractions. Courtesy of MIT CSAIL/Mike Grimmett.

Diffusion forcing found common ground between diffusion models and teacher forcing: both training schemes involve predicting masked (noisy) tokens from unmasked ones. Diffusion models gradually add noise to data, which can be viewed as fractional masking. The MIT researchers’ diffusion forcing method trains neural networks to cleanse a collection of tokens, removing different amounts of noise from each one while simultaneously predicting the next few tokens. The result is a flexible, reliable sequence model that produced higher-quality artificial videos and more precise decision-making for robots and AI agents.
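
A rough sketch of that per-token noising, under assumed shapes, a toy noise schedule, and a placeholder denoiser network (none of which come from the paper), might look like this:

```python
import torch

batch, seq_len, dim, num_levels = 4, 16, 32, 100
denoiser = torch.nn.Linear(dim + 1, dim)              # stand-in for a causal denoising network

clean = torch.randn(batch, seq_len, dim)              # a batch of training sequences
# Independent noise level per token: "fractional masking" instead of all-or-nothing.
k = torch.randint(0, num_levels, (batch, seq_len, 1))
alpha = 1.0 - k.float() / num_levels                  # toy linear noise schedule
noisy = alpha.sqrt() * clean + (1 - alpha).sqrt() * torch.randn_like(clean)

# The network is told each token's noise level and tries to recover all of the
# clean tokens at once, which also covers predicting the (fully noised) future.
pred = denoiser(torch.cat([noisy, k.float() / num_levels], dim=-1))
loss = (pred - clean).pow(2).mean()
loss.backward()
```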

By sorting through noisy data and reliably predicting the next steps in a task, diffusion forcing can help a robot ignore visual distractions while completing manipulation tasks. It can also generate stable and consistent video sequences and even guide an AI agent through digital mazes. The method could enable household and factory robots to generalize to new tasks and improve AI-generated entertainment.

“Sequence models aim to condition on the known past and predict the unknown future, a type of binary masking. However, masking doesn’t need to be binary,” said lead author Boyuan Chen. “With diffusion forcing, we add different levels of noise to each token, effectively serving as a type of fractional masking. At test time, our system can ‘unmask’ a collection of tokens and diffuse a sequence in the near future at a lower noise level. It knows what to trust within its data to overcome out-of-distribution inputs.”
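
One way to picture the test-time behavior Chen describes, again with a placeholder denoiser and an invented noise schedule rather than the authors’ implementation, is a loop that keeps near-future tokens at a lower noise level than distant ones and gradually “unmasks” the whole sequence:

```python
import torch

seq_len, dim = 16, 32
denoiser = torch.nn.Linear(dim + 1, dim)               # same kind of placeholder as above

x = torch.randn(seq_len, dim)                          # start from noise everywhere
# Per-token noise schedule: low noise for imminent steps, high noise far ahead.
k = torch.linspace(0.1, 1.0, seq_len).unsqueeze(-1)    # fraction of full noise per token

for _ in range(10):                                    # iterative partial "unmasking"
    with torch.no_grad():
        x = denoiser(torch.cat([x, k], dim=-1))        # denoise, conditioned on each token's noise level
    k = (k - 0.1).clamp(min=0.0)                       # every token gets gradually cleaner
```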

In several experiments, diffusion forcing excelled at ignoring misleading data to execute tasks while anticipating future actions.

When implemented in a robotic arm, for example, it helped swap two toy fruits across three circular mats, a minimal example of a family of long-horizon tasks that require memory. The researchers trained the robot by controlling it from a distance (teleoperating it) in virtual reality; the robot learned to mimic the user’s movements from its camera feed. Despite starting from random positions and seeing distractions such as a shopping bag blocking the markers, it placed the objects in their target spots.

In each demo, diffusion forcing acted as a full-sequence model, a next-token prediction model, or both. According to Chen, this versatile approach could serve as a powerful backbone for a “world model,” an AI system that can simulate the dynamics of the world by training on billions of internet videos. This would allow robots to perform novel tasks by imagining what they need to do based on their surroundings.

The team is currently looking to scale up their method to larger datasets and the latest transformer models to improve performance. They intend to broaden their work to build a ChatGPT-like robot brain that helps robots perform tasks in new environments without human demonstration.

“With diffusion forcing, we are taking a step to bringing video generation and robotics closer together,” said Vincent Sitzmann, MIT assistant professor and leader of CSAIL’s Scene Representation group. “In the end, we hope that we can use all the knowledge stored in videos on the internet to enable robots to help in everyday life.”

The research will be presented at NeurIPS (www.doi.org/10.48550/arXiv.2407.01392).
