(Presentation PDF) Video Models are Zero-shot Learners and Reasoners

Wiedemer et al. 2025

2. Methods

In NLP, prompting replaced task-specific training or adaptation.
A similar paradigm shift is on the horizon in machine vision, facilitated by video models.

Def.)
- The fraction of generated videos that solved the task
Props.)
- Determined by humans
- \(> 0\): the model possesses the ability to solve the task
- \(\approx 1\): the model reliably solves the problem irrespective of the random seed
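
The definition above is a plain fraction, so it can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: the function names, the example ratings, and the `reliable_at` cutoff used to stand in for "approximately 1" are all assumptions made for the example.

```python
from typing import Sequence


def success_rate(solved: Sequence[bool]) -> float:
    """Fraction of generated videos that a human rater marked as solving the task."""
    if not solved:
        raise ValueError("need at least one rated video")
    return sum(solved) / len(solved)


def interpret(rate: float, reliable_at: float = 0.9) -> str:
    """Map a success rate to the two readings in the notes.

    `reliable_at` is an illustrative cutoff for "approximately 1";
    the notes do not give a number.
    """
    if rate == 0:
        return "no evidence of ability"
    if rate >= reliable_at:
        return "reliable across seeds"
    return "ability present, not reliable"


# Hypothetical human ratings for 12 videos generated from one prompt.
ratings = [True, False, False, True, True, False,
           True, True, False, True, True, True]
rate = success_rate(ratings)
print(f"{rate:.2f} -> {interpret(rate)}")
```

With these made-up ratings, 8 of 12 videos count as solved, so the rate is above 0 (the ability exists) but well short of 1 (it still depends on the seed).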

Veo 3 shows emergent zero-shot perceptual abilities well beyond the training task.
Just like LLMs replaced task-specific NLP models, video models will likely replace most bespoke models in computer vision—once they become sufficiently cheap and reliable.

Desc.)
- Based on the perception of objects…
- Form a model of a visual world
  - i.e.) Principles that govern the world.
    - e.g.) Laws of physics
Props.)
e.g.)
- Physical Characteristics
  - Rigid and soft body dynamics and their surface interactions
  - Flammability
  - Air resistance affecting falling objects
  - Buoyancy
  - Optical phenomena : refraction and reflection
  - Additive / Subtractive color mixing
  - Visual Jenga task : removing objects in a physically plausible order
  - Identifying which objects fit into a backpack
- Abstract Relationships
  - Distinguishing categories
    - e.g.) toys vs laptop
  - Recognizing patterns, generating variations, and parsing larger wholes into parts
  - Maintaining a memory of the world state across time and camera movements

Desc.)
- Based on the perception of objects and the model that defines their relations…
- Meaningfully alter the perceived and modeled world
Tasks)
- Background removal
- Style transfer
- Colorization
- Inpainting
- Outpainting
- Manipulating text elements
- Edit images based on doodle instructions
- Compose scenes from individual components
- Generate novel views of objects and characters
- Smoothly transform one object into another
- Change perspective, lighting, and appearance (e.g., selfie -> professional photograph)
- Simulate object manipulation
- Interpret object affordance
- Draw a shape
- Roll a burrito

Desc.)
- Integrating perception, modeling, and manipulation…
- Reason across dimensions (space and time) over a sequence of manipulation steps
- Applied through chain-of-frames (CoF).
Tasks)
- Generate a valid graph traversal
- Perform visual BFS on a tree
- Complete visual sequences
- Connect matching colors
- Fit shapes into holes
- Sort numbers
- Use tools to accomplish a visual task
- Solve simple Sudokus or visual puzzles
- Solve mazes and navigation tasks
- Extrapolate rules from visual examples

Desc.)
- The changes applied frame-by-frame in a generated video
- This is how video models apply changes across space and time
- Corresponds to the chain-of-thought (CoT) in LLMs

Frame-by-frame video generation parallels chain-of-thought in language models.
Just like chain-of-thought (CoT) enables language models to reason with symbols, a “chain-of-frames” (CoF) enables video models to reason across time and space.
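
As a toy analogy for the idea above, the sketch below treats each intermediate state of a number-sorting task as one "frame", so the solution emerges through a sequence of small visible steps rather than in a single jump. It only illustrates the step-by-step structure. It says nothing about how a video model computes its frames, and the function name `chain_of_frames` and the choice of bubble sort are assumptions made for the example.

```python
from typing import List


def chain_of_frames(values: List[int]) -> List[List[int]]:
    """Return successive states ("frames") of a bubble sort, one swap per frame."""
    state = list(values)
    frames = [list(state)]  # frame 0 is the input, like the prompt image
    swapped = True
    while swapped:
        swapped = False
        for i in range(len(state) - 1):
            if state[i] > state[i + 1]:
                state[i], state[i + 1] = state[i + 1], state[i]
                frames.append(list(state))  # each frame differs from the previous by one swap
                swapped = True
    return frames


frames = chain_of_frames([3, 1, 2])
for t, frame in enumerate(frames):
    print(t, frame)
```

The final frame holds the sorted list, and every earlier frame is an intermediate step, in the same way that the tokens of a chain-of-thought precede the final answer.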