While computer vision models have made incredible strides in static image recognition, they still fall short of human performance on tasks that require understanding complex, dynamic motion. This gap is especially apparent in real-world scenarios where embodied agents operate in complex, motion-rich environments.
Our approach, BrainNRDS (Brain-Neural Representations of Dynamic Stimuli), leverages state-of-the-art video diffusion models to decouple static image representation from motion generation, enabling us to use fMRI brain activity to better understand human responses to dynamic visual stimuli. Conversely, we also demonstrate that information about the brain's representation of motion can enhance optical flow prediction in artificial systems. Our approach yields four main findings: (1) Visual motion, represented as fine-grained, object-level optical flow, can be decoded from brain activity recorded while participants viewed video stimuli; (2) Video encoders outperform image-based models in predicting video-driven brain activity; (3) Brain-decoded motion signals enable realistic video reanimation given only the initial frame of the video; and (4) We extend prior work to achieve full video decoding from video-driven brain activity.
BrainNRDS advances our understanding of how the brain represents spatial and temporal information in dynamic visual scenes. Our findings demonstrate the potential of combining brain imaging with video diffusion models for developing more robust, biologically inspired computer vision systems.
Our method (BrainNRDS) consists of two stages: first, we predict the object-level optical flow of the viewed video from fMRI brain activations and the initial frame of the video. Second, we use the predicted object-level optical flow to reanimate the initial frame with a motion-conditioned video diffusion model, DragNUWA [1]. Our method can also operate on initial frames generated from fMRI activity by other methods and still predicts optical flow more accurately than those methods alone.
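For intuition, below is a minimal sketch (not the authors' released code) of stage one: a decoder that regresses a coarse optical-flow field from fMRI voxels concatenated with features of the initial frame. The module name `FlowDecoder`, the MLP architecture, and all dimensions are illustrative assumptions; stage two (DragNUWA conditioning) is only indicated in a comment.

```python
# Minimal sketch of BrainNRDS stage 1 (assumed architecture, not the authors' code):
# regress a dense optical-flow field from fMRI activity plus initial-frame features.
import torch
import torch.nn as nn

class FlowDecoder(nn.Module):  # hypothetical name, for illustration only
    """Predict an (H, W, 2) optical-flow field from fMRI voxels + frame features."""
    def __init__(self, n_voxels: int, frame_feat_dim: int, h: int = 32, w: int = 32):
        super().__init__()
        self.h, self.w = h, w
        self.mlp = nn.Sequential(
            nn.Linear(n_voxels + frame_feat_dim, 1024),
            nn.GELU(),
            nn.Linear(1024, h * w * 2),  # 2 channels: (dx, dy) per pixel
        )

    def forward(self, fmri: torch.Tensor, frame_feats: torch.Tensor) -> torch.Tensor:
        x = torch.cat([fmri, frame_feats], dim=-1)
        return self.mlp(x).view(-1, self.h, self.w, 2)

if __name__ == "__main__":
    decoder = FlowDecoder(n_voxels=8000, frame_feat_dim=512)
    fmri = torch.randn(4, 8000)        # fMRI responses for 4 video clips
    frame_feats = torch.randn(4, 512)  # features of each clip's initial frame
    flow = decoder(fmri, frame_feats)  # (4, 32, 32, 2) coarse flow field
    # Stage 2 (not implemented here): convert `flow` to motion trajectories and
    # pass them, with the initial frame, to a motion-conditioned video diffusion
    # model such as DragNUWA to reanimate the frame.
    print(flow.shape)
```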
Optical flow predictors trained with neural data (Ours) statistically outperform both generative models trained without neural data (No Brain: Stable Video Diffusion (Best) [2]) and generative models that fail to disentangle appearance and motion (MindVideo (Best) [3]). Our method conditioned on the initial frame generated by MindVideo (Ours + MindVideo (Best)) also predicts optical flow better than MindVideo alone. EPE is end-point error, a standard metric for optical flow evaluation.
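For reference, EPE is the mean Euclidean distance between predicted and ground-truth flow vectors; a minimal NumPy version is sketched below (array shapes are assumptions).

```python
# End-point error (EPE): mean Euclidean distance between predicted and
# ground-truth flow vectors, in pixels.
import numpy as np

def end_point_error(pred_flow: np.ndarray, gt_flow: np.ndarray) -> float:
    """Both arrays have shape (H, W, 2), holding (dx, dy) per pixel."""
    return float(np.mean(np.linalg.norm(pred_flow - gt_flow, axis=-1)))

# Example with two random flow fields.
pred = np.random.randn(64, 64, 2)
gt = np.random.randn(64, 64, 2)
print(f"EPE: {end_point_error(pred, gt):.3f} pixels")
```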
*Qualitative comparisons (video panels): Ground Truth vs. BrainNRDS (Ours, Initial Frame) vs. BrainNRDS (Ours, End to End).*
We analyze the ability of visual encoders (VideoMAE [4] and CLIP ConvNeXt [5]) to predict (encode) fMRI brain activity from visual stimuli. Self-supervised video models, VideoMAE and VideoMAE Large, best predict fMRI brain activity; the brain regions best predicted by VideoMAE Large are shown in Figure 2. Among static image models, semantically supervised models such as CLIP ConvNeXt perform best (consistent with [6]), while VC-1 [7] performs best among embodied AI models.
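One common way to run such an encoding analysis, and a plausible reading of this one, is voxelwise ridge regression from frozen encoder features to fMRI responses, scored by per-voxel correlation on held-out stimuli. The sketch below illustrates that setup with placeholder dimensions and random data; it is not the authors' exact pipeline.

```python
# Voxelwise encoding-model sketch (assumed setup): ridge regression from frozen
# visual-encoder features (e.g., VideoMAE embeddings) to fMRI responses,
# evaluated with per-voxel Pearson correlation on held-out clips.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_train, n_test, feat_dim, n_voxels = 800, 200, 768, 5000

X_train = rng.standard_normal((n_train, feat_dim))  # encoder features per clip
X_test = rng.standard_normal((n_test, feat_dim))
Y_train = rng.standard_normal((n_train, n_voxels))  # fMRI responses per clip
Y_test = rng.standard_normal((n_test, n_voxels))

model = Ridge(alpha=1e3).fit(X_train, Y_train)      # one weight vector per voxel
Y_pred = model.predict(X_test)

# Per-voxel Pearson correlation between predicted and measured responses.
Yp = (Y_pred - Y_pred.mean(0)) / Y_pred.std(0)
Yt = (Y_test - Y_test.mean(0)) / Y_test.std(0)
voxel_r = (Yp * Yt).mean(0)
print("median encoding accuracy (r):", np.median(voxel_r))
```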
@article{yeung2024reanimating,
title={Reanimating Images using Neural Representations of Dynamic Stimuli},
author={Jacob Yeung and Andrew F. Luo and Gabriel Sarch and Margaret M. Henderson and Deva Ramanan and Michael J. Tarr},
year={2024},
eprint={2406.02659},
archivePrefix={arXiv},
primaryClass={q-bio.NC},
url={https://arxiv.org/abs/2406.02659},
}