TRIBE v2 is Meta AI’s new multimodal model that predicts human brain activity from video, audio, and language inputs using large-scale fMRI data, enabling researchers to simulate neural responses without running new brain scans.

Understanding how the human brain reacts to images, sound, and language has been one of the biggest challenges in neuroscience. Traditionally this required expensive brain scans and controlled experiments with human volunteers.

In a significant step forward for computational neuroscience, Meta AI has unveiled TRIBE v2, also known as the Trimodal Brain Encoder. This foundation model is designed to predict how the human brain responds to almost any naturalistic stimulus: sights, sounds, and language. By creating what researchers call a “digital twin” of neural activity, TRIBE v2 promises to transform how scientists study the brain—without needing to scan real people for every new experiment.

The model builds directly on the success of its predecessor, which took first place in the Algonauts 2025 brain-encoding competition. That earlier version, trained on low-resolution functional magnetic resonance imaging (fMRI) data from just four people, already set a new standard. TRIBE v2 scales this approach dramatically. It draws on more than 500 hours of fMRI recordings from over 700 healthy volunteers exposed to a rich mix of images, podcasts, videos, and text. The result is a high-resolution system that predicts brain activity across roughly 70,000 voxels—tiny 3D units whose changes in blood flow and oxygenation serve as an indirect marker of neural activity.

What Is Brain Encoding and Why Does It Matter?

To understand TRIBE v2, it helps to start with the basics of how neuroscientists study the brain. Functional MRI, or fMRI, is a non-invasive imaging technique that measures brain activity indirectly. When a person sees a picture, hears music, or reads words, specific regions of the brain become more active. This activity increases blood flow and changes the magnetic properties of blood, which scanners detect as BOLD (blood-oxygen-level-dependent) signals.
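Because the BOLD response is slow and delayed, encoding models work with a smoothed, lagged version of the stimulus rather than instantaneous neural events. A minimal illustration (not from the TRIBE paper) of how a brief event produces a delayed BOLD response, using the standard double-gamma hemodynamic response function:

```python
import numpy as np
from math import gamma

def hrf(t, a1=6.0, a2=16.0, undershoot_ratio=1 / 6):
    """Canonical double-gamma hemodynamic response function (SPM-style)."""
    g = lambda t, a: t ** (a - 1) * np.exp(-t) / gamma(a)
    return g(t, a1) - undershoot_ratio * g(t, a2)

tr = 1.0                        # sampling interval in seconds
t = np.arange(0, 30, tr)        # 30 s of HRF support
stimulus = np.zeros(60)         # a 60 s run
stimulus[5] = 1.0               # one brief event at t = 5 s

# The measured BOLD signal is (approximately) the event train
# convolved with the HRF: a delayed, blurred copy of the stimulus.
bold = np.convolve(stimulus, hrf(t))[: len(stimulus)]
peak_time = np.argmax(bold) * tr    # peaks ~5 s after the event
```

This delay is why the scanner detects a response several seconds after the stimulus, and why voxel time series are noisy, indirect proxies for the underlying activity.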

Traditional research requires participants to lie still in a scanner while viewing carefully chosen stimuli. Data from these sessions help map which brain areas handle vision, hearing, language, or emotion. However, this process is slow, expensive, and limited. Each new hypothesis needs fresh scans, and results can vary between individuals due to noise in the signals.

Brain-encoding models try to bridge this gap. They use artificial intelligence to learn the relationship between stimuli (what a person sees or hears) and the resulting brain activity. Early models were simple and linear, often limited to one sense at a time. Modern approaches, powered by deep learning, handle complex, real-world inputs like movies or conversations. TRIBE v2 takes this further by integrating three modalities—video, audio, and language—into a single predictive system.
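The simple linear encoders mentioned above are typically ridge regressions from stimulus features to every voxel at once, scored by how well held-out responses are predicted. A toy sketch with synthetic data (all shapes and names are illustrative, not TRIBE's):

```python
import numpy as np

rng = np.random.default_rng(0)
n_time, n_feat, n_vox = 200, 50, 500

# Synthetic stimulus features (e.g., frame embeddings) and voxel responses.
X = rng.standard_normal((n_time, n_feat))
true_W = rng.standard_normal((n_feat, n_vox))
Y = X @ true_W + 0.5 * rng.standard_normal((n_time, n_vox))

# Fit a ridge-regression encoder on the first 150 timepoints (closed form).
alpha = 1.0
Xtr, Xte, Ytr, Yte = X[:150], X[150:], Y[:150], Y[150:]
W = np.linalg.solve(Xtr.T @ Xtr + alpha * np.eye(n_feat), Xtr.T @ Ytr)

# Evaluate with the field's standard metric: voxelwise prediction correlation.
Yhat = Xte @ W
r = np.array([np.corrcoef(Yhat[:, v], Yte[:, v])[0, 1] for v in range(n_vox)])
mean_r = r.mean()   # high here only because the synthetic data is truly linear
```

Real fMRI data is far noisier and nonlinear, which is exactly the gap that deep multimodal encoders like TRIBE v2 aim to close.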

The goal is not just prediction but understanding. A reliable encoder acts like a simulator: feed it a new video, and it outputs what an average human brain would likely do. This computer-based testing accelerates research and reduces the need for new human scanning experiments.

The Algonauts Project: Bridging AI and Neuroscience

TRIBE v2 did not emerge in isolation. It grew out of the Algonauts Project, launched in 2019 to connect biological and artificial intelligence researchers. Its name blends “algorithms” with the Argonauts of Greek myth, casting the brain as the next frontier for computational exploration. Its recurring challenges invite teams worldwide to build the most accurate models of brain responses to naturalistic stimuli—realistic videos and sounds rather than simple lab patterns.

The 2025 Algonauts challenge asked researchers to predict whole-brain fMRI activity while participants watched naturalistic multimodal videos. More than 60 teams competed, exploring architectures ranging from recurrent networks to modern transformer models.

The winning architecture was developed by a team from Meta AI. Their original TRIBE architecture combined pretrained embeddings from vision, audio, and language models and integrated them using a transformer. The approach topped the leaderboard by a clear margin, demonstrating that a unified multimodal architecture could predict brain activity across different individuals and brain regions more accurately than models focused on a single sensory modality.

This success laid the groundwork for TRIBE v2. While the competition model relied on low-resolution data from just four participants, the new version trains on hundreds of subjects and makes predictions at far higher spatial resolution. The Algonauts framework continues to promote open science, with teams sharing code and reports that accelerate progress across the field.

Inside TRIBE v2: A Three-Stage Architecture

At its core, TRIBE v2 follows a clean, three-stage pipeline that turns raw sensory input into predicted brain maps.

First comes tri-modal encoding. The model does not start from scratch. Instead, it uses powerful pretrained embeddings—compact numerical representations—from Meta’s own foundation models. Video features come from V-JEPA, which learns visual patterns efficiently. Audio draws on Seamless Communication models for speech and sound. Language relies on embeddings from Llama 3.1. These encoders capture rich details: edges and motion in video, pitch and rhythm in audio, semantics and syntax in text.

Next is universal integration. A transformer network processes these embeddings together. Transformers excel at handling sequences and relationships across time and modalities. Here, the transformer learns “universal representations”—shared patterns that work regardless of the exact stimulus, task, or person. It accounts for how visual, auditory, and linguistic information blend in the brain, such as when watching a movie with dialogue.

Finally, brain mapping applies a lightweight subject-specific layer. This step translates the universal features into predictions for individual fMRI voxels. Because the mapping is lightweight, the model generalizes well to new people it has never seen, a capability called zero-shot prediction.
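The three stages can be sketched end to end. Everything below is illustrative NumPy pseudocode with made-up shapes, random embeddings standing in for the pretrained V-JEPA, Seamless, and Llama 3.1 features, and a single nonlinear map standing in for the transformer:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100                                     # timepoints in the stimulus
d_vid, d_aud, d_txt, d_shared, n_vox = 64, 32, 48, 128, 70_000

# Stage 1: tri-modal encoding (stand-ins for pretrained embeddings).
video_emb = rng.standard_normal((T, d_vid))
audio_emb = rng.standard_normal((T, d_aud))
text_emb = rng.standard_normal((T, d_txt))
x = np.concatenate([video_emb, audio_emb, text_emb], axis=1)

# Stage 2: universal integration (a tanh-linear map standing in for
# the transformer that learns shared, subject-agnostic representations).
W_shared = rng.standard_normal((x.shape[1], d_shared)) / np.sqrt(x.shape[1])
shared = np.tanh(x @ W_shared)

# Stage 3: lightweight subject-specific readout to ~70,000 voxels.
W_subject = rng.standard_normal((d_shared, n_vox)) / np.sqrt(d_shared)
predicted_bold = shared @ W_subject         # one predicted map per timepoint
```

The design point is that only the small Stage 3 readout is per-subject; swapping it out (or averaging it) is what lets the shared backbone generalize to people it has never seen.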

This design delivers practical advantages. TRIBE v2 handles long-form content like full movies or audiobooks. It predicts activity in near real time on standard hardware. Most importantly, its predictions often match the “average” human brain response more closely than a single noisy fMRI scan from one person.

Compared with earlier versions, v2 offers roughly 70 times higher resolution (70,000 voxels versus about 1,000 cortical parcels) and trains on vastly more data. Performance improves by a factor of two to three on benchmarks involving movies and spoken narratives. The model also follows a scaling pattern familiar from modern AI systems. As training data and compute increase, prediction accuracy improves steadily — similar to the behavior observed in large language models.
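The scaling pattern described above is usually summarized as a power law: prediction error falls roughly as a power of training-set size, which appears as a straight line on log-log axes. A toy illustration of fitting and extrapolating such a law (the numbers are synthetic, not the paper's):

```python
import numpy as np

hours = np.array([10, 50, 100, 250, 500])   # hypothetical fMRI training hours
error = 1.0 * hours ** -0.3                 # synthetic power-law error curve

# A power law y = c * x^a is linear in log-log space, so a straight-line
# fit recovers the exponent a as the slope.
slope, intercept = np.polyfit(np.log(hours), np.log(error), deg=1)

# Extrapolate the fitted law to a larger (hypothetical) dataset.
predicted_error = np.exp(intercept) * 1000 ** slope
```

Whether brain-encoding accuracy keeps following such a law at much larger scales is an empirical question, but it is the same diagnostic used to characterize large language models.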

Key Capabilities and Scientific Validation

TRIBE v2 shines in zero-shot scenarios. Researchers can input a completely new video in English or another language, a novel task, or data from an unseen subject, and the model still produces accurate predictions. This flexibility opens doors for multilingual studies and personalized neuroscience.

The model also supports in-silico experimentation. When tested on classic paradigms—such as face recognition, object selectivity, or semantic processing—it reproduces findings that took decades of real-world studies to establish. For example, it correctly identifies brain regions tuned to places, bodies, faces, speech, or emotions. By extracting interpretable latent features, the model even maps the fine-grained layout of multisensory integration, showing where vision and sound converge.
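An in-silico localizer of this kind boils down to contrasting predicted responses across stimulus categories. A schematic sketch, with a random linear map standing in for the trained encoder and made-up category names:

```python
import numpy as np

rng = np.random.default_rng(0)
n_vox, d = 5_000, 64

def encode(stim_emb, W):
    """Stand-in for a trained encoder: stimulus embeddings -> voxel responses."""
    return stim_emb @ W

W = rng.standard_normal((d, n_vox)) / np.sqrt(d)

# Predicted responses to two stimulus categories (e.g., 20 faces, 20 places).
faces = encode(rng.standard_normal((20, d)), W)
places = encode(rng.standard_normal((20, d)), W)

# Contrast map: voxels whose predicted response prefers faces over places,
# thresholded at two standard deviations of the contrast.
contrast = faces.mean(axis=0) - places.mean(axis=0)
face_selective = np.flatnonzero(contrast > 2 * contrast.std())
```

With a real trained encoder, such contrast maps are what let researchers recover classic selective regions without putting anyone back in the scanner.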

These results are not abstract. An interactive demo lets users upload stimuli and compare true versus predicted brain activity on 3D brain surfaces or inflated views. Heatmaps highlight active regions, and ablation studies confirm that all three modalities contribute meaningfully.

Implications for Research, AI, and Medicine

The arrival of TRIBE v2 could reshape multiple fields. In neuroscience, it reduces the cost and ethical burden of human scanning. Hypotheses about perception, attention, or language can be tested rapidly in simulation before confirming with real participants. Clinical researchers studying disorders like autism, aphasia, or dementia may use digital twins to model individual differences and design targeted therapies.

For artificial intelligence, the model offers a new benchmark. By aligning AI representations with human brain activity, developers can create systems that process information more naturally. This alignment may improve robustness, interpretability, and trustworthiness in AI—goals that overlap with the Algonauts vision.

In medicine, digital brain twins could accelerate drug discovery or brain-computer interface (BCI) design. Imagine simulating how a prosthetic device or new therapy affects neural patterns without invasive trials. Long-term, such models might contribute to personalized mental health tools or educational technologies that adapt to how individual brains learn.

Of course, challenges remain. fMRI signals contain noise and individual variation. Ethical questions arise around data privacy, especially with large cohorts of brain scans. Models like TRIBE v2 must avoid over-generalizing cultural or demographic biases present in training data. Transparency is essential, which is why Meta has released the model weights, code, paper, and demo under a non-commercial CC BY-NC license on platforms like Hugging Face and GitHub.

Looking Ahead

TRIBE v2 signals a broader shift in neuroscience—from fragmented, task-specific studies toward unified predictive models of cognition. Its performance suggests that the scaling laws familiar from modern AI may also apply to brain encoding systems. As datasets expand and architectures improve, future versions could capture finer temporal dynamics and potentially incorporate additional senses such as touch or smell.

For now, TRIBE v2 stands as a powerful tool for curiosity-driven science. It offers researchers, clinicians, and AI engineers a new way to explore the workings of the human mind. By transforming neural activity into something that can be simulated and analyzed computationally, the model moves us closer to understanding intelligence—both biological and artificial.