Perception Encoder Audio-Visual (PE-AV)

PE-AV is a state-of-the-art multimodal model that embeds audio, video, audio-video, and text into a joint embedding space. The model enables powerful cross-modal retrieval and understanding across audio, video, and text modalities.

Model Description

PE-AV is trained using contrastive learning to align audio, video, and text representations in a shared embedding space. The model can encode:

  • Audio only: Extract audio embeddings from audio waveforms
  • Video only: Extract visual embeddings from video frames
  • Audio-Video: Extract joint audio-visual embeddings
  • Text: Extract text embeddings optimized for different modality pairs

Model Variants

We release 6 model checkpoints with varying sizes and capabilities:

Model                   Avg Retrieval   Video frames used
pe-av-small-16-frame    45.2            16 frames
pe-av-base-16-frame     47.0            16 frames
pe-av-large-16-frame    48.2            16 frames
pe-av-small             48.1            all frames
pe-av-base              50.2            all frames
pe-av-large             51.6            all frames

The -16-frame variants sample exactly 16 evenly spaced frames from each video, while the variants without that suffix use all frames and support variable-length videos.
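
One way to pick 16 evenly spaced frames from a clip is sketched below. This is an illustration only: the helper name `evenly_spaced_indices` is hypothetical, and the exact rule used by the released transforms (for example, whether endpoints or segment midpoints are taken) is not stated here, so the midpoint choice is an assumption.

```python
def evenly_spaced_indices(num_frames: int, num_samples: int = 16) -> list[int]:
    """Pick num_samples frame indices spread evenly over a clip of num_frames frames."""
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    # Split the clip into num_samples equal-width segments and take the
    # frame at the midpoint of each one (assumed rule, see lead-in).
    step = num_frames / num_samples
    return [min(num_frames - 1, int(step * (i + 0.5))) for i in range(num_samples)]


# A 160-frame clip yields one index every 10 frames.
indices = evenly_spaced_indices(160)
```

For clips shorter than 16 frames this sketch repeats indices rather than failing, so the output always has a fixed length.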

Quick Start

The model is available in both the transformers and perception_models libraries.

perception_models Usage

import torch
from core.audio_visual_encoder import PEAudioVisual, PEAudioVisualTransform

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model and transform
model = PEAudioVisual.from_config("pe-av-large", pretrained=True).to(device)
transform = PEAudioVisualTransform.from_config("pe-av-large")

video_files = ["video1.mp4", "video2.mp4"]
descriptions = ["description1", "description2"]
audio_files = ["audio1.wav", "audio2.wav"]

# Process inputs and get embeddings
inputs = transform(videos=video_files, text=descriptions, audio=audio_files).to(device)

with torch.inference_mode(), torch.autocast(device.type, dtype=torch.bfloat16):
    outputs = model(**inputs)

# Access different embeddings
audio_embeds = outputs.audio_embeds  # Audio-only embeddings
visual_embeds = outputs.visual_embeds  # Video-only embeddings
audio_visual_embeds = outputs.audio_visual_embeds  # Joint audio-visual embeddings
audio_text_embeds = outputs.audio_text_embeds  # Text embeddings aligned to audio
visual_text_embeds = outputs.visual_text_embeds  # Text embeddings aligned to video
audio_visual_text_embeds = outputs.audio_visual_text_embeds  # Text embeddings aligned to audio-visual
audio_plus_text_embeds = outputs.audio_plus_text_embeds  # Joint audio and text embedding
visual_plus_text_embeds = outputs.visual_plus_text_embeds  # Joint video and text embedding

# Compute the dot product to get their similarities
audio_visual_similarity = audio_embeds @ visual_embeds.T
# When computing similarity against text embeddings, use the
# appropriate text embedding based on the other modality
audio_text_similarity = audio_embeds @ audio_text_embeds.T
video_text_similarity = visual_embeds @ visual_text_embeds.T
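
The similarity matrices above can be turned into retrieval results by ranking, for each query, the candidates by score. The following is a minimal, dependency-free sketch on toy vectors rather than real model outputs. The helper names (`rank_candidates`, `recall_at_k`) are hypothetical, and it assumes the i-th query is paired with the i-th candidate. It is not the evaluation protocol behind the Avg Retrieval numbers.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_candidates(query_embeds, candidate_embeds, k=1):
    """For each query, return the indices of the k highest-scoring candidates."""
    rankings = []
    for q in query_embeds:
        scores = [dot(q, c) for c in candidate_embeds]
        order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        rankings.append(order[:k])
    return rankings


def recall_at_k(rankings):
    """Fraction of queries whose paired candidate (same index) is in its top-k list."""
    hits = sum(1 for i, top in enumerate(rankings) if i in top)
    return hits / len(rankings)


# Toy 2-D vectors standing in for audio_embeds and audio_text_embeds.
audio = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
text = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.6]]

top1 = rank_candidates(audio, text, k=1)
recall = recall_at_k(top1)
```

With real tensors, the same ranking can be obtained with `(audio_embeds @ audio_text_embeds.T).topk(k, dim=-1).indices`.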

Note that you can omit any of the modalities and still use the same forward method; the corresponding embeddings in the output will be None. For example:

inputs = transform(videos=video_files, text=descriptions).to(device)

with torch.inference_mode(), torch.autocast(device.type, dtype=torch.bfloat16):
    outputs = model(**inputs)

audio_embeds = outputs.audio_embeds  # None
visual_embeds = outputs.visual_embeds  # Available
audio_visual_embeds = outputs.audio_visual_embeds  # None
audio_visual_text_embeds = outputs.audio_visual_text_embeds  # None
audio_text_embeds = outputs.audio_text_embeds  # None
visual_text_embeds = outputs.visual_text_embeds  # Available
audio_plus_text_embeds = outputs.audio_plus_text_embeds  # None
visual_plus_text_embeds = outputs.visual_plus_text_embeds  # Available

We also provide methods for directly encoding an individual modality or a combination of modalities:

def encode_video_text(self, input_ids, attention_mask=None)
def encode_audio_text(self, input_ids, attention_mask=None)
def encode_audio_video_text(self, input_ids, attention_mask=None)
def encode_audio(self, input_values, padding_mask=None, input_features=None)
def encode_video(self, pixel_values_videos, padding_mask_videos=None, pe_features=None)
def encode_audio_video(
    self,
    input_values,
    pixel_values_videos,
    padding_mask=None,
    padding_mask_videos=None,
    pe_features=None,  # Optionally re-use pre-computed PE features
    input_features=None,  # Optionally re-use pre-computed audio codec features
)
def encode_audio_plus_text(
    self,
    input_ids,
    input_values,
    attention_mask=None,
    padding_mask=None,
    input_features=None  # Optionally re-use pre-computed audio codec features
)
def encode_video_plus_text(
    self,
    input_ids,
    pixel_values_videos,
    attention_mask=None,
    padding_mask_videos=None,
    pe_features=None,  # Optionally re-use pre-computed PE features
)

transformers Usage

from transformers import PeAudioVideoModel, PeAudioVideoProcessor
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = PeAudioVideoModel.from_pretrained("facebook/pe-av-large")
processor = PeAudioVideoProcessor.from_pretrained("facebook/pe-av-large")

model = model.to(device)

video_files = ["video1.mp4", "video2.mp4"]
descriptions = ["description1", "description2"]
audio_files = ["audio1.wav", "audio2.wav"]

# Process inputs and get embeddings
inputs = processor(
    videos=video_files, text=descriptions, audio=audio_files, return_tensors="pt", padding=True
)

with torch.inference_mode(), torch.autocast(device.type, dtype=torch.bfloat16):
    outputs = model(**inputs.to(device), return_loss=True)

audio_embeds = outputs.audio_embeds  # Audio-only embeddings
video_embeds = outputs.video_embeds  # Video-only embeddings
audio_video_embeds = outputs.audio_video_embeds  # Joint audio-video embeddings
text_audio_video_embeds = outputs.audio_video_text_embeds  # Text embeddings aligned to audio-video
text_audio_embeds = outputs.text_audio_embeds  # Text embeddings aligned to audio
text_video_embeds = outputs.text_video_embeds  # Text embeddings aligned to video
audio_plus_text_embeds = outputs.audio_plus_text_embeds  # Joint audio and text embedding
video_plus_text_embeds = outputs.video_plus_text_embeds  # Joint video and text embedding

# For classification, you can use the logits_* fields of the output
audio_text_preds = outputs.logits_audio_text.sigmoid()

# The overall loss is also available in the output (requires passing return_loss=True)
loss = outputs.loss
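
Applying a sigmoid to the logits scores each pair independently, so several candidates (or none) can be accepted for the same input, unlike a softmax over candidates. The sketch below shows that decision step on made-up numbers. The helper name `predict_matches`, the 0.5 threshold, and the row/column layout of the logits are all assumptions for illustration, not part of the library.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def predict_matches(logits, threshold=0.5):
    """Turn a matrix of pairwise logits into independent yes/no decisions."""
    return [[sigmoid(v) >= threshold for v in row] for row in logits]


# Hypothetical logits: 2 audio clips (rows) x 3 candidate descriptions (columns).
logits = [[2.0, -1.5, -0.1], [-3.0, 0.5, -0.2]]
preds = predict_matches(logits)
```

The threshold would normally be tuned on held-out data for the task at hand.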

We also provide methods for directly encoding an individual modality or a combination of modalities:

def get_text_audio_embeds(self, input_ids, attention_mask=None)

def get_text_video_embeds(self, input_ids, attention_mask=None)

def get_text_audio_video_embeds(self, input_ids, attention_mask=None)

def get_audio_embeds(self, input_values, padding_mask=None)

def get_video_embeds(self, pixel_values_videos, padding_mask_videos=None)

def get_audio_video_embeds(
    self,
    input_values: torch.Tensor,
    pixel_values_videos: torch.Tensor,
    padding_mask: Optional[torch.Tensor] = None,
    padding_mask_videos: Optional[torch.Tensor] = None,
    return_audio_embeds: bool = False,
    return_video_embeds: bool = False,
)

def get_audio_plus_text_embeds(
    self,
    input_ids: torch.Tensor,
    input_values: torch.Tensor,
    attention_mask: Optional[torch.Tensor] = None,
    padding_mask: Optional[torch.Tensor] = None,
)

def get_video_plus_text_embeds(
    self,
    input_ids: torch.Tensor,
    pixel_values_videos: torch.Tensor,
    attention_mask: Optional[torch.Tensor] = None,
    padding_mask_videos: Optional[torch.Tensor] = None,
)

Citation

@misc{vyas2025pushingfrontieraudiovisualperception,
      title={Pushing the Frontier of Audiovisual Perception with Large-Scale Multimodal Correspondence Learning},
      author={Apoorv Vyas and Heng-Jui Chang and Cheng-Fu Yang and Po-Yao Huang and Luya Gao and Julius Richter and Sanyuan Chen and Matt Le and Piotr Dollár and Christoph Feichtenhofer and Ann Lee and Wei-Ning Hsu},
      year={2025},
      eprint={2512.19687},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2512.19687},
}

License

This model is released under the Apache 2.0 license.
