MaskVAT

Video-to-audio generation model with enhanced synchronization

  • Generates sound that matches the scene from visual features
  • Aligns sound onsets with the corresponding visual actions
  • Combines a high-quality universal audio codec covering the full frequency band
  • Built on a sequence-to-sequence masked generative model
  • Balances audio quality, semantic matching, and temporal synchronization
  • Competitive with existing non-codec audio generation models

Product Details

MaskVAT is a video-to-audio (V2A) generation model that uses the visual features of a video to generate realistic sound matching the scene. The model emphasizes alignment between sound onsets and the corresponding visual actions, avoiding unnatural synchronization artifacts. MaskVAT combines a high-quality, full-band universal audio codec with a sequence-to-sequence masked generative model, achieving results competitive with non-codec audio generation models while maintaining high audio quality, semantic matching, and temporal synchronization.
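To make the "masked generative model over codec tokens" idea concrete, the sketch below shows a generic MaskGIT-style iterative unmasking loop: start from a fully masked sequence of discrete audio-codec tokens and commit the most confident predictions over a few refinement steps. This is a toy illustration only, not MaskVAT's actual architecture; the model is replaced by a deterministic stand-in (`toy_predict`), and names such as `MASK`, `VOCAB`, and `visual_feat` are hypothetical.

```python
import random

MASK = -1   # sentinel marking a still-masked token position (hypothetical)
VOCAB = 16  # toy codec vocabulary size, far smaller than a real codec's

def toy_predict(tokens, visual_feat):
    """Stand-in for the seq2seq model: score each masked slot.

    A real V2A model would condition on learned visual features;
    here a seeded RNG produces deterministic pseudo-confidences
    purely for illustration.
    """
    rng = random.Random(visual_feat)
    return {i: (rng.random(), rng.randrange(VOCAB))
            for i, t in enumerate(tokens) if t == MASK}

def masked_decode(length, visual_feat, steps=4):
    """Iteratively unmask audio-codec tokens, MaskGIT-style."""
    tokens = [MASK] * length
    for step in range(steps):
        preds = toy_predict(tokens, visual_feat + step)
        if not preds:
            break
        # Commit only the most confident fraction of predictions
        # each step, so later steps can condition on earlier ones.
        n_keep = max(1, len(preds) // (steps - step))
        best = sorted(preds.items(), key=lambda kv: -kv[1][0])[:n_keep]
        for i, (_, tok) in best:
            tokens[i] = tok
    # Fill any remaining masked positions in a final pass.
    for i, (_, tok) in toy_predict(tokens, visual_feat).items():
        tokens[i] = tok
    return tokens
```

The decoded token sequence would then be fed to the codec's decoder to reconstruct a waveform; conditioning every step on the visual features is what ties sound onsets to the on-screen actions.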