In an increasingly multilingual and interconnected world, spoken language translation (SLT) has moved beyond academic curiosity to become a critical…
Vocos: High-Quality, Real-Time Neural Vocoder Using Fourier Spectra for Efficient Audio Synthesis
If you’re building or evaluating text-to-speech (TTS), voice cloning, or generative audio systems, the choice of neural vocoder can make…
VideoMamba: Efficient Long- and Short-Term Video Understanding Without the Compute Overhead
Video understanding has long been bottlenecked by two competing demands: capturing fine-grained local motion while simultaneously modeling long-range temporal dependencies.…
MoE-LLaVA: High-Performance Vision-Language Understanding with Sparse, Efficient Inference
MoE-LLaVA (Mixture of Experts for Large Vision-Language Models) redefines efficiency in multimodal AI by delivering performance that rivals much larger…
GRUtopia: Scale Embodied AI Development with a City-Scale Simulated Society for General-Purpose Robots
Developing general-purpose robots that can navigate, interact, and manipulate in real-world urban environments remains one of the most demanding challenges…
IMAGDressing: Generate Controllable, High-Fidelity Virtual Outfits Without Retraining Models
Online fashion retailers, digital content studios, and marketing teams increasingly rely on realistic human imagery to showcase garments—but traditional virtual…
MambaOut: High-Accuracy Vision Models Without the Mamba Overhead
The vision community has recently seen a surge in adopting sequence modeling architectures—especially Mamba—for image tasks. Inspired by its linear…
StudioGAN: A Unified, Reproducible Benchmark for Training and Evaluating GANs at Scale
Generative Adversarial Networks (GANs) have long been at the forefront of realistic image synthesis—but using them effectively in research or…
FlexiViT: One Vision Transformer for All Patch Sizes—Deploy Faster or More Accurate Models Without Retraining
Vision Transformers (ViTs) have become a cornerstone of modern computer vision, offering strong performance across a wide range of tasks.…
3D-Speaker-Toolkit: Multimodal Speaker Verification and Diarization with Acoustic, Semantic, and Visual Fusion
Speaker analysis—whether for verifying identity, recognizing who’s speaking, or separating voices in a multi-person conversation—is a fundamental task in speech…