PaperCodex

ESPnet-ST: Open-Source Toolkit for Offline, Simultaneous, and Speech-to-Speech Translation 9641

In an increasingly multilingual and interconnected world, spoken language translation (SLT) has moved beyond academic curiosity to become a critical…

12/26/2025 Simultaneous Speech Translation, Speech-to-Speech Translation, Speech-to-text Translation

Vocos: High-Quality, Real-Time Neural Vocoder Using Fourier Spectra for Efficient Audio Synthesis 1028

If you’re building or evaluating text-to-speech (TTS), voice cloning, or generative audio systems, the choice of neural vocoder can make…

12/26/2025 Audio Synthesis, Neural Vocoding, Speech Generation

VideoMamba: Efficient Long- and Short-Term Video Understanding Without the Compute Overhead 1044

Video understanding has long been bottlenecked by two competing demands: capturing fine-grained local motion while simultaneously modeling long-range temporal dependencies.…

12/26/2025 Action Recognition, Video Understanding, Video-text Retrieval

MoE-LLaVA: High-Performance Vision-Language Understanding with Sparse, Efficient Inference 2282

MoE-LLaVA (Mixture of Experts for Large Vision-Language Models) redefines efficiency in multimodal AI by delivering performance that rivals much larger…

12/26/2025 Multimodal Reasoning, Object Hallucination Reduction, Visual Question Answering

IMAGDressing: Generate Controllable, High-Fidelity Virtual Outfits Without Retraining Models 1314

Online fashion retailers, digital content studios, and marketing teams increasingly rely on realistic human imagery to showcase garments—but traditional virtual…

12/26/2025 Controllable Image Generation, Garment-conditioned Synthesis, Virtual Dressing

MambaOut: High-Accuracy Vision Models Without the Mamba Overhead 2609

The vision community has recently seen a surge in adopting sequence modeling architectures—especially Mamba—for image tasks. Inspired by its linear…

12/26/2025 Efficient Deep Learning, Image Classification, Vision Backbone

StudioGAN: A Unified, Reproducible Benchmark for Training and Evaluating GANs at Scale 3482

Generative Adversarial Networks (GANs) have long been at the forefront of realistic image synthesis—but using them effectively in research or…

12/26/2025 GAN Benchmarking, Generative Modeling, Image Synthesis

FlexiViT: One Vision Transformer for All Patch Sizes—Deploy Faster or More Accurate Models Without Retraining 3276

Vision Transformers (ViTs) have become a cornerstone of modern computer vision, offering strong performance across a wide range of tasks.…

12/22/2025 Image Classification, Image-text Retrieval, Semantic Segmentation

3D-Speaker-Toolkit: Multimodal Speaker Verification and Diarization with Acoustic, Semantic, and Visual Fusion 2643

Speaker analysis—whether for verifying identity, recognizing who’s speaking, or separating voices in a multi-person conversation—is a fundamental task in speech…

12/22/2025 Multimodal Speech Processing, Speaker Diarization, Speaker Verification