PaperCodex

ESPnet-ST: Open-Source Toolkit for Offline, Simultaneous, and Speech-to-Speech Translation

In an increasingly multilingual and interconnected world, spoken language translation (SLT) has moved beyond academic curiosity to become a critical…

12/26/2025 | Simultaneous Speech Translation, Speech-to-Speech Translation, Speech-to-text Translation
Vocos: High-Quality, Real-Time Neural Vocoder Using Fourier Spectra for Efficient Audio Synthesis

If you’re building or evaluating text-to-speech (TTS), voice cloning, or generative audio systems, the choice of neural vocoder can make…

12/26/2025 | Audio Synthesis, Neural Vocoding, Speech Generation
VideoMamba: Efficient Long- and Short-Term Video Understanding Without the Compute Overhead

Video understanding has long been bottlenecked by two competing demands: capturing fine-grained local motion while simultaneously modeling long-range temporal dependencies.…

12/26/2025 | Action Recognition, Video Understanding, Video-text Retrieval
MoE-LLaVA: High-Performance Vision-Language Understanding with Sparse, Efficient Inference

MoE-LLaVA (Mixture of Experts for Large Vision-Language Models) redefines efficiency in multimodal AI by delivering performance that rivals much larger…

12/26/2025 | Multimodal Reasoning, Object Hallucination Reduction, Visual Question Answering
GRUtopia: Scale Embodied AI Development with a City-Scale Simulated Society for General-Purpose Robots

Developing general-purpose robots that can navigate, interact, and manipulate in real-world urban environments remains one of the most demanding challenges…

12/26/2025 | Embodied AI, Robot Navigation, Sim2Real
IMAGDressing: Generate Controllable, High-Fidelity Virtual Outfits Without Retraining Models

Online fashion retailers, digital content studios, and marketing teams increasingly rely on realistic human imagery to showcase garments—but traditional virtual…

12/26/2025 | Controllable Image Generation, Garment-conditioned Synthesis, Virtual Dressing
MambaOut: High-Accuracy Vision Models Without the Mamba Overhead

The vision community has recently seen a surge in adopting sequence modeling architectures—especially Mamba—for image tasks. Inspired by its linear…

12/26/2025 | Efficient Deep Learning, Image Classification, Vision Backbone
StudioGAN: A Unified, Reproducible Benchmark for Training and Evaluating GANs at Scale

Generative Adversarial Networks (GANs) have long been at the forefront of realistic image synthesis—but using them effectively in research or…

12/26/2025 | GAN Benchmarking, Generative Modeling, Image Synthesis
FlexiViT: One Vision Transformer for All Patch Sizes—Deploy Faster or More Accurate Models Without Retraining

Vision Transformers (ViTs) have become a cornerstone of modern computer vision, offering strong performance across a wide range of tasks.…

12/22/2025 | Image Classification, Image-text Retrieval, Semantic Segmentation
3D-Speaker-Toolkit: Multimodal Speaker Verification and Diarization with Acoustic, Semantic, and Visual Fusion

Speaker analysis—whether for verifying identity, recognizing who’s speaking, or separating voices in a multi-person conversation—is a fundamental task in speech…

12/22/2025 | Multimodal Speech Processing, Speaker Diarization, Speaker Verification

Copyright © 2026 PaperCodex.