PaperCodex

ESPnet-SpeechLM: Build Speech Language Models Faster with Unified, Reproducible Workflows 9639

Building speech language models (SpeechLMs)—systems that jointly understand and generate both speech and text—is rapidly becoming essential for next-generation voice…

12/18/2025 | Multimodal Sequence Modeling, Speech Language Modeling, Voice-Driven Agent Development

FinGPT: Open-Source Financial LLMs with Transparent, Global Data Pipelines for Real-World Finance Applications 1284

Large language models (LLMs) are transforming how we interact with data—but in finance, high-quality, domain-specific language models have largely remained…

12/18/2025 | Algorithmic Trading Signal Generation, Domain-Specific Language Model Fine-tuning, Financial Sentiment Analysis

DeepSeek-VL2: High-Performance Vision-Language Understanding with Efficient Mixture-of-Experts Architecture 5072

DeepSeek-VL2 is an open-source, advanced vision-language model (VLM) built on a Mixture-of-Experts (MoE) architecture, engineered for robust multimodal understanding across…

12/18/2025 | Document Understanding, Visual Grounding, Visual Question Answering

OmniSafe: Accelerate Safe Reinforcement Learning Research with a Unified, Modular Framework 1031

Reinforcement learning (RL) holds transformative potential for real-world applications—from autonomous vehicles and surgical robots to industrial control systems. Yet, one…

12/18/2025 | Constrained Policy Optimization, Offline Safe Reinforcement Learning, Safe Reinforcement Learning

EliGen: Achieve Precise Entity-Level Control in AI Image Generation Without Retraining Models 11062

Text-to-image diffusion models have revolutionized creative workflows, but they still struggle with a fundamental limitation: global prompts alone often fail…

12/18/2025 | Controllable Text-to-image Synthesis, Entity-level Image Generation, Region-guided Diffusion Models

Mini-InternVL: Achieve 90% of Multimodal Performance with Just 5% of Model Size for Edge and Consumer Deployments 9328

In an era where multimodal large language models (MLLMs) are rapidly advancing, a critical barrier remains: most high-performing vision-language models…

12/18/2025 | Edge AI, Multimodal Reasoning, Vision-Language Modeling

AnimateDiff: Bring Your Custom AI Image Models to Life—Without Retraining 11796

If you’ve spent time fine-tuning a Stable Diffusion model—perhaps with DreamBooth or LoRA—to generate your ideal character, product mockup, or…

12/18/2025 | Motion Priors Learning, Personalized Animation, Text-to-Video Generation

Seamless: Real-Time, Expressive, and Multilingual Speech Translation for Natural Cross-Language Communication 11720

In today’s globalized world, real-time communication across languages remains a major bottleneck. Traditional speech translation systems often fall short—they output…

12/18/2025 | Multimodal Machine Translation, Speech-to-Speech Translation, Streaming Speech Translation

Tora: Precisely Control Motion in AI-Generated Videos with Trajectory Guidance 1223

Creating videos with predictable, controllable motion has long been a major challenge in generative AI. While recent diffusion models produce…

12/18/2025 | Motion Control, Trajectory-guided Synthesis, Video Generation

Gymnasium: A Standardized, Reproducible Interface for Reinforcement Learning Environments 10396

Reinforcement learning (RL) holds immense promise for solving complex decision-making problems—from robotics and game playing to resource optimization and autonomous…

12/18/2025 | Algorithm Benchmarking, Environment Simulation, Reinforcement Learning