Text-to-image diffusion models have revolutionized creative workflows, but they still struggle with a fundamental limitation: global prompts alone often fail…
Mini-InternVL: Achieve 90% of Multimodal Performance with Just 5% of Model Size for Edge and Consumer Deployments
In an era where multimodal large language models (MLLMs) are rapidly advancing, a critical barrier remains: most high-performing vision-language models…
AnimateDiff: Bring Your Custom AI Image Models to Life—Without Retraining
If you’ve spent time fine-tuning a Stable Diffusion model—perhaps with DreamBooth or LoRA—to generate your ideal character, product mockup, or…
Seamless: Real-Time, Expressive, and Multilingual Speech Translation for Natural Cross-Language Communication
In today’s globalized world, real-time communication across languages remains a major bottleneck. Traditional speech translation systems often fall short—they output…
Tora: Precisely Control Motion in AI-Generated Videos with Trajectory Guidance
Creating videos with predictable, controllable motion has long been a major challenge in generative AI. While recent diffusion models produce…
Gymnasium: A Standardized, Reproducible Interface for Reinforcement Learning Environments
Reinforcement learning (RL) holds immense promise for solving complex decision-making problems—from robotics and game playing to resource optimization and autonomous…
Search-o1: Boost Large Reasoning Models with On-Demand Knowledge Retrieval for Complex Problem Solving
Large reasoning models (LRMs)—such as OpenAI’s o1—excel at multi-step logical reasoning, especially in science, math, and code-related tasks. But they…
ReCamMaster: Reshoot Any Video with New Camera Movements—No 3D Assets or Multi-Camera Setup Needed
Imagine being able to take a single, static video shot on your phone and instantly transform it into a cinematic…
AgentCPM-GUI: On-Device AI Agent for Bilingual Mobile Automation with Reinforcement Fine-Tuning
AgentCPM-GUI is an open-source, on-device large language model (LLM) agent designed to understand smartphone screenshots and autonomously perform user-specified tasks…
UQLM: Detect LLM Hallucinations with Uncertainty Quantification—Confidence Scoring Made Practical
Large Language Models (LLMs) are transforming how we build intelligent applications—from customer service bots to clinical decision support tools. Yet…