In today’s digital workflows, automating interactions with graphical user interfaces (GUIs)—whether on websites, mobile apps, or desktop software—is a high-value…
VideoRAG: Unlock Long-Form Video Understanding with Retrieval-Augmented Generation for AI-Powered Insights 1356
Imagine being able to ask questions like “What did the professor say about quantum entanglement in Lecture 3?” or “Show…
InspireMusic: Generate High-Fidelity, Long-Form Music from Text or Audio with LLM and Super-Resolution 1254
InspireMusic is an open-source framework that redefines what’s possible in AI-powered music generation. By seamlessly integrating a large language model…
StoryDiffusion: Generate Consistent Long-Form Visual Stories from Text Without Retraining Models 6351
Creating visually coherent sequences of images or videos from text prompts has long been a bottleneck in AI-powered storytelling. While…
TikZero: Generate Editable, Precise Scientific Figures from Text—No Paired Training Data Needed 1650
Creating publication-ready scientific diagrams often requires deep familiarity with vector graphics tools or typesetting systems like LaTeX and TikZ. While…
YuE: Open-Source Foundation Model for Full-Length, Lyrics-Aligned Song Generation in Multiple Languages 5810
Creating a coherent, full-length song, with expressive vocals, lyrics that match the melody, and stylistically consistent accompaniment, has long been a…
MMaDA: One Unified Model for Text Reasoning, Multimodal Understanding, and Image Generation 1518
Imagine running a single model that can answer complex reasoning questions, understand images and text together, and generate high-quality images…
WebThinker: Autonomous Web Research for Large Reasoning Models That Need Real-Time, Multi-Source Knowledge Synthesis 1366
In today’s fast-evolving information landscape, even the most advanced large reasoning models (LRMs)—such as OpenAI-o1 or DeepSeek-R1—are constrained by their…
SkyReels-V2: The First Open-Source Model for Infinite-Length, Cinematic-Quality Video Generation 5119
Video generation has seen remarkable progress in recent years, yet most models remain limited to short clips—typically 5 to 10…
rStar2-Agent: A 14B Math Reasoning Model That Outsmarts 671B Models with Smarter, Tool-Aware Agentic Reasoning 1356
In the rapidly evolving landscape of large language models (LLMs), bigger isn’t always better—smarter is. Enter rStar2-Agent, a 14-billion-parameter reasoning…