PaperCodex

MobileAgent: Cross-Platform GUI Automation That Understands and Acts Like a Human 6632

Imagine giving a natural language instruction like “Book a round-trip flight from Beijing to Paris on Skyscanner for September 18–21”…

12/11/2025 | Cross-platform Agent, GUI Automation, Multimodal Reasoning

Moshi: A Real-Time, Full-Duplex Speech-to-Speech Foundation Model for Natural Human-Like Dialogue 9165

Traditional spoken dialogue systems—like those used in virtual assistants or customer service bots—rely on a cascade of disconnected components: voice…

12/11/2025 | Full-duplex Dialogue, Speech-to-speech Generation, Spoken Language Modeling

Spark-TTS: Zero-Shot, Controllable Text-to-Speech with a Single LLM—No Vocoder, No Flow Matching 10840

Overview: In the rapidly evolving landscape of AI-powered speech synthesis, complexity has long been the price of quality. Traditional text-to-speech…

12/11/2025 | Controllable Speech Generation, Text-to-Speech Synthesis, Zero-Shot Voice Cloning

Trae Agent: Resolve Real-World Software Issues with LLM-Powered, Repository-Aware AI Automation 10232

Overview: Software engineering is increasingly becoming a collaboration between humans and intelligent tools. Yet, many developers still face persistent challenges:…

12/11/2025 | LLM-based Agent Reasoning, Repository-level Code Understanding, Software Issue Resolution

Wan: Open-Source, High-Performance Video Generation That Runs on Consumer GPUs 14878

Overview: Video content is no longer a luxury; it’s a necessity. From dynamic marketing campaigns and immersive educational materials to personalized…

12/11/2025 | Image-to-Video Synthesis, Text-to-Video Generation, Video Editing

Step1X-Edit: Open-Source Image Editing That Matches GPT-4o and Gemini2 Flash 1954

Overview: Step1X-Edit is a state-of-the-art open-source framework for general-purpose image editing that delivers performance comparable to leading proprietary models like…

12/11/2025 | Image Editing, Instruction-following Image Generation, Multimodal Reasoning

RLFactory: Plug-and-Play Reinforcement Learning for Multi-Turn LLM Tool Use Without the Complexity 1647

Overview: Training large language models (LLMs) to reliably use external tools over multiple conversation turns is a persistent challenge in…

12/11/2025 | LLM Post-Training, Multi-Turn Agent Training, Reinforcement Learning for Tool Use

EvoAgentX: Automate, Evolve, and Scale Multi-Agent LLM Workflows Without Manual Orchestration 2366

Overview: Building reliable, scalable systems with large language models (LLMs) often involves stitching together multiple agents, tools, and prompts, a process…

12/11/2025 | Agentic Workflows, Evolutionary Optimization, Multi-agent Systems

Agent-S: Automate Any Computer Task Like a Human—With Precision, Planning, and Cross-Platform Generalization 8663

Overview: Imagine an AI agent that can sit at your computer, look at the screen, understand what it sees, and…

12/11/2025 | Computer Use Agent, GUI Automation, Multimodal Reasoning

InstantCharacter: Generate Consistent, High-Fidelity Character Images from a Single Photo—No Fine-Tuning Required 1044

Creating personalized, visually consistent characters is a common need across gaming, animation, virtual avatars, and digital storytelling—but until recently, doing…

12/11/2025, 12/15/2025 | Character Personalization, Diffusion Transformer Adaptation, Text-to-Image Generation