PaperCodex

MobileSAM: Ultra-Fast, Lightweight Image Segmentation for Real-World Applications

MobileSAM is a streamlined, high-performance variant of Meta’s groundbreaking Segment Anything Model (SAM), engineered to deliver the same powerful segmentation…

12/18/2025 Image Segmentation, Promptable Segmentation, Zero-shot Object Detection

Show-o: One Unified Transformer for Multimodal Understanding and Generation Across Text, Images, and Videos

In today’s AI landscape, developers and researchers often juggle separate models for vision, language, and video—each with its own architecture,…

12/18/2025 Image Generation, Multimodal Understanding, Video Understanding

CleanRL: Readable, Reproducible, and Research-Ready Deep Reinforcement Learning in a Single File

If you’ve ever tried to understand how a deep reinforcement learning (DRL) algorithm truly works—only to get lost in layers…

12/18/2025 Algorithm Prototyping, Deep Reinforcement Learning, Reproducible Research

AudioGPT: Build Spoken AI Experiences with Speech, Music, Sound, and Talking Head Generation in One Unified System

AudioGPT is a multimodal AI system that bridges the gap between large language models (LLMs) like ChatGPT and the rich…

12/18/2025 Audio Generation, Multimodal AI, Speech Synthesis

IterResearch: Break Through Long-Horizon Reasoning Limits with Markovian State Reconstruction

Long-horizon reasoning is one of the toughest challenges in current AI agent development. Traditional agentic systems, which rely on steadily…

12/18/2025 Agentic Search, Iterative Deep Research, Long-horizon Reasoning

CARAFE: Boost Dense Prediction Accuracy with Content-Aware, Lightweight Feature Upsampling

Feature upsampling is a critical but often overlooked component in modern computer vision pipelines. Whether you’re building an object detector,…

12/18/2025 Instance Segmentation, Object Detection, Semantic Segmentation

MNN: Run Large Language Models and Vision AI Offline on Mobile with a Lightweight, High-Performance Inference Engine

Mobile Neural Network (MNN) is an open-source, lightweight deep learning inference engine developed by Alibaba Group to bring powerful AI…

12/18/2025 Large Language Model Deployment, Multimodal AI, On-device Inference

The Well: 15TB of Diverse Physics Simulations for Training and Benchmarking Surrogate Models in Scientific Machine Learning

If you’re working on machine learning models that aim to emulate or accelerate physics-based simulations—whether in fluid dynamics, astrophysics, or…

12/18/2025 Scientific Machine Learning, Spatiotemporal Physics Simulation, Surrogate Modeling

FastVLM: High-Resolution Vision-Language Inference with 85× Faster Time-to-First-Token and Minimal Compute Overhead

Vision Language Models (VLMs) are increasingly central to real-world applications—from mobile assistants that read documents to AI systems that interpret…

12/18/2025 Document Understanding, On-Device Multimodal Inference, Vision-Language Modeling

Step-Audio: Unified Speech Understanding and Generation for Real-World Voice Applications

Building intelligent voice interfaces used to mean stitching together separate speech recognition (ASR), text generation, and text-to-speech (TTS) systems—each with…

12/18/2025 Multimodal Language Modeling, Speech Generation, Speech Understanding