MobileSAM is a streamlined, high-performance variant of Meta’s groundbreaking Segment Anything Model (SAM), engineered to deliver the same powerful segmentation…
Show-o: One Unified Transformer for Multimodal Understanding and Generation Across Text, Images, and Videos
In today’s AI landscape, developers and researchers often juggle separate models for vision, language, and video—each with its own architecture,…
CleanRL: Readable, Reproducible, and Research-Ready Deep Reinforcement Learning in a Single File
If you’ve ever tried to understand how a deep reinforcement learning (DRL) algorithm truly works—only to get lost in layers…
AudioGPT: Build Spoken AI Experiences with Speech, Music, Sound, and Talking Head Generation in One Unified System
AudioGPT is a multimodal AI system that bridges the gap between large language models (LLMs) like ChatGPT and the rich…
IterResearch: Break Through Long-Horizon Reasoning Limits with Markovian State Reconstruction
Long-horizon reasoning is one of the toughest challenges in current AI agent development. Traditional agentic systems, which rely on steadily…
CARAFE: Boost Dense Prediction Accuracy with Content-Aware, Lightweight Feature Upsampling
Feature upsampling is a critical but often overlooked component in modern computer vision pipelines. Whether you’re building an object detector,…
MNN: Run Large Language Models and Vision AI Offline on Mobile with a Lightweight, High-Performance Inference Engine
Mobile Neural Network (MNN) is an open-source, lightweight deep learning inference engine developed by Alibaba Group to bring powerful AI…
The Well: 15TB of Diverse Physics Simulations for Training and Benchmarking Surrogate Models in Scientific Machine Learning
If you’re working on machine learning models that aim to emulate or accelerate physics-based simulations—whether in fluid dynamics, astrophysics, or…
FastVLM: High-Resolution Vision-Language Inference with 85× Faster Time-to-First-Token and Minimal Compute Overhead
Vision Language Models (VLMs) are increasingly central to real-world applications—from mobile assistants that read documents to AI systems that interpret…
Step-Audio: Unified Speech Understanding and Generation for Real-World Voice Applications
Building intelligent voice interfaces used to mean stitching together separate speech recognition (ASR), text generation, and text-to-speech (TTS) systems—each with…