When building real-world computer vision systems—whether for autonomous drones, industrial inspection, or mobile apps—one of the toughest trade-offs is between…
3DDFA_V2: Real-Time, CPU-Efficient 3D Face Alignment for Video and Edge Applications
If you’re building applications that require real-time 3D facial understanding—like video conferencing enhancements, augmented reality filters, biometric verification, or character…
Bunny: High-Performance Multimodal AI Without the Heavy Compute Burden
Multimodal Large Language Models (MLLMs) are transforming how machines understand and reason about visual content. Yet, their adoption remains out…
Step-Video-T2V: Generate High-Quality, Long-Form Videos from Text in English and Chinese
Step-Video-T2V is a state-of-the-art open-source text-to-video foundation model developed by StepFun AI. With 30 billion parameters and the ability to…
GCNet: Boost Vision Models with Lightweight Global Context for Better Accuracy and Efficiency
If you’ve worked on computer vision tasks like object detection or instance segmentation, you’ve likely encountered the challenge of modeling…
GCOPTER: Real-Time, High-Fidelity Multicopter Trajectory Planning with Geometric and Dynamic Constraints
Autonomous multicopters—whether used in drone racing, delivery, inspection, or swarm coordination—face a persistent challenge: generating trajectories that are simultaneously smooth,…
LightningDiT: Break the Reconstruction-Generation Trade-Off with 21.8x Faster, SOTA Image Diffusion
Latent diffusion models (LDMs) have become a cornerstone of modern high-fidelity image generation. However, a persistent challenge has limited their…
PRIME: Boost LLM Reasoning with Token-Level Rewards—No Step-by-Step Labels Needed
If you’re working to improve large language models (LLMs) on hard reasoning tasks—like math problem solving or competitive programming—you’ve likely…
GANformer: Compositional, Controllable Image Generation with Fewer Training Steps
Traditional generative adversarial networks (GANs) often act as “black boxes”—they produce compelling images but offer little insight into how those…
FlagEmbedding: High-Performance, Task-Aware Text Embeddings for Multilingual RAG and Semantic Search
Modern AI applications—from customer support chatbots to enterprise knowledge retrieval—rely heavily on high-quality text embeddings to power semantic search and…