PaperCodex
Chinese CLIP: Enable Zero-Shot Chinese Vision-Language AI Without Custom Training


Multimodal AI models like OpenAI’s CLIP have transformed how developers build systems that understand both images and text. But there’s…

12/27/2025 · Cross-Modal Retrieval, Vision-Language Pretraining, Zero-Shot Image Classification
XLNet: Bidirectional Language Understanding Without Masked Input Limitations


XLNet is a breakthrough in language modeling that effectively bridges the gap between autoregressive (AR) and autoencoding (AE) pretraining paradigms.…

12/27/2025 · Question Answering, Reading Comprehension, Text Classification
Qwen-VL: Open-Source Vision-Language AI for Text Reading, Object Grounding, and Multimodal Reasoning


In the rapidly evolving landscape of multimodal artificial intelligence, developers and technical decision-makers need models that go beyond basic image…

12/27/2025 · Multimodal Reasoning, Vision-Language Modeling, Visual Question Answering
NeMo: Build Production-Grade Speech, LLM, and Multimodal AI Faster with NVIDIA’s Optimized Framework


NVIDIA NeMo is a cloud-native, open-source framework designed for developers, research engineers, and technical decision-makers who need to build, customize,…

12/27/2025 · Automatic Speech Recognition, Large Language Models, Multimodal Learning
MetaCLIP: Superior Vision-Language Models Through Transparent, High-Quality Data Curation


If you’ve worked with OpenAI’s CLIP, you know its power—but also its opacity. CLIP revolutionized zero-shot vision-language understanding, yet it…

12/27/2025 · Contrastive Learning, Multilingual Vision-Language Modeling, Zero-Shot Image Classification
SPIN: Boost Your LLM’s Performance Without New Human Annotations—Just Use Self-Play Fine-Tuning


Imagine you’ve fine-tuned a language model using a standard Supervised Fine-Tuning (SFT) dataset—like Zephyr-7B on UltraChat—but you don’t have access…

12/27/2025 · Language Model Alignment, Preference-Free Optimization, Self-Supervised Fine-Tuning
RFBNet: High-Accuracy, Real-Time Object Detection Without Heavy Backbones


When building real-world computer vision systems—whether for autonomous drones, industrial inspection, or mobile apps—one of the toughest trade-offs is between…

12/27/2025 · Edge AI, Object Detection, Real-Time Inference
3DDFA_V2: Real-Time, CPU-Efficient 3D Face Alignment for Video and Edge Applications


If you’re building applications that require real-time 3D facial understanding—like video conferencing enhancements, augmented reality filters, biometric verification, or character…

12/27/2025 · 3D Face Alignment, Dense Facial Landmark Estimation, Real-Time Face Tracking
Bunny: High-Performance Multimodal AI Without the Heavy Compute Burden


Multimodal Large Language Models (MLLMs) are transforming how machines understand and reason about visual content. Yet, their adoption remains out…

12/27/2025 · Efficient Inference, Multimodal Reasoning, Vision-Language Modeling
Step-Video-T2V: Generate High-Quality, Long-Form Videos from Text in English and Chinese


Step-Video-T2V is a state-of-the-art open-source text-to-video foundation model developed by StepFun AI. With 30 billion parameters and the ability to…

12/27/2025 · Multimodal Foundation Models, Text-to-Video Generation, Video Diffusion Models

Copyright © 2026 PaperCodex.