PaperCodex

Meta-Transformer: One Unified Model for 12 Modalities—No Paired Data Needed 1644

In today’s AI landscape, building systems that understand multiple types of data—text, images, audio, video, time series, and more—is increasingly…

12/17/2025 Foundation Model, Multimodal Learning, Representation Learning

MergeKit: Build Powerful, Multitask LLMs by Merging Models—No Retraining Needed 6574

In today’s fast-moving landscape of open-source large language models (LLMs), developers and researchers are increasingly faced with a dilemma: dozens…

12/17/2025 Model Merging

MedRAX: Unified AI Agent for Complex Chest X-ray Reasoning Without Retraining 1048

In clinical radiology, interpreting chest X-rays (CXRs) demands more than just identifying abnormalities—it requires synthesizing visual findings, clinical context, patient…

12/17/2025 Chest X-ray Interpretation, Medical Image Reasoning, Multimodal Clinical AI

HierSpeech++: Human-Level Zero-Shot Speech Synthesis with Fast Inference and High Fidelity 1232

In the rapidly evolving field of speech synthesis, achieving natural-sounding, speaker-consistent voice generation without speaker-specific training data has long been…

12/17/2025 Speech Super-Resolution, Voice Conversion, Zero-shot Text-to-Speech

FlashRAG: A Modular, Lightweight Toolkit for Reproducible and Efficient Retrieval-Augmented Generation Research 3208

Retrieval-Augmented Generation (RAG) has emerged as a cornerstone technique for enhancing the factual grounding, knowledge scope, and reasoning capabilities of…

12/17/2025 Multimodal RAG, Reasoning-Augmented QA, Retrieval-Augmented Generation

HunyuanVideo: Open-Source, High-Fidelity Video Generation That Rivals Closed Models 11437

HunyuanVideo is a groundbreaking open-source video foundation model developed by Tencent, designed to deliver professional-grade video generation capabilities without the…

12/17/2025 Image-to-video Generation, Multimodal Video Synthesis, Text-to-Video Generation

FireRedASR: Industrial-Grade Mandarin Speech Recognition with SOTA Accuracy and LLM Integration 1658

FireRedASR is an open-source, industrial-grade automatic speech recognition (ASR) system specifically engineered for Mandarin Chinese—but with strong capabilities in Chinese…

12/17/2025 Automatic Speech Recognition, LLM-Integrated Speech Processing, Multilingual ASR

UltraRAG: Build Adaptive, Multimodal RAG Systems Without Writing Complex Code 2325

Retrieval-Augmented Generation (RAG) has become a cornerstone technique for grounding large language models (LLMs) in real-world knowledge. However, building effective…

12/16/2025 Adaptive Knowledge Integration, Multimodal Reasoning, Retrieval-Augmented Generation

HunFlair: State-of-the-Art Biomedical Named Entity Recognition with Just Four Lines of Code 14333

Biomedical text is dense with critical information—gene names, chemical compounds, diseases, species—but extracting that information manually is time-consuming and error-prone.…

12/15/2025 Biomedical Text Mining, Named Entity Recognition, Sequence Labeling