PaperCodex

Distributed Deep Learning

Elixir: Train Large Language Models Efficiently on Small GPU Clusters Without Expert-Level Tuning

Training large language models (LLMs) has traditionally been the domain of well-resourced AI labs with access to massive GPU clusters…

12/26/2025 | Distributed Deep Learning, Large Language Model Training, Memory-efficient Training

Megatron-LM: Train Billion-Parameter Transformer Models Efficiently on NVIDIA GPUs at Scale

If you’re building or scaling large language models (LLMs) and have access to NVIDIA GPU clusters, Megatron-LM—developed by NVIDIA—is one…

12/26/2025 | Distributed Deep Learning, Large Language Model Training, Mixture-of-Experts

Colossal-Auto: Automate Large Model Training with Zero Expertise in Parallelization or Checkpointing

Training large-scale AI models—whether language models like LLaMA or video generators like Open-Sora—has become increasingly common, yet remains bottlenecked by…

12/18/2025 | Distributed Deep Learning, Large Language Model Training, Video Generation Model Training

Copyright © 2026 PaperCodex.