YuE: Open-Source Foundation Model for Full-Length, Lyrics-Aligned Song Generation in Multiple Languages

Paper: “YuE: Scaling Open Foundation Models for Long-Form Music Generation” (2025) · Code: multimodal-art-projection/YuE

Creating a complete, coherent song with expressive vocals, lyrics that match the melody, and stylistically consistent accompaniment has long been a frontier challenge in AI music generation. Most existing tools either produce short instrumental loops or generate vocals that drift out of sync with the lyrics over time. Enter YuE, an open-source foundation model specifically engineered to solve the lyrics-to-song problem at scale. Built on the LLaMA2 architecture and trained on trillions of music tokens, YuE generates up to five minutes of structured, multi-track music while preserving lyrical alignment, genre fidelity, and vocal expressiveness across languages including English, Mandarin, Cantonese, Japanese, and Korean.

Unlike proprietary systems locked behind APIs or restrictive licenses, YuE is released under the Apache 2.0 license, enabling researchers, indie creators, and developers to use, modify, fine-tune, and even monetize outputs—provided they credit “YuE by HKUST/M-A-P.” This openness, combined with state-of-the-art musical performance, positions YuE as a strategic alternative for anyone seeking controllable, transparent, and production-ready AI music generation.

Why Full-Song Coherence Matters—and How YuE Delivers It

The Challenge of Long-Form Music Generation

Most AI music models excel at generating 10–30 second clips but falter when asked to sustain structure, emotion, and lyrical relevance across multiple verses and choruses. Without explicit modeling of musical form, outputs often suffer from:

  • Lyrical drift: Words no longer align with melodic phrasing.
  • Structural collapse: Repetitive or incoherent transitions between sections.
  • Timbral inconsistency: Vocals or instruments change character mid-song.

YuE directly addresses these issues through three core innovations:

1. Track-Decoupled Next-Token Prediction

Traditional models treat music as a single dense audio stream, mixing vocals and instruments into an entangled signal. YuE instead decouples vocal and instrumental tracks during token prediction, enabling clearer separation and more precise control over each component.
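
As a rough mental model, the sketch below shows what frame-level interleaving of two token streams could look like. It is illustrative only and does not reproduce YuE's actual tokenizer, codebooks, or sequence layout.

```python
# Toy illustration of track-decoupled token prediction: instead of tokenizing a
# mixed signal, keep per-frame vocal and instrumental tokens separate and
# interleave them, so one autoregressive model predicts both tracks in turn.
# Token values and frame layout are made up for the example.

def interleave_tracks(vocal_tokens: list[int], inst_tokens: list[int]) -> list[int]:
    """Alternate vocal/instrumental tokens frame by frame: v0, i0, v1, i1, ..."""
    if len(vocal_tokens) != len(inst_tokens):
        raise ValueError("tracks must be frame-aligned")
    interleaved: list[int] = []
    for v, i in zip(vocal_tokens, inst_tokens):
        interleaved.extend([v, i])
    return interleaved

vocal = [101, 102, 103, 104]   # hypothetical codec tokens for the vocal track
inst = [201, 202, 203, 204]    # hypothetical codec tokens for the accompaniment
print(interleave_tracks(vocal, inst))  # [101, 201, 102, 202, 103, 203, 104, 204]
```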

2. Structural Progressive Conditioning

To maintain long-range coherence, YuE uses explicit structural labels (e.g., [verse], [chorus], [bridge]) in the input prompt. These labels condition the model progressively, ensuring each section respects its musical role while flowing naturally into the next.
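
For illustration, here is how a structurally labeled prompt can be segmented by those labels. The [verse]/[chorus] labels and the blank line between sections follow the prompt format described later in this article; the parsing code is just a toy reader, not part of YuE, and the lyrics are placeholders.

```python
# Toy reader: split a labeled lyrics prompt into (section label, lyric lines).
import re

lyrics = """[verse]
Neon rivers run beneath the midnight train
I trace your name in window condensation

[chorus]
Hold on, hold on, the city sings our song
Every light a heartbeat keeping us along"""

sections = []
for block in re.split(r"\n\s*\n", lyrics.strip()):
    label, _, body = block.partition("\n")           # first line is the [label]
    sections.append((label.strip("[] "), body.splitlines()))

for label, lines in sections:
    print(f"{label}: {len(lines)} lines")             # verse: 2 lines / chorus: 2 lines
```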

3. Multi-Phase, Multi-Task Pre-Training

YuE’s training pipeline includes distinct phases that first build general music understanding, then specialize in lyrical alignment, genre adaptation, and vocal expressiveness. This staged approach enables robust generalization across languages and styles without catastrophic forgetting.

Real-World Capabilities That Solve Creative and Technical Pain Points

Generate Full Songs from Lyrics—No Studio Required

Provide YuE with structured lyrics (segmented by section) and genre tags like female pop uplifting bright vocal electronic, and it outputs a complete stereo song with synchronized vocals and accompaniment—ready for editing or direct use. This eliminates the need for manual composition, vocal recording, or DAW expertise.

Style Transfer and Voice Cloning via In-Context Learning (ICL)

YuE supports in-context learning using reference audio. Give it a 30-second clip of Japanese city pop, and it can generate an English rap in the same harmonic and rhythmic style—while keeping your custom lyrics. Two ICL modes are available:

  • Single-track ICL: Provide a mixed, vocal-only, or instrumental reference.
  • Dual-track ICL: Supply separate vocal and instrumental tracks for higher fidelity and better style retention—a feature unmatched by most competitors.

This enables powerful applications like genre remixing, vocal timbre transfer, or adaptive soundtrack generation for games and films.
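
As a sketch of a single-track ICL run, the call below mirrors the ICL example documented in the repository at the time of writing; the flag names, the *-icl checkpoint name, and all file paths are assumptions to verify against the current infer.py. For dual-track ICL, the repository documents separate vocal and instrumental prompt paths in place of the single mixed reference.

```python
# Hedged sketch of a single-track ICL run via the repo's infer.py.
# Flag names and the ICL checkpoint name follow the repository's published
# examples at the time of writing and may have changed; paths are placeholders.
import subprocess

subprocess.run(
    [
        "python", "infer.py",
        "--stage1_model", "m-a-p/YuE-s1-7B-anneal-en-icl",  # ICL-annealed stage-1 model (assumed name)
        "--stage2_model", "m-a-p/YuE-s2-1B-general",
        "--genre_txt", "genre.txt",
        "--lyrics_txt", "lyrics.txt",
        "--run_n_segments", "2",
        "--output_dir", "./output",
        # ICL-specific options: condition on a 30-second slice of a reference track.
        "--use_audio_prompt",
        "--audio_prompt_path", "reference/japanese_city_pop.mp3",  # placeholder file
        "--prompt_start_time", "0",
        "--prompt_end_time", "30",
    ],
    check=True,
)
```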

Multilingual and Genre-Agnostic by Design

YuE handles multiple languages natively, with dedicated support for tonal languages like Mandarin and Cantonese. Its open-vocabulary tagging system allows mixing descriptors like male soulful jazz melancholic acoustic to precisely sculpt mood, instrumentation, and vocal character.

Getting Started: Simple Workflow, Powerful Results

You don’t need a PhD in machine learning to use YuE. The recommended workflow is:

  1. Prepare your input:

    • Write lyrics segmented by structure (e.g., [verse], [chorus]), with 2 newlines between sections.
    • Create a genre tag string using 3–5 descriptors (e.g., female indie folk warm acoustic).
  2. Choose your generation mode:

    • Chain-of-Thought (CoT) mode: For diverse, original outputs from lyrics only.
    • ICL mode: For style-controlled generation using reference audio.
  3. Run inference:
    Use the provided infer.py script with pre-trained checkpoints like YuE-s1-7B-anneal-en-cot (for English lyrics) and YuE-s2-1B-general (for audio refinement). Community tools like YuE-UI (Gradio), YuE-extend (Colab), and Pinokio (Windows one-click) lower the entry barrier further. A minimal end-to-end sketch follows this list.

  4. Fine-tune if needed:
    LoRA support allows efficient adaptation to niche genres, voices, or languages without full retraining.
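
Putting the steps together, here is a minimal end-to-end sketch of a CoT run. The checkpoint names come from step 3 above (with the m-a-p/ Hugging Face prefix assumed), and the infer.py flag names follow the repository's documented examples at the time of writing; confirm them against the current script before relying on them.

```python
# Minimal end-to-end CoT sketch: write the two input files, then call infer.py.
# Flag names and the "m-a-p/" model prefix are assumptions based on the repo's
# published examples; check `python infer.py --help` for the current interface.
import pathlib
import subprocess

# 1. Genre tags: 3-5 open-vocabulary descriptors on one line.
pathlib.Path("genre.txt").write_text("female indie folk warm acoustic\n")

# 2. Lyrics: labeled sections separated by a blank line (two newlines).
pathlib.Path("lyrics.txt").write_text(
    "[verse]\n"
    "Morning light spills over sleeping rooftops\n"
    "I pack the years into a paper suitcase\n"
    "\n"
    "[chorus]\n"
    "Take me home where the tall grass bends\n"
    "Sing it slow until the highway ends\n"
)

# 3. Stage 1 (lyrics-to-song tokens) + Stage 2 (audio refinement) in one call.
subprocess.run(
    [
        "python", "infer.py",
        "--stage1_model", "m-a-p/YuE-s1-7B-anneal-en-cot",
        "--stage2_model", "m-a-p/YuE-s2-1B-general",
        "--genre_txt", "genre.txt",
        "--lyrics_txt", "lyrics.txt",
        "--run_n_segments", "2",   # keep the first test short
        "--output_dir", "./output",
    ],
    check=True,
)
```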

Practical Considerations: Hardware, Speed, and Best Practices

While powerful, YuE has realistic constraints:

  • GPU Memory:

    • Minimum: 24GB VRAM (e.g., RTX 4090) for 1–2 song segments.
    • Recommended: 80GB+ (e.g., A100/H800) for batch generation or full 5-minute songs.
    • Workarounds: Quantized versions (YuE-exllamav2, YuEGP) enable inference on 8GB GPUs, though with slight quality trade-offs.
  • Generation Speed:

    • ~6 minutes per 30 seconds of audio on RTX 4090.
    • ~2.5 minutes on H800.
  • Prompt Engineering Tips:

    • Avoid overloading a single segment; 30 seconds of audio holds roughly 3–4 lines of lyrics (a small sanity-check sketch follows this list).
    • Use [verse] or [chorus] to start—[intro] is less stable.
    • For instrumental-only output, omit vocal-related tags and follow guidance in GitHub issue #18.
  • Licensing & Ethics:
    Apache 2.0 grants commercial freedom, but users must ensure originality and avoid copyright infringement. Attribution (“YuE by HKUST/M-A-P”) is strongly encouraged for public or commercial releases.
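
Picking up the first prompt tip above, the snippet below is an illustrative sanity check, not part of the YuE codebase: it flags lyric sections that exceed the rough 3–4 lines-per-30-seconds budget.

```python
# Illustrative sanity check for the "~3-4 lyric lines per 30-second segment"
# rule of thumb; the threshold and parsing are assumptions, not YuE code.
import re

def check_segment_lengths(lyrics: str, max_lines: int = 4) -> None:
    """Warn about labeled sections likely to overload one generated segment."""
    for block in re.split(r"\n\s*\n", lyrics.strip()):
        label, _, body = block.partition("\n")
        n_lines = sum(1 for ln in body.splitlines() if ln.strip())
        if n_lines > max_lines:
            print(f"warning: {label} has {n_lines} lines; consider splitting it")

# Reuses the lyrics.txt written in the workflow sketch above.
check_segment_lengths(open("lyrics.txt", encoding="utf-8").read())
```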

Why Choose YuE Over Closed, Proprietary Alternatives?

YuE matches or exceeds leading proprietary systems in vocal agility, musical coherence, and lyrical alignment—while offering what they don’t:

  • Full transparency: Open weights, code, and training methodology.
  • Community collaboration: Actively maintained with community contributions (e.g., UIs, Colab notebooks).
  • No vendor lock-in: Run locally, fine-tune, or deploy in your own pipeline.
  • Research-ready: YuE’s representations even excel on music understanding benchmarks like MARBLE.

For teams prioritizing control, reproducibility, and creative freedom, YuE isn’t just an alternative—it’s the foundation for the next generation of AI-assisted music creation.

Summary

YuE redefines what’s possible in open-source AI music generation. By solving the long-form lyrics-to-song challenge with architectural innovation, multilingual support, and practical in-context learning, it empowers creators, researchers, and developers to produce studio-quality, structured songs without traditional barriers. With Apache 2.0 licensing, active community tooling, and competitive performance against closed systems, YuE stands as a strategic, future-proof choice for anyone serious about integrating AI into the music creation pipeline.