InspireMusic is an open-source framework that redefines what’s possible in AI-powered music generation. By seamlessly integrating a large language model (LLM) based on Qwen 2.5 with a super-resolution flow-matching model, InspireMusic enables the creation of high-fidelity, long-form music—up to 8 minutes in length—from simple text prompts or audio continuations. Unlike many existing systems that produce short, low-quality clips or require massive computational resources, InspireMusic delivers studio-grade audio with strong temporal coherence, all while maintaining efficient training and inference through a novel single-codebook audio tokenizer.
Designed for creators, developers, and researchers, InspireMusic supports tasks like text-to-music, music continuation, and audio super-resolution, making it ideal for applications ranging from video game soundtracks to AI-assisted composition. With pre-trained models like InspireMusic-1.5B-Long openly available and straightforward inference APIs, getting started takes just a few minutes—no PhD required.
Why InspireMusic Stands Out
Unified Architecture for High-Quality, Long-Form Audio
At its core, InspireMusic combines two powerful components:
- An autoregressive transformer built on the Qwen 2.5 architecture, trained to predict discrete audio tokens from both text and audio inputs using next-token prediction.
- A super-resolution flow-matching model that upsamples low-sampling-rate token sequences into high-fidelity 48kHz stereo waveforms with fine-grained acoustic details.
This two-stage pipeline separates semantic modeling (handled by the LLM) from waveform synthesis (handled by the flow-matching model), allowing each to specialize without compromising quality or length. The result? Music that sounds natural, emotionally expressive, and structurally coherent—even over several minutes.
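To make this separation concrete, here is a minimal, illustrative sketch of the two-stage flow in PyTorch. Every name, layer size, and the greedy decoding loop below are placeholders chosen for brevity; they mirror the structure described above, not InspireMusic's actual implementation.

```python
import torch
import torch.nn as nn

# Stage 1 stand-in: a causal transformer that predicts the next audio token.
class TokenLM(nn.Module):
    def __init__(self, vocab_size=4096, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens):  # (B, T) -> (B, T, vocab)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        return self.head(self.backbone(self.embed(tokens), mask=mask))

# Stage 2 stand-in: lifts low-rate latents to a stereo waveform.
# InspireMusic uses flow matching here; a transposed convolution merely
# illustrates "token rate in, audio rate out".
class SuperResolution(nn.Module):
    def __init__(self, dim=256, upsample=960):
        super().__init__()
        self.up = nn.ConvTranspose1d(dim, 2, kernel_size=upsample, stride=upsample)

    def forward(self, latents):  # (B, dim, T) -> (B, 2, T * upsample)
        return self.up(latents)

lm, sr = TokenLM(), SuperResolution()
tokens = torch.randint(0, 4096, (1, 16))             # tokenized prompt
for _ in range(32):                                  # greedy next-token decoding
    next_token = lm(tokens)[:, -1].argmax(-1, keepdim=True)
    tokens = torch.cat([tokens, next_token], dim=1)
latents = lm.embed(tokens).transpose(1, 2)           # (1, 256, 48)
audio = sr(latents)                                  # (1, 2, 46080) pseudo-waveform
print(audio.shape)
```

In the real system, stage one runs on the Qwen 2.5 backbone and stage two is a flow-matching model conditioned on the token sequence.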
Efficient Tokenization with a Single Codebook
A key innovation is InspireMusic’s use of a single-codebook audio tokenizer, which encodes raw audio into compact yet semantically rich tokens. This contrasts with multi-codebook approaches (like those in earlier MusicGen versions) that increase model complexity and training costs. By reducing redundancy and focusing on high-bitrate compression, InspireMusic lowers resource demands while preserving musical nuance—making high-quality generation more accessible.
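The lookup at the heart of any single-codebook tokenizer fits in a few lines: slice audio into frames, then snap each frame to its nearest codebook vector, yielding exactly one token stream. The sketch below matches raw frames against a random codebook; the frame length and codebook size are assumptions for illustration, not InspireMusic's trained tokenizer.

```python
import torch

def tokenize(audio: torch.Tensor, codebook: torch.Tensor,
             frame: int = 480) -> torch.Tensor:
    """Map mono audio (T,) to one token per frame using a single codebook.

    codebook: (K, frame) matrix of code vectors. A trained tokenizer would
    first encode frames into a learned latent space; matching raw frames
    directly is enough to show the single-lookup idea.
    """
    usable = audio[: len(audio) // frame * frame]    # drop the ragged tail
    frames = usable.reshape(-1, frame)               # (N, frame)
    dists = torch.cdist(frames, codebook)            # (N, K) L2 distances
    return dists.argmin(dim=1)                       # one index per frame

codebook = torch.randn(4096, 480)                    # K=4096 codes (assumed size)
audio = torch.randn(24_000)                          # 1 s of 24 kHz mono noise
tokens = tokenize(audio, codebook)
print(tokens.shape, tokens[:8])                      # 50 tokens, one stream
```

With one index stream, the LLM predicts a single token per step, avoiding the parallel streams and interleaving patterns that multi-codebook schemes require.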
Real-World Capabilities That Matter
Generate Minutes of Coherent Music, Not Just Seconds
Most open-source music generators cap outputs at 10–30 seconds. InspireMusic breaks this barrier: the InspireMusic-1.5B-Long variant supports generation up to 8 minutes, enabling full musical arrangements with intros, verses, choruses, and outros. This is transformative for creators needing background scores for films, podcasts, or games.
Dual-Modality Prompting: Text, Audio, or Both
Need ambient jazz for a café scene? Just type a description. Want to extend a melody you recorded on your phone? Upload a 10-second clip. InspireMusic supports three prompting modes (a usage sketch follows this list):
- Text-to-music: Generate from natural language prompts.
- Music continuation: Extend existing audio while preserving style and key.
- Hybrid prompting: Combine text and audio for precise control (e.g., “continue this classical piano piece with a dramatic climax”).
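As a usage sketch, the three modes might look like this through the Python API introduced below. The "text-to-music" task string matches the CLI example later in this post, but the "continuation" task name and the audio_prompt keyword are assumptions; check the project README for the exact signatures.

```python
# Hedged sketch of the three prompting modes; "continuation" and
# audio_prompt are assumed names, not confirmed API surface.
model = InspireMusicModel(model_name="InspireMusic-1.5B-Long")

# 1. Text-to-music: natural-language prompt only.
model.inference("text-to-music", "Ambient jazz for a rainy cafe scene")

# 2. Music continuation: extend an existing clip, preserving style and key.
model.inference("continuation", audio_prompt="my_melody_10s.wav")

# 3. Hybrid prompting: steer the continuation with text.
model.inference("continuation",
                "continue this classical piano piece with a dramatic climax",
                audio_prompt="piano_intro.wav")
```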
High Sampling Rates for Professional Use
Outputs are generated at 48kHz stereo (with 24kHz mono options), matching industry standards for music production. The flow-matching super-resolution stage ensures crisp highs, warm mids, and deep lows—critical for listeners using high-quality headphones or studio monitors.
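To confirm a generated file actually meets these specs, a quick check with torchaudio works (assuming torchaudio is installed alongside PyTorch; the filename here is just an example):

```python
import torchaudio

# Load a generated file and verify it matches the advertised format.
waveform, sample_rate = torchaudio.load("output.wav")   # example filename
channels, num_samples = waveform.shape
print(f"{sample_rate} Hz, {channels} ch, {num_samples / sample_rate:.1f} s")
assert sample_rate == 48_000 and channels == 2, "expected 48 kHz stereo"
```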
Getting Started Is Straightforward
InspireMusic prioritizes usability without sacrificing power:
- Install with standard Python tools (Python ≥3.8, PyTorch ≥2.0.1, CUDA ≥11.8) or via Docker for containerized deployment.
- Download pre-trained models like InspireMusic-1.5B-Long from ModelScope or Hugging Face in one command.
- Generate music instantly via CLI or Python API:
```
python -m inspiremusic.cli.inference --task text-to-music -t "Epic orchestral music with thunder and deep drums"
```
Or programmatically:
```python
# Assuming InspireMusicModel is importable from the installed
# inspiremusic package; see the project README for the exact path.
model = InspireMusicModel(model_name="InspireMusic-1.5B-Long")
model.inference("text-to-music", "Calm lo-fi beats for studying")
```
Docker support eliminates dependency headaches, while optional “fast mode” (skipping flow-matching) enables rapid prototyping on modest hardware.
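A sketch of what fast mode might look like from Python; the fast keyword is an assumed toggle, not a documented parameter, so consult the CLI help for the real option name.

```python
# Hedged sketch: `fast` is an assumed toggle. Fast mode skips the
# flow-matching stage, trading the 48 kHz super-resolved output for
# quicker, lower-rate previews on modest hardware.
model = InspireMusicModel(model_name="InspireMusic-1.5B-Long")
model.inference("text-to-music", "Calm lo-fi beats for studying", fast=True)
```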
Practical Considerations
While powerful, InspireMusic has realistic constraints:
- Hardware: GPU with ≥16GB VRAM recommended for full inference (especially with flow-matching).
- Base vs. Long models: Standard models (e.g., InspireMusic-1.5B) generate ~30s clips; only the “Long” variant supports multi-minute output.
- Vocal content: The current focus is instrumental music. For singing with lyrics, the separate InspireSong-1.5B model is under development.
These limitations are clearly documented, helping users choose the right variant for their needs.
How It Compares to MusicGen and Stable Audio 2.0
In both subjective listening tests and objective metrics, such as Fréchet Audio Distance (FAD) and Kullback–Leibler divergence (KLD), InspireMusic-1.5B-Long matches or exceeds top open-source alternatives like Meta's MusicGen and Stability AI's Stable Audio 2.0; a minimal FAD sketch follows the list below. Its advantages include:
- Longer coherent segments (8 min vs. MusicGen’s 30s without stitching).
- Higher sampling rate support (48kHz stereo out-of-the-box).
- More efficient training via single-codebook tokenization.
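For readers reproducing the objective comparison: FAD is the Fréchet distance between Gaussian fits of two embedding sets (typically VGGish-style embeddings of reference vs. generated audio). A minimal implementation, assuming you already have the two embedding matrices:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(real: np.ndarray, fake: np.ndarray) -> float:
    """FAD between two embedding sets, each shaped (num_clips, embed_dim).

    Fits a Gaussian to each set and returns
    ||mu_r - mu_f||^2 + Tr(S_r + S_f - 2 (S_r S_f)^(1/2)).
    """
    mu_r, mu_f = real.mean(axis=0), fake.mean(axis=0)
    sigma_r = np.cov(real, rowvar=False)
    sigma_f = np.cov(fake, rowvar=False)
    covmean = sqrtm(sigma_r @ sigma_f)
    if np.iscomplexobj(covmean):             # trim numerical imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(sigma_r + sigma_f - 2.0 * covmean))

# Example with random stand-ins for real embedding matrices:
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 128))
fake = rng.normal(loc=0.1, size=(200, 128))
print(frechet_audio_distance(real, fake))    # lower is better
```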
For teams seeking a maintainable, extendable codebase with state-of-the-art results, InspireMusic offers a compelling open-source alternative to closed or fragmented systems.
Summary
InspireMusic delivers what many AI music tools promise but few achieve: high-fidelity, long-form, controllable music generation from intuitive inputs. By fusing the reasoning power of a modern LLM with the acoustic precision of flow-matching super-resolution, it solves real pain points—short clips, muffled audio, and opaque control—while remaining accessible to developers and creators. Whether you’re prototyping a game soundtrack, exploring generative art, or researching controllable audio synthesis, InspireMusic provides a robust, open foundation to build upon.