Awesome Multimodal Language Modeling Papers and Source Codes

Step-Audio: Unified Speech Understanding and Generation for Real-World Voice Applications

Building intelligent voice interfaces used to mean stitching together separate speech recognition (ASR), text generation, and text-to-speech (TTS) systems—each with…

12/18/2025 Multimodal Language Modeling, Speech Generation, Speech Understanding