Officially published in January 2025, this pioneering work was led by Jinhua Liang from the Centre for Digital Music under the supervision of Dr. Emmanouil Benetos, and is a collaboration between Queen Mary University of London and the University of Surrey.
This work represents a major advancement in expanding the boundaries of multimodal large language models (LLMs) in the audio domain. While LLMs and visual language models (VLMs) are widely used for text and image processing, they have struggled to generalise effectively to audio data. The Acoustic Prompt Tuning (APT) framework is one of the first to systematically integrate sound understanding into LLMs and VLMs, enabling AI to interpret, recognise, and reason about audio within a natural language framework.
This innovation contributes to multiple fields, including AI-driven audio analytics, speech and music processing, assistive technology for the hearing-impaired, and audio-visual AI applications.
The research has been recognised by leading experts and accepted for publication in the IEEE Transactions on Audio, Speech, and Language Processing, a flagship journal in the field of audio signal processing and Artificial Intelligence (AI).
APT unlocks new capabilities for AI. It improves AI assistants by enabling them to understand and react to their sound environment, for example by detecting emergency alarms. It enhances hearing aids by describing sounds in natural language for the hearing-impaired. It also advances multimedia search engines, allowing users to search for content based on audio cues, such as "find me a video where a bird is chirping." Compared with existing solutions, APT-LLMs offer a more efficient and scalable way to integrate audio into AI systems, reducing the need for costly retraining.
This breakthrough paves the way for new research directions in multimodal AI, promoting seamless integration between audio-visual and language models. It is expected to have significant implications for speech, audio, and music signal processing, multimedia understanding and retrieval, hearing aid technology, and AI-driven content generation. It marks a new frontier in AI development, making AI more perceptive and capable in real-world applications involving sound.
Take a look at the research paper here.