AI for Sound

Lead Research Organisation: University of Surrey
Department Name: Centre for Vision, Speech and Signal Processing (CVSSP)

Abstract

Imagine you are standing on a street corner in a city. Close your eyes: what do you hear? Perhaps some cars and buses driving on the road, footsteps of people on the pavement, beeps from a pedestrian crossing, rustling and clonks from shopping bags and boxes, and the hubbub of talking shoppers. You could do the same in a kitchen as someone makes breakfast, or while working in a busy office. Now, following the successful application of AI and machine learning technologies to the recognition of speech and images, we are beginning to tackle the challenging task of "machine listening": building computer systems that automatically analyse and recognise everyday real-world sound scenes and events.

This new technology has major potential applications in security, health & wellbeing, environmental sensing, urban living, and the creative sector. Analysis of sounds in the home offers the potential to improve comfort, security and healthcare services for inhabitants. In environmental sound sensing, analysis of urban sounds offers the potential to monitor and improve the soundscapes experienced by people in towns and cities. In the creative sector, analysis of sounds offers the potential to make better use of archives in museums and libraries, and to improve production processes for broadcasters, programme makers and games designers. The international market for sound recognition technology has been forecast to be worth around £1bn by 2021, so there is significant potential for new tools in "AI for sound" to deliver major benefits for the economy and society.

Nevertheless, realising the potential of computational analysis of sounds presents particular challenges for machine learning technologies. For example, current research use cases are often unrealistic; modern AI methods, such as deep learning, can produce promising results, but are still poorly understood; and current datasets may have unreliable or missing labels.

To tackle these and other key issues, this Fellowship will use a set of application sector use cases, spanning sound sensing in the home, in the workplace and in the outdoor environment, to drive advances in core machine learning research.

Specifically, the Fellowship will focus on four main application use cases: (i) monitoring sounds of human activity in the home for assisted living; (ii) measuring sounds in non-domestic buildings to improve office and workplace environments; (iii) measuring sounds in smart cities to improve the urban environment; and (iv) developing tools that use sounds to help producers and consumers of broadcast creative content.

Through this Fellowship, we aim to deliver a step-change in research in this area, bringing "AI for Sound" technology out of the lab and helping to realise its potential to benefit society and the economy.

Planned Impact

The proposed research has the potential to benefit the UK and international economy and society through machine recognition of sounds as a key enabling technology. The market for sound recognition technology has been forecast to be worth around £1bn internationally by 2021 (Jeronimo. Driving New Revenue Streams from Intelligent Devices through Sound Recognition, IDC, Dec 2017), and the recent DCASE workshops in 2017 and 2018 attracted around 40% industry representation. The UK acoustics industry has a turnover of £4.6bn across 750 companies (UK Acoustics Network. UK Acoustics: Sound Economics. March 2019). Acoustics is relevant to many industry sectors, from aerospace and automotive to consumer goods and non-destructive testing, with significant potential for impact from new tools in "AI for Sound".

Example potential impacts from different sectors are given below.

Commercial private sector:
* Providers of remote health and social care, through new methods to use sound sensing to help people live independently for longer;
* Internet-of-things companies that supply smart buildings and smart cities with networked sensor systems, through access to novel acoustic algorithms for more sophisticated mapping;
* Acoustic consultants, through access to new ways of mapping and understanding soundscapes, helping to drive new design possibilities for the built environment;
* Commercial companies requiring sound sensing, through access to new audio research;
* Television and radio companies, through the ability to use sound data exploration technologies in the creation, editing and re-use of audio and audiovisual programmes;
* Computer games companies, through new ways to reuse sound datasets creatively for new game sounds;
* Audio archiving companies, through access to the latest algorithms and methods for annotating and exploring sound archives;
* Musicians, composers and sound artists, through new ways to find and explore sounds as part of their creative output.

Policy-makers and others in government and government agencies:
* Smart cities, through better ways to make sense of acoustic data and improve urban soundscapes;
* Urban planning authorities, through new insights into the impact of sounds and how to visualise and understand these impacts;
* Environmental monitoring agencies, through new measurements of sound impact offering the potential to develop new noise policies and so improve the wellbeing of citizens.

Public sector, third sector and others:
* Museums and other organisations with sound archives, through new software methods to allow people to explore and use their archives;
* Science promotion organisations, in particular through outputs from the project on how people perceive and navigate sounds;
* Environmental organisations, through new ways to monitor biodiversity.

Wider public:
* People living with dementia and others in need of assisted living to continue living at home, through new and simpler monitoring methods enabled by sound sensing;
* People working in offices, through new tools to measure the impact of sound, leading to new designs of workplace soundscapes;
* People living in urban environments, through improved city sound and noise policies and better-designed soundscapes, making the urban environment more pleasant;
* Audiences of creative output involving audio and music, through the availability of new creative outputs facilitated by creative access to new sounds;
* People interested in exploring audio recordings at home, school, college or university, whether for educational or general-interest purposes;
* Teachers in schools, colleges or universities who want to use sound examples for teaching audio or music.

Researchers employed on the project:
* Improved skills in research methodologies, which may be transferred into the commercial sector on completion of the project.

For specific plans for the realisation of impact, see "Pathways to Impact".
 
Description The project has created a number of new methods for the automatic recognition and generation of sounds, including detection of sounds and their location ("sound event localization and detection"), new methods to generate text captions from sound clips ("audio captioning"), new methods to separate sounds from a mixture ("audio source separation"), new methods to build efficient sound recognition models through model compression, and a new method for sound synthesis using generative AI ("AudioLDM"; an illustrative usage sketch follows this record).
Exploitation Route Sound recognition has a wide range of potential applications, including assisted living in the home, security, smart-city monitoring, and machine condition monitoring. New audio generation methods have potential applications in the creative industries. Cross-modal methods such as audio captioning have the potential to improve the accessibility of audio content.
Sectors Creative Economy

Digital/Communication/Information Technologies (including Software)

Environment

Healthcare

Culture

Heritage

Museums and Collections

URL https://ai4s.surrey.ac.uk/
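 
As an illustration of the AudioLDM result mentioned above, the following minimal sketch generates audio from a text prompt. It assumes the publicly released Hugging Face checkpoint ("cvssp/audioldm-s-full-v2") and the diffusers AudioLDMPipeline API; it is a usage sketch of the public release, not the project's internal code.

import torch
import scipy.io.wavfile
from diffusers import AudioLDMPipeline

# Load the public AudioLDM checkpoint (name assumed from the HF release).
pipe = AudioLDMPipeline.from_pretrained("cvssp/audioldm-s-full-v2")
pipe = pipe.to("cuda" if torch.cuda.is_available() else "cpu")

# Generate ~5 seconds of audio from a text description.
prompt = "birds singing in a park with distant traffic"
audio = pipe(prompt, num_inference_steps=50, audio_length_in_s=5.0).audios[0]

# AudioLDM produces 16 kHz mono audio.
scipy.io.wavfile.write("generated.wav", rate=16000, data=audio)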
 
Description Presentations to industry, practitioner groups and the wider public, including the Association of Noise Consultants, the Huawei Future Device Technology Summit, the Amazon Audio Tech Summit, and a set of talks to USA West Coast technology companies (Microsoft, Meta, Adobe, ByteDance, Apple, Amazon). There is also significant industry interest in the AudioLDM generative model, leading to a number of discussions on potential applications.
First Year Of Impact 2020
Sector Creative Economy, Digital/Communication/Information Technologies (including Software), Healthcare
Impact Types Economic

 
Title Sound Wellbeing in Later Life Experience Sampling Method 
Description The project generated a series of listening and memory activities, delivered by a mobile app, to measure soundscape features and personal reflections on sound. 
Type Of Material Improvements to research infrastructure 
Year Produced 2023 
Provided To Others? No  
Impact Pilot participants showed high adherence to the activities, providing a positive initial assessment of the method's utility for this research area. Further deployments are needed to build publishable results on efficacy. 
 
Title DCASE2021 UAD-S UMAP Data 
Description Support data for our paper "Using UMAP to Inspect Audio Data for Unsupervised Anomaly Detection under Domain-Shift Conditions" (an arXiv preprint and code for the experiment software pipeline described in the paper are available online). The pipeline requires and generates different forms of data; here we provide the following:
* AudioSet_wav_fragments.zip: a custom selection of 39,437 wav files (32 kHz, mono, 10 seconds) randomly extracted from AudioSet (originally released under CC-BY). In addition to this custom subset, the paper also uses the DCASE2021 Task 2 Development Dataset, the DCASE2021 Task 2 Additional Training Dataset, and Fraunhofer's IDMT-ISA-ELECTRIC-ENGINE Dataset, which can be downloaded from their respective websites.
* dcase2021_uads_umaps.zip: to compute the UMAPs, the log-STFT, log-mel and L3 representations must first be extracted and the UMAPs then computed, which can take a substantial amount of time and resources. For convenience, we provide the 72 UMAPs discussed in the paper.
* dcase2021_uads_umap_plots.zip: also for convenience, we provide the 198 high-resolution scatter plots rendered from the UMAPs.
For a comprehensive visual inspection of the computed representations, it is sufficient to download the plots only. Users interested in exploring the plots interactively will need to download all the audio datasets and compute the log-STFT, log-mel and L3 representations as well as the UMAPs themselves (code is provided in the GitHub repository). UMAPs for further representations can also be computed and plotted. A hedged sketch of the representation-to-UMAP step follows this record.
Type Of Material Database/Collection of data 
Year Produced 2021 
Provided To Others? Yes  
URL https://zenodo.org/record/5123023
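 
For readers who want to reproduce a representation-to-UMAP step like the one described in this record, the sketch below computes log-mel features for a folder of equal-length clips and projects them to 2-D with umap-learn. The directory name and parameter values are illustrative assumptions, not the exact settings used to produce the released UMAPs.

import glob
import numpy as np
import librosa
import umap
import matplotlib.pyplot as plt

def log_mel(path, sr=32000, n_mels=64):
    # Load a mono clip and return its flattened log-mel spectrogram.
    y, _ = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel).flatten()

# One feature vector per 10-second clip (folder name is hypothetical).
paths = sorted(glob.glob("wav_fragments/*.wav"))
feats = np.stack([log_mel(p) for p in paths])

# 2-D UMAP embedding of the clip-level features.
embedding = umap.UMAP(n_neighbors=15, min_dist=0.1).fit_transform(feats)

plt.scatter(embedding[:, 0], embedding[:, 1], s=2)
plt.title("UMAP of log-mel audio representations")
plt.savefig("umap_logmel.png", dpi=300)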
 
Title E-PANNs checkpoint 
Description Checkpoints of the pruned model (CNN14) in which 50% of the filters in convolutional layers C7 to C12 are pruned. See the GitHub link for a live sound recognition demo using E-PANNs and for the E-PANNs architecture. An illustrative filter-pruning sketch follows this record. 
Type Of Material Database/Collection of data 
Year Produced 2023 
Provided To Others? Yes  
URL https://zenodo.org/record/7939402
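 
The sketch below illustrates the general idea behind this artefact: removing 50% of a convolutional layer's filters, here ranked by L1 weight norm. It is a generic PyTorch illustration of filter pruning, not the exact E-PANNs procedure; see the linked repository for the real implementation.

import torch
import torch.nn as nn

def prune_conv_filters(conv: nn.Conv2d, keep_ratio: float = 0.5) -> nn.Conv2d:
    # Keep the output filters with the largest L1 weight norms.
    n_keep = max(1, int(conv.out_channels * keep_ratio))
    norms = conv.weight.detach().abs().sum(dim=(1, 2, 3))
    keep = torch.argsort(norms, descending=True)[:n_keep]

    pruned = nn.Conv2d(conv.in_channels, n_keep,
                       kernel_size=conv.kernel_size, stride=conv.stride,
                       padding=conv.padding, bias=conv.bias is not None)
    with torch.no_grad():
        pruned.weight.copy_(conv.weight[keep])
        if conv.bias is not None:
            pruned.bias.copy_(conv.bias[keep])
    return pruned

conv = nn.Conv2d(512, 512, kernel_size=3, padding=1)
print(prune_conv_filters(conv))  # Conv2d(512, 256, ...)

Note that pruning a layer's output filters also shrinks the input expected by the next layer, so in a full network the corresponding input channels of the following convolution (and any batch-norm parameters) must be pruned to match.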
 
Title E-PANNS: Sound Recognition using Efficient Pre-Trained Audio Neural Networks 
Description E-PANNs is an efficient version of the existing PANNs network that recognises sound activity in the surrounding environment, covering 527 sound classes including human speech, animal sounds, ambulance sirens and smoke alarms. The PANNs network requires 312 MB of memory, whereas E-PANNs requires approximately 92 MB, is faster to compute, and achieves improved performance. 
Type Of Technology Webtool/Application 
Year Produced 2023 
Open Source License? Yes  
Impact E-PANNs requires less memory and fewer computations than the existing PANNs network, making it a more efficient option for recognising sound activity in the surrounding environment. 
 
Title General Purpose Sound Recognition Demo 
Description The General-Purpose Sound Recognition application runs on conventional computers, leveraging PANNs for real-time audio event detection. The software, equipped with a user-friendly Tkinter interface, has demonstrated significant utility in real-time event detection. Predictions are obtained by applying the audio tagging system to consecutive short audio segments, and the application can perform multiple updates per second on a moderate CPU. A minimal sliding-window tagging sketch follows this record. 
Type Of Technology Webtool/Application 
Year Produced 2021 
Open Source License? Yes  
Impact The demo illustrates, in a simple and user-friendly way, the potential of large-scale pretrained audio neural networks for audio pattern recognition. 
URL https://github.com/yinkalario/General-Purpose-Sound-Recognition-Demo
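 
A minimal sketch of the consecutive-segment tagging loop described above is given below. It assumes the panns_inference package and sounddevice for microphone capture; the actual demo code is in the linked repository.

import numpy as np
import sounddevice as sd
from panns_inference import AudioTagging, labels

SR = 32000       # PANNs models expect 32 kHz audio
WINDOW_S = 2.0   # length of each analysis segment in seconds

at = AudioTagging(checkpoint_path=None, device="cpu")  # downloads CNN14

for _ in range(10):  # tag ten consecutive segments
    # Record one short segment from the default microphone.
    block = sd.rec(int(SR * WINDOW_S), samplerate=SR, channels=1,
                   dtype="float32")
    sd.wait()
    clip = block.reshape(1, -1)  # (batch, samples)

    # Clip-level tagging over the segment; print the top prediction.
    clipwise_output, _ = at.inference(clip)
    top = int(np.argmax(clipwise_output[0]))
    print(f"{labels[top]}: {clipwise_output[0][top]:.2f}")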
 
Title Raspberry Pi-based Audio Event Detection System 
Description The system is a Raspberry Pi-based device utilizing the Google AIY Voice Kit and PANNs for real-time audio event detection. It allows user interaction via a button or web interface, presenting detected sound events on a timeline. The setup is designed for edge computing with a focus on applications in ambient assisted living and environmental monitoring, showcasing a practical implementation of AI in sound recognition and classification. 
Type Of Technology Physical Model/Kit 
Year Produced 2023 
Impact The project implemented Pre-trained Audio Neural Networks (PANNs) on a Raspberry Pi, leading to practical applications in ambient assisted living and environmental monitoring. Moreover, performance metrics such as the detection accuracy at different sound pressure levels and the effect of device operating temperature on latency were thoroughly measured, providing insights for further optimization and use in real-world scenarios. 
URL https://www.youtube.com/watch?v=ZNHtcqECNQQ
 
Description Invited keynote talk (online presentation) at the satellite event "AI and Sustainability" of the Conference on Complex Systems 2023 (CCS2023), Salvador, Bahia, Brazil 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Delivered an invited talk entitled "Efficient AI Models via Pruning", playing a key role in making the satellite workshop on "AI and Sustainability" a success (website: http://aiandsustainability.com/).

The talk's insights into the need for sustainable AI enriched participants' understanding of how to design efficient AI/machine learning frameworks.
Year(s) Of Engagement Activity 2023
URL https://youtu.be/uAQ00xqhhEw
 
Description Interdisciplinary Perspectives on Soundscapes and Wellbeing webinar and workshop 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The University of Surrey hosted a two-day, in-person workshop on Soundscapes and Wellbeing with approximately 30 in-person attendees, including four keynote speakers from the Netherlands, Switzerland, Germany and the UK, and over 20 flash presenters from a range of disciplines spanning academia, non-departmental public bodies and industry. The first day of the workshop was also streamed online as a recorded webinar, for which 250 people registered via Eventbrite; peak online attendance was over 180 people from countries across the globe. The second day focused on capacity-building in relation to soundscapes and wellbeing, which has led to plans for further cooperation (e.g., funding bids and establishing a research network). We have also received emails from webinar attendees regarding collaboration possibilities.
Year(s) Of Engagement Activity 2024
URL https://www.surrey.ac.uk/news/university-surrey-hosts-successful-international-workshop
 
Description Invited Talk at "Health and Wellbeing Living Lab Symposium" 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Policymakers/politicians
Results and Impact Presented project work completed in collaboration with LiCalab (the Sound Wellbeing in Later Life study), and additionally pitched the value of soundscape and wellbeing studies in the context of Living Labs.
Year(s) Of Engagement Activity 2024
URL https://vitalise-project.eu/health-and-living-lab-symposium/
 
Description Invited flashtalk at "Interdisciplinary Perspectives on Soundscapes and Wellbeing" workshop (hybrid presentation) 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact This two-day workshop, led by the University of Surrey's Environmental Psychology Research Group (EPRG), aimed to advance the field by showcasing interdisciplinary research on soundscapes and wellbeing conducted within and beyond the UK. Day 1 included talks from keynote speakers and over 20 experts from a range of disciplines (invited flash talks of 5 minutes each). Day 2 involved World Café and sandpit sessions to enable further discussion of shared interests, with a view to generating substantive collaborative proposals, a working group, and written outputs.
Year(s) Of Engagement Activity 2024
URL https://www.eventbrite.co.uk/e/interdisciplinary-perspectives-on-soundscapes-and-wellbeing-tickets-7...
 
Description Invited talk to the Association of Noise Consultants, Aug 2020 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Invited talk on "Artificial Intelligence for Sound" to the Association of Noise Consultants, 27 Aug 2020 (Video meeting).
Year(s) Of Engagement Activity 2020
URL https://www.linkedin.com/feed/update/urn:li:activity:6697179064635150337/
 
Description Sound Wellbeing in Later Life Co-creative workshop with Older Adults 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Study participants or study members
Results and Impact This workshop explored participants' experience of the study to generate potential ideas and research directions for the project. The workshop consisted of two activities: Study Review and Directed Imagination. The Study Review activity elicited feedback on how participants engaged with sound and memory during the study, exploring how the probes could be improved in future. The Directed Imagination activity asked participants to think creatively about how they would like to interact with sound and memory in their lives. The workshop was designed to explore these concepts with lay audiences in two hours. The research team will evaluate participant contributions and analyse the needs and opportunities offered by AI for sound.
Year(s) Of Engagement Activity 2023
 
Description Spotlight talk at The Turing Presents: AI UK 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Industry/Business
Results and Impact Presented a Spotlight Talk on "AI for Sound" at the online event The Turing Presents: AI UK (23-24 March 2021), a showcase featuring UK academic work in AI and machine learning.
Year(s) Of Engagement Activity 2021
URL https://web.archive.org/web/20210601153918/https://www.turing.ac.uk/ai-uk