Blind Estimation of Binaural Parameters with Artificial Neural Networks

Lead Research Organisation: University of Salford
Department Name: Sch of Computing, Science & Engineering

Abstract

Various parameters used in binaural technology cannot be calculated analytically without prior knowledge of a number of factors, such as the impulse response of the room being used. One such example is the Inter-Aural Cross-correlation Coefficient (IACC), a measure of the 'spaciousness' experienced by a listener. Artificial Neural Networks (ANNs) can be trained to estimate parameters like this in much the same way an expert human listener would: via intuitive 'knowledge' of the measure and how it sounds at a given value. This has been done in the past to estimate parameters such as reverberation time, and with recent advances in ANN technology it is hoped that this will also be feasible, to a high degree of accuracy, for the parameters used in binaural sound reproduction.
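For reference, the IACC is conventionally defined (e.g. in ISO 3382-1) as the peak magnitude of the normalised interaural cross-correlation function over lags of ±1 ms. A minimal sketch of the direct calculation, assuming two equal-length ear signals as NumPy arrays (the function name and defaults are illustrative, not from this project):

```python
import numpy as np

def iacc(left, right, fs, max_lag_ms=1.0):
    """Inter-Aural Cross-correlation Coefficient: the peak magnitude of
    the normalised cross-correlation between the left- and right-ear
    signals over lags of +/- max_lag_ms (1 ms in ISO 3382-1)."""
    max_lag = int(fs * max_lag_ms / 1000)
    norm = np.sqrt(np.sum(left ** 2) * np.sum(right ** 2))
    best = 0.0
    for lag in range(-max_lag, max_lag + 1):
        # Correlate left[t] with right[t + lag], truncating to the overlap.
        l_seg = left[max(0, -lag): len(left) - max(0, lag)]
        r_seg = right[max(0, lag): len(right) - max(0, -lag)]
        best = max(best, abs(np.sum(l_seg * r_seg)) / norm)
    return best
```

The blind-estimation problem is precisely that this direct calculation requires the ear signals (or the room's binaural impulse response), which an estimator operating on arbitrary received audio does not have.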

Publications


Studentship Projects

Project Reference  Relationship  Related To    Start       End         Student Name
EP/N509772/1                                   01/10/2016  30/09/2021
1855512            Studentship   EP/N509772/1  02/01/2017  31/12/2019  Philippa Demonte
 
Description My research focuses on how manipulations of object-based audio (OBA) could help to improve speech intelligibility, particularly in the context of the creative industries.
Speech intelligibility is the proportion of spoken dialogue that can be heard and understood.

Taking television broadcasting as an example, content producers currently use a channel-based approach to audio: in advance of transmission, they create finished sound mixes for specific loudspeaker configurations in the home, generally stereo (left and right loudspeakers on a TV set) or 5.1. Beyond the ability to turn the overall volume up or down, the end-user has very little control over altering that sound to their particular hearing needs or preferences.

Thanks to recent technological developments originating from the gaming and film industries, in the not-too-distant future broadcasters will instead be able to take an object-based audio (OBA) approach, in which sound stems (dialogue, background music, sound effects, atmospheric sounds, and so on) will be transmitted as individual entities together with their accompanying metadata (information about each sound object, analogous to a 'recipe'). The rendering software in the receiving device, such as a smart television or smart speaker, will then put together a sound mix adapted to the configuration of the available listening devices** and the end-user's needs.

** Using a concept known as media device orchestration (MDO), it will be possible to use the loudspeakers found in mobile phones, laptops, tablets, and other readily-available devices to create augmented, ad-hoc arrays of loudspeakers in the home.

Binaural room impulse responses (BRIRs) or head-related transfer functions (HRTFs) could also be incorporated at the rendering stage to create the impression of different acoustic spaces or of spatial separation between sounds. For users of Bluetooth-enabled headphones or earbuds containing accelerometers, this can be made to sound even more realistic with the implementation of head tracking, such that the apparent source location of the sound does not move even when the listener turns their head.
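At its core, rendering an audio object through a BRIR amounts to convolving the mono object signal with the left- and right-ear impulse responses. A minimal sketch in Python/NumPy (the function name and the fixed, non-head-tracked BRIR pair are illustrative assumptions, not the production renderer):

```python
import numpy as np

def render_binaural(mono_object, brir_left, brir_right):
    """Render a mono audio object for headphones by convolving it with
    the left- and right-ear binaural room impulse responses (BRIRs).
    Returns the two ear signals."""
    return (np.convolve(mono_object, brir_left),
            np.convolve(mono_object, brir_right))
```

With head tracking, the renderer would instead select (or interpolate between) BRIR pairs according to the current head orientation, so that the source direction stays fixed in the room rather than rotating with the head.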

An object-based audio approach allows for greater audio accessibility, personalisation, immersiveness, and interactivity.

The research funded through this award has involved four separate quantitative psychoacoustics listening experiments:

* One of the four biggest complaints received by broadcasters in terms of audibility concerns the use of background music when co-present with dialogue in programmes. Currently, the only guidelines given by broadcasters to content producers on this matter are: i) to turn down the overall level of the background music relative to the dialogue, and ii) to avoid heavily percussive music. With respect to the latter, however, a psychoacoustics listening experiment conducted by the researcher determined that neither the tempo nor the 'percussiveness' of background music has any significant effect on speech intelligibility. Surprisingly, the only detrimental effect in this experiment was caused by a music sample featuring a solo cello, and the addition of further instrumentation, including percussive instruments, was found to restore speech intelligibility. The cello's fundamental frequency range overlaps that of the human voice, so when both are co-present in an auditory stream they stimulate the same regions of the basilar membrane within the inner ear. Either the solo cello music confused listeners as to what was being heard, or, conversely, it was so salient as to distract them from attending to the dialogue. The restoration of speech intelligibility with additional instrumentation would suggest the latter, but further investigation is required.

* A two-part listening experiment was conducted to compare two contrasting audio engineering manipulations in terms of their effect on speech intelligibility, perceived sound quality, and preference when applied only to background sounds: ducking, which reduces the overall background sound level relative to the foreground speech level, and downward dynamic range expansion (DDRE), which reduces only those background levels that fall below a certain threshold whilst retaining levels above it. Ducking is currently used in broadcasting and is already proven to be an effective means of improving speech intelligibility by allowing greater audibility of dialogue. With an object-based approach to sound, however, the application of DDRE could be a means of improving speech intelligibility whilst also retaining the narrative importance of certain non-speech background sounds, for example the background music in David Attenborough-style nature documentaries. Part 1 of the listening experiment determined that even the lowest amount of DDRE was significantly beneficial to speech intelligibility. However, analysis of the data collected in Part 2 indicated that high amounts of DDRE are detrimental to both perceived overall sound quality and preference, i.e. there is a trade-off between intelligibility and acceptability. In contrast, the highest amount of ducking is significantly beneficial to speech intelligibility without any significant effect on perceived quality or preference.

* A listening experiment investigating the use of headphones with small-screen devices found that speech intelligibility can be significantly improved when BRIRs and headtracking are applied only to the dialogue audio object whilst keeping all other audio objects in a regular stereo mix. This action provides the relevant auditory cues to make the mind perceive the dialogue as being co-located with the screen and therefore spatially separated from the rest of the sound mix.

* The final listening experiment of this series, which took place in March 2020, was based on the concept of MDO. Imagine a living-room listening environment with a regular stereo (left/right) pair of loudspeakers in a TV set, plus a mobile phone used as a dedicated loudspeaker for foreground speech objects. A time delay is applied to the speech signal from the phone so that, owing to a psychoacoustic phenomenon known as the precedence effect, the perceived sound image always appears to originate from the phantom centre position between the stereo pair, regardless of whether the phone is placed on a coffee table directly ahead of the listener or on a bookshelf directly to the side. What, then, is the effect on speech intelligibility, and why? The results of the speech-in-noise test determined that the third loudspeaker significantly improves speech intelligibility compared to the baseline condition of just the stereo pair, and that this is solely due to the small boost in speech level afforded by the precedence effect. There was no significant difference between the results for the two positions of the additional loudspeaker (directly ahead versus directly to the side), so a psychoacoustic concept known as binaural unmasking was not responsible for the improvement in this case. It should be noted that, for the purpose of valid scientific investigation, this experiment was conducted under carefully controlled conditions, including use of the same Genelec loudspeakers for all positions in the array.
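The background-sound manipulations (ducking, DDRE) and the precedence-effect delay described above can be sketched in a few lines of Python/NumPy. This is an illustrative approximation under simplified assumptions (frame-based RMS levels, a hard expansion knee, a plain zero-padded delay), not the actual experimental processing chain:

```python
import numpy as np

def duck(background, speech_active, atten_db=10.0):
    """Ducking: attenuate the whole background signal by a fixed
    amount wherever foreground speech is active."""
    gain = 10 ** (-atten_db / 20)
    return np.where(speech_active, background * gain, background)

def ddre(background, threshold_db=-40.0, ratio=2.0, frame=512):
    """Downward dynamic range expansion: frames whose RMS level falls
    below the threshold are attenuated further, while frames above
    the threshold are left untouched."""
    out = np.asarray(background, dtype=float).copy()
    for i in range(0, len(out), frame):
        seg = out[i:i + frame]  # view: in-place edits modify `out`
        level_db = 20 * np.log10(np.sqrt(np.mean(seg ** 2)) + 1e-12)
        if level_db < threshold_db:
            # (level - threshold) is negative, so this attenuates
            seg *= 10 ** ((level_db - threshold_db) * (ratio - 1) / 20)
    return out

def precedence_delay(x, fs, delay_ms):
    """Delay a loudspeaker feed by a few milliseconds (zero-padded)
    so that, via the precedence effect, the earlier-arriving stereo
    pair continues to anchor the perceived source location."""
    n = int(round(fs * delay_ms / 1000))
    return np.concatenate([np.zeros(n), x])
```

With a 2:1 expansion ratio and a -40 dB threshold, for example, a background frame at -60 dB is pushed down a further 20 dB, while frames above -40 dB pass through unchanged; ducking, by contrast, applies its fixed attenuation to everything.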
Exploitation Route These findings can be turned into recommendations for content producers of object-based media, such as how instrumentation in background music is grouped on audio stems, or the creation of additional metadata rules to give the end-user more options for how they hear a sound mix. Examples include rules on which device or loudspeaker dialogue versus non-dialogue audio objects should be sent to, depending on the available configuration; a rule that additional instrumentation should be added to background music if it features a solo instrument and is co-present with dialogue; or even offering end-users a choice of background music for passages where it is co-present with dialogue.

Object-based audio approaches have only been implemented since 2016 by film makers and select cinemas, and since 2018 by the South Korean national broadcasting corporation. In Europe the approach is currently being applied to audio from sports coverage, including football and tennis matches. Research and development departments at broadcasters such as the BBC are currently testing different aspects of OBA, including the production workflow, rendering software for mobile phones, and end-use with smart televisions and smart speakers. In live sound, at least one loudspeaker company is known to be implementing OBA approaches in theatre and live music, and interactive media companies are interested in OBA for gaming. These are some of the sectors that could potentially implement my research findings.
Sectors Creative Economy; Digital/Communication/Information Technologies (including Software); Leisure Activities, including Sports, Recreation and Tourism; Culture, Heritage, Museums and Collections; Other

 
Title Speech corpus - accompanying masking noises 
Description Master audio files of speech-shaped noise (SSN) and speech-modulated noise (SMN) generated for use with the Demonte (2019) digital re-recording of the HARVARD speech corpus. Created by the researcher for use in quantitative and qualitative psychoacoustic listening experiments towards their PhD research, in particular speech-in-noise (SIN) tests of speech intelligibility. 
Type Of Material Improvements to research infrastructure 
Year Produced 2019 
Provided To Others? Yes  
Impact The master audio files have been made openly available via the University of Salford's Figshare repository. Results from psychoacoustic (listening) experiments which have actively used these audio files will be published in their PhD thesis in Autumn 2020. 
URL https://salford.figshare.com/collections/HARVARD_corpus_Speech_Shaped_Noise_and_Speech_Modulated_Noi...
 
Title Speech corpus recording 
Description A new high-quality digital audio recording of the HARVARD speech corpus was generated by the researcher for use in psychoacoustic listening experiments towards their PhD research in 2019/20, in particular for quantitative speech-in-noise (SIN) tests of speech intelligibility. The recordings and associated materials have been made openly available via the University of Salford's Figshare repository. 
Type Of Material Improvements to research infrastructure 
Year Produced 2019 
Provided To Others? Yes  
Impact The advantages of this particular recording are that: 1) it features all 720 phonetically-balanced sentences of the corpus, allowing for large numbers of trials without creating a learning effect; 2) it features a female, native British-English speaker, negating the strongly significant effect of accent when testing speech intelligibility with British-English participants; 3) since it has been digitally recorded, this version is of a much higher audio quality than previous versions recorded by others on tape. 
URL https://salford.figshare.com/collections/HARVARD_speech_corpus_-_audio_recording_2019/4437578
 
Description Big Bang Fair 2018 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Schools
Results and Impact Represented the University of Salford on a stand at the Big Bang Fair, a STEM outreach event at the Birmingham NEC attended by several thousand people from across the UK.
Delegates on Day 1 were middle- and secondary-school pupils and their teachers; Day 2 was for families.

Used visual props to get people interacting with and talking about the science of sound; promoted the terms 'acoustics' and 'audio engineering', which many teachers and pupils may not have heard of before, much less discussed as potential application areas or career pathways with the subjects that they are studying; promoted the undergraduate degree programmes in these areas, in particular answering questions from parents.

Participation in this outreach event enabled me to engage in 2-way conversations with large numbers of the general public about my research, using the visual props to help explain complex concepts in simple-to-understand lay terms.

My visibility on the stand, as a woman in STEM, also resulted in several conversations with girls about taking subjects that they traditionally think of as 'hard', and encouraging them to think otherwise.
Year(s) Of Engagement Activity 2018
 
Description Three Minute Thesis Competition (3MT) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Postgraduate students
Results and Impact Participated in the University of Salford's heat of Three Minute Thesis (3MT), an internationally-recognised competition in which participants have a maximum of three minutes and a single PowerPoint slide in which to present their doctoral research to an audience.

The event was reportedly filmed and recorded by the university.

Created greater awareness not only of my own research related to this award, but also generally of the opportunities that the associated technological developments will generate for the general public.
Year(s) Of Engagement Activity 2019