Parameterized Control of Voicing & Accent in Speech Generated by Deep Networks

Lead Research Organisation: University of Southampton
Department Name: School of Electronics and Computer Science

Abstract

DeepMind has recently developed a convolutional network (WaveNet) which can be used, among other things, for generating speech from text (van den Oord et al., 2016). This model shows a significant improvement in perceived audio quality compared to conventional systems. A parallelised version of the model was later introduced, capable of faster-than-real-time generation (van den Oord et al., 2017). In both papers the main focus is on the quality of speech generated from text, but the authors briefly experiment with modifying the voicing of the generated speech. They do this by introducing a speaker ID as a conditioning input to the model: by training the network on recordings from several different speakers, multiple discrete 'voices' can be generated from the same model.
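To make this conditioning mechanism concrete, the sketch below shows one common way a discrete speaker ID can act as a global conditioning input to a WaveNet-style gated, dilated convolution layer. This is not DeepMind's released code; the layer names, sizes and the use of PyTorch are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConditionedWaveNetLayer(nn.Module):
    """Gated dilated convolution with global speaker-ID conditioning (illustrative)."""

    def __init__(self, channels: int, num_speakers: int, dilation: int):
        super().__init__()
        # Causal dilated convolutions for the filter and gate branches.
        self.filter_conv = nn.Conv1d(channels, channels, kernel_size=2,
                                     dilation=dilation, padding=dilation)
        self.gate_conv = nn.Conv1d(channels, channels, kernel_size=2,
                                   dilation=dilation, padding=dilation)
        # Each discrete speaker ID is mapped to a learned embedding vector.
        self.speaker_embedding = nn.Embedding(num_speakers, channels)
        # 1x1 projections that inject the speaker embedding into both branches.
        self.filter_cond = nn.Linear(channels, channels)
        self.gate_cond = nn.Linear(channels, channels)

    def forward(self, x: torch.Tensor, speaker_id: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); speaker_id: (batch,) integer IDs
        h = self.speaker_embedding(speaker_id)        # (batch, channels)
        h_f = self.filter_cond(h).unsqueeze(-1)       # broadcast over time
        h_g = self.gate_cond(h).unsqueeze(-1)
        # Trim the causal padding so output length matches input length.
        f = self.filter_conv(x)[..., :x.size(-1)]
        g = self.gate_conv(x)[..., :x.size(-1)]
        # Gated activation: tanh(filter + cond) * sigmoid(gate + cond).
        return torch.tanh(f + h_f) * torch.sigmoid(g + h_g)
```

Because the speaker embedding is added to every layer of the stack, a single trained network can switch between its learned voices simply by changing the ID it is given at generation time.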

However, I believe that with a more advanced model of voicing it would be possible to develop a system that allows users to alter the voicing continuously, giving greater control over the generated sound. A sufficiently advanced system could potentially also allow the accent and prosody of the voice to be modelled and parameterised. I also wish to explore whether this 'voicing' control model can be decoupled from the text-to-speech (TTS) model. This could enable the building of a speech-to-speech transformation network capable of manipulating the voicing of existing audio recordings.
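One possible shape for such continuous control, offered purely as illustration, is to replace the discrete embedding lookup with a conditioning vector blended from the learned speaker embeddings (or supplied directly by a user). The function name, weighting scheme and dimensions below are assumptions, not the project's eventual method.

```python
import torch

def blend_voicing(speaker_embeddings: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
    """Blend learned per-speaker embeddings into one continuous conditioning vector.

    speaker_embeddings: (num_speakers, channels) learned embedding table
    weights:            (num_speakers,) non-negative mixing weights
    """
    weights = weights / weights.sum()        # normalise so the blend stays on-scale
    return weights @ speaker_embeddings      # (channels,) continuous voicing vector

# Example: a voice 70% of the way from speaker 0 towards speaker 3.
table = torch.randn(8, 64)                   # stand-in for a trained embedding table
voicing = blend_voicing(table, torch.tensor([0.3, 0.0, 0.0, 0.7, 0.0, 0.0, 0.0, 0.0]))
```

The resulting vector would be fed into the conditioning projections in place of the output of the discrete embedding lookup, so the voicing can be varied smoothly rather than selected from a fixed set.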

Publications


Studentship Projects

Project Reference  Relationship  Related To    Start       End         Student Name
EP/R513325/1                                   01/10/2018  30/09/2023
2280381            Studentship   EP/R513325/1  01/10/2019  30/09/2022  Callum Anderson