Parameterized Control of Voicing & Accent in Speech Generated by Deep Networks
Lead Research Organisation: University of Southampton
Department Name: School of Electronics and Computer Science
Abstract
DeepMind recently developed WaveNet, a convolutional network that can be used, among other things, to generate speech from text (Oord et al., 2016). The model shows a significant improvement in perceived audio quality over conventional systems. A parallelised version of the model was later introduced, with faster-than-real-time generation (Oord et al., 2017). Both papers focus mainly on the quality of speech generated from text, but the authors briefly experiment with modifying the voicing of the generated speech. They do this by introducing a speaker ID as a parameter to the model; because the network is trained on recordings from several different speakers, this allows multiple discrete 'voices' to be generated from the same model.
However, I believe that with a more advanced model of voicing it would be possible to develop a system which allows users to alter the voicing continuously, giving greater control over the sound. A sufficiently advanced system could potentially also allow modelling and parameterization of accent and prosody of voice. I also wish to explore whether it is possible to decouple this 'voicing' control model from the TTS model. This could potentially enable the building of a speech-to-speech transformation network, which can manipulate the voicing of existing audio recordings.
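As a rough illustration of the difference between discrete and continuous voicing control, the sketch below blends two speaker embeddings and feeds the result into a WaveNet-style gated unit, tanh(f + V_f·h) × sigmoid(g + V_g·h), reduced here to a single scalar channel. The abstract does not specify how continuous control would be achieved, so linear interpolation between embeddings is only one assumed possibility. The embedding values and the names `SPEAKER_EMBEDDINGS`, `blend_voices` and `gated_unit` are hypothetical.

```python
import math

# Hypothetical learned speaker embeddings, one vector per discrete speaker ID.
# The values are made up purely for illustration.
SPEAKER_EMBEDDINGS = {
    0: [0.2, -1.0, 0.5],
    1: [1.0, 0.4, -0.3],
}


def blend_voices(id_a, id_b, t):
    """Linearly interpolate between two speaker embeddings.

    t = 0 gives speaker id_a, t = 1 gives speaker id_b, and values in
    between give a continuous blend (an assumed form of voicing control).
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    a = SPEAKER_EMBEDDINGS[id_a]
    b = SPEAKER_EMBEDDINGS[id_b]
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def gated_unit(filter_in, gate_in, h, v_filter, v_gate):
    """Scalar gated activation with a global conditioning vector h.

    filter_in and gate_in stand in for the outputs of the dilated
    convolutions; v_filter and v_gate project h into each branch.
    """
    filt = math.tanh(filter_in + dot(v_filter, h))
    gate = 1.0 / (1.0 + math.exp(-(gate_in + dot(v_gate, h))))
    return filt * gate


if __name__ == "__main__":
    # A voice one quarter of the way from speaker 0 to speaker 1.
    h = blend_voices(0, 1, 0.25)
    print(gated_unit(0.3, -0.1, h, [0.5, 0.1, -0.2], [0.0, 0.3, 0.7]))
```

With a discrete speaker ID the conditioning vector can only take one of the stored values; the blend parameter `t` is what would make the control continuous in this sketch.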
People | ORCID iD |
---|---|
Katayoun Farrahi (Primary Supervisor) | |
Callum Anderson (Student) | |
Studentship Projects
Project Reference | Relationship | Related To | Start | End | Student Name |
---|---|---|---|---|---|
EP/R513325/1 | | | 01/10/2018 | 30/09/2023 | |
2280381 | Studentship | EP/R513325/1 | 01/10/2019 | 30/09/2022 | Callum Anderson |