If anyone has an Nvidia card and a couple of sound samples of Marge’s voice – you don’t need a ton of VRAM for this like you do for Stable Diffusion; an 8GB card will do – you can fire up Tortoise TTS and render this in her voice.
Is self-hosted TTS voice replication much better than it was about half a year ago? Last time I tried it, the output didn’t sound like the person I had samples of; I had to use ElevenLabs instead to get close to sounding right.
I mean, that’s a subjective question. I think it’s decent. Here are some samples:
https://nonint.com/static/tortoise_v2_examples.html
Last time I was running it, Tortoise TTS didn’t have a way to directly annotate a voice with intonation or emotion. The best you can do is pull tricks, like using a feature that lets you add words to a sentence that aren’t actually spoken but affect the emotional delivery of the words that are (e.g. sad words to make the spoken words come out in a sad voice).
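For reference, that trick is Tortoise’s bracketed “prompt engineering”: text wrapped in [brackets] conditions the delivery but gets redacted from the rendered audio. A minimal sketch of building such a prompt (the helper function name here is mine, not part of Tortoise):

```python
def with_emotion(text: str, emotion_hint: str) -> str:
    """Prepend an unspoken emotional cue in brackets.

    Tortoise TTS treats bracketed text as conditioning only: it
    influences how the rest of the sentence is delivered, but is
    cut out of the generated audio.
    """
    return f"[{emotion_hint}] {text}"

prompt = with_emotion("None of you will survive.", "I am in agony,")
# The bracketed portion shapes the voice; only the rest is spoken.
```

You’d then feed `prompt` to the usual Tortoise generation call in place of the raw sentence.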
Imagine the difference between someone saying gloatingly “none of you will survive” and someone saying it in an agonized voice.
I do wonder a bit whether it’d be possible to train it on a corpus that’s been automatically annotated with output from sentiment-analysis software, and then generate keywords that one could use to alter the sound of sentences. I don’t think this is so much a fundamental limitation of the software as a limitation of the training set.
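A toy version of that annotation pass might look like this – using a crude hand-rolled word lexicon as a stand-in for a real sentiment-analysis model, with every name and tag here being hypothetical, just to illustrate auto-tagging a corpus with emotion keywords:

```python
# Toy auto-annotation pass: tag each training line with an emotion
# keyword derived from a (very crude) sentiment lexicon. A real
# pipeline would use a proper sentiment-analysis model instead.
LEXICON = {
    "survive": -1, "agonized": -2, "sad": -2, "happy": 2,
    "wonderful": 2, "terrible": -2, "love": 2, "hate": -2,
}

def sentiment_score(text: str) -> int:
    # Sum per-word scores; unknown words count as neutral (0).
    return sum(LEXICON.get(w.strip(".,!?"), 0)
               for w in text.lower().split())

def annotate(text: str) -> str:
    score = sentiment_score(text)
    tag = "happy" if score > 0 else "sad" if score < 0 else "neutral"
    # Emit the unspoken-keyword form the TTS model would train on.
    return f"[{tag}] {text}"

corpus = ["None of you will survive.", "What a wonderful day."]
annotated = [annotate(line) for line in corpus]
```

The point is that the emotion tags come for free from the annotator, so the TTS model could learn to associate them with delivery without anyone hand-labelling the corpus.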