Text to speech issue #8259

cvrle77 · 2024-04-20T18:30:32Z

Using the 4.0.5 beta, and absolutely blown away by the TTS integrated in the program. It's insanely close to what I've wanted, almost like you've been reading my mind on how it should work. Ok, 'nuff of this 'magic' stuff, now the issues... :)

Not sure whether it happens with other languages or engines, but in short: TTS doesn't respect line breaks.

Engine: ElevenLabs TTS
Voice: American Rachel
It is best heard when 'the' is at the end of the first line.
Example:

'We first washed the
orange with baking soda.'

I also have a question, since I am using the free version of ElevenLabs just to test the quality: does the selection of voices and languages increase with a paid account, and can we use our own trained voice in the software?

Attached video: 13.mp4

cvrle77 (Author) · 2024-04-20T20:41:29Z

Oh, I just figured out one big flaw: if I have a video an hour long, for example, and there is a single mistake or two (a typo, or TTS did a poor job on some part), I'll have to redo everything from scratch.

And that is:

time consuming,
expensive, if working with the paid ElevenLabs API,
with almost no control over the process.

So, how about an optimized interface that works like this:
We generate everything, but in a 'line-by-line' interface, with a play button under each TTS line (or some other kind of control),
meaning we have the text, and then a WAV we can play to hear how it sounds.

If we messed something up, or something got messed up in the process with punctuation and pauses, we can make changes and refresh TTS for that sentence only.

We should be able to play everything from SE, inspect how everything sounds, perfect it, and when we are done, SE should do the magic of merging everything in a single audio file, like it currently does.
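A minimal sketch of the line-by-line idea above, in Python rather than SE's own code. It assumes a hypothetical `synthesize(text, voice)` callable standing in for whichever TTS engine is used. Clips are cached under a hash of voice and text, so editing one line re-synthesizes only that line. None of these names come from Subtitle Edit.

```python
import hashlib
from typing import Callable, Dict, List


def clip_key(text: str, voice: str) -> str:
    """Stable cache key for one subtitle line spoken by one voice."""
    return hashlib.sha256(f"{voice}\n{text}".encode("utf-8")).hexdigest()


def generate_clips(
    lines: List[str],
    voice: str,
    synthesize: Callable[[str, str], bytes],  # hypothetical engine call, not a real API
    cache: Dict[str, bytes],
) -> List[bytes]:
    """Return one audio clip per line, synthesizing only lines missing from the cache."""
    clips = []
    for text in lines:
        key = clip_key(text, voice)
        if key not in cache:
            # Only new or edited lines reach the (possibly paid) engine.
            cache[key] = synthesize(text, voice)
        clips.append(cache[key])
    return clips
```

With a cache like this, fixing a typo in one line of an hour-long subtitle would cost one engine call instead of a full regeneration.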

It sounds and looks much better and simpler in my head, hopefully you get the grasp of what I was saying. :D

Oh, and I am honored to be mentioned in *FIXED part in updates :)
Thank you.

niksedk (Maintainer) · 2024-04-21T05:05:50Z

What do you mean by TTS doesn't respect line breaks?

I think line breaks should be removed in normal situations before doing text to speech...
Perhaps SE should warn about line breaks and fast speech/high CPS?
I've added a little help here: https://www.nikse.dk/subtitleedit/help#text_to_speech

The list of voices for ElevenLabs will be in the eleven-labs-voices.json file in the TextToSpeech\ElevenLabs folder.
I got it from here: https://elevenlabs.io/docs/api-reference/get-voices
I do not think the selection of voices and languages increases with paid account.
I do not know how to add new voices.

It's a good idea with a preview before merging, but somewhat complex too. But perhaps something where voices can be removed.

niksedk (Maintainer) · 2024-04-21T10:16:01Z

Beta updated: https://github.com/SubtitleEdit/subtitleedit/releases/download/4.0.5/SubtitleEditBeta.zip

cvrle77 (Author) · 2024-04-21T14:06:25Z

Oh, wow, that was blazing fast, and it works 👍 Rough, but does the job.
Btw, you have two 'Regenerate' buttons; the first one should be EDIT, methinks.

I also found two, hmmm, let's say 'missing options'.

All audio files are placed in a temp folder with random names. It would be nice to have an option to change the working folder, or to work in the folder the video comes from (in a subfolder). Or at least to have the temp folder named after the video file rather than a random string.
We cannot save our work and reload it, like we can with subtitles. We have to regenerate the audio all over again. No idea how it would work, but an option to reload the TTS project is missing, so that I can continue working on it the next day.
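For the first point, a small Python sketch of deriving a working folder from the video path instead of a random temp name. The function name and the `_tts` suffix are made up for illustration and are not SE settings.

```python
from pathlib import Path


def tts_work_folder(video_path: str, suffix: str = "_tts") -> Path:
    """Folder next to the video, named after it (suffix is an illustrative choice)."""
    video = Path(video_path)
    return video.parent / (video.stem + suffix)
```

Per-line clips could then live in that folder under predictable names instead of random ones.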

> What do you mean by TTS doesn't respect line breaks?

I mean, if a subtitle has a line break (two lines), the last word of line 1 and the first word of line 2 are treated as one word.
In my example it's 'the orange': the words are merged and pronounced as 'theorange'. With most words you won't even notice the merge, it sounds the same. But with some, like in this case, it sounds different from what it should.

> Perhaps SE should warn about line breaks and fast speech/high CPS?

Or automatically ignore line breaks and process the text as if there were none? That way we don't have to edit the subtitle to get lines without breaks.
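To make the suggestion concrete, here is a minimal Python sketch of flattening a subtitle before it is sent to a TTS engine. It only illustrates the idea (a line break becomes one space, so 'the' and 'orange' stay separate words) and says nothing about how SE handles it internally. The function name is made up.

```python
import re


def flatten_for_tts(text: str) -> str:
    """Replace each line break (and surrounding whitespace) with a single space."""
    return re.sub(r"\s*\n\s*", " ", text).strip()
```

Applied to the example from the first post, the two subtitle lines become the single sentence 'We first washed the orange with baking soda.'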

EDIT: I've just noticed a strange bug in the 'regenerate' option. I'm not sure how to reproduce it or give a 'recipe', but when I edited one line that I had tried to regenerate, only removing the line break to make a single-line sentence, that part became silent, and I couldn't regenerate it again no matter what. I tried shortening the entire line to a single word, no difference; I tried changing the voice, nope. It just stays silent. That was with the ElevenLabs engine, and yes, I had enough credits. With Piper it works well, but the quality is nowhere near ElevenLabs. Could it be related to how you ask EL for that single line?

Tried multiple times; editing doesn't work well at all with ElevenLabs. The latest result was that the audio of all other lines is glitching and going too fast.

Also noticed that Piper likes to randomly hard-cut the last letter of the last word, around 50-100 ms at the end, and adds a pop sound. Regenerate solves that problem. :)

cvrle77 (Author) · 2024-04-21T16:38:38Z

> I do not think the selection of voices and languages increases with paid account.
> I do not know how to add new voices.

Well, that would be incredibly weird, because they have 1000s of voices and only a few dozen are pulled from their library.

If I enter my API key and pull the voices manually through their interface, I get 3 new voices from the VoiceLab on their website, which I picked from the 1000s of voices. But SE doesn't pull those voices. I've tried deleting eleven-labs-voices.json so that it would fetch a new one, but still nothing: it creates one and the same file and doesn't refresh it. If I request the response manually and copy the result into the ElevenLabs JSON file, I get those 3 extra voices.
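For anyone who wants to repeat the manual pull, a hedged Python sketch follows. It assumes the get-voices endpoint linked earlier in the thread is `GET https://api.elevenlabs.io/v1/voices` with the key sent in an `xi-api-key` header, and that the response holds a `voices` list whose items carry `name` and `voice_id`. Check those details against the ElevenLabs API reference before relying on them. The function names are illustrative and this is not how SE itself fetches the list.

```python
import json
import urllib.request
from typing import List, Tuple

# Assumed endpoint; verify against the ElevenLabs API reference.
VOICES_URL = "https://api.elevenlabs.io/v1/voices"


def build_voices_request(api_key: str) -> urllib.request.Request:
    """Prepare the request; the account's key decides which voices come back."""
    return urllib.request.Request(VOICES_URL, headers={"xi-api-key": api_key})


def fetch_voices_json(api_key: str) -> str:
    """Download the voice list and return it as indented JSON text."""
    with urllib.request.urlopen(build_voices_request(api_key)) as response:
        data = json.loads(response.read().decode("utf-8"))
    return json.dumps(data, indent=2)


def voice_names(payload: str) -> List[Tuple[str, str]]:
    """List (name, voice_id) pairs, assuming the response shape described above."""
    data = json.loads(payload)
    return [(voice["name"], voice["voice_id"]) for voice in data.get("voices", [])]
```

The text returned by `fetch_voices_json` could be saved over eleven-labs-voices.json by hand, and `voice_names` gives a quick check of whether the extra VoiceLab voices are in it.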

niksedk (Maintainer) · 2024-04-21T18:20:40Z

Beta updated: https://github.com/SubtitleEdit/subtitleedit/releases/download/4.0.5/SubtitleEditBeta.zip
If you press OK in the TTS window, the edited subtitle will be loaded into the SE main window.

Continuing a workflow could be done by adding more and more voices.

Can you try to download and overwrite the eleven-labs-voices.json file with the file from the api endpoint - does that give more voices?

> Also noticed that Piper likes to randomly hard cut last letter in last word, around 50-100ms at the end, and adds a pop sound.
> Regenerate solves that problem. :)

Is the re-generating still working for this problem? (it could also be caused by silence trimming made by SE)

cvrle77 (Author) · 2024-04-21T20:21:39Z

> If you press OK in the TTS window, the edited subtitle will be loaded into the SE main window.

Yes, nice addition.

> Continuing a workflow could be done by adding more and more voices.

I am not sure I understand correctly what you wrote, but I'd like to be able to load the project and just continue where I left off, fixing and regenerating the lines that I feel are awkwardly accented, have weird gurgling in the voice, or similar.
I see that ElevenLabs generates an mp3 file for each line, and Piper doesn't (or I might be mistaken).

Which is strange, because Piper has no issues with regenerating sentences (EDIT: yes it does, here's the error when I tried to do a final mix of audio).

ElevenLabs freezes on regenerate, so the regenerate function is useless for it at the moment.

> Can you try to download and overwrite the eleven-labs-voices.json file with the file from the api endpoint - does that give more voices?

Yes, I can and I did. I downloaded the file from the API endpoint and overwrote eleven-labs-voices.json with it. It's all one single line for some reason, but yes, it gives me the three extra voices that I've created.
You can get extra voices by going to elevenlabs.io/app/voice-library and adding three voices; they will then show up in VoiceLab.

But regenerate on ElevenLabs is still crashing when I try to edit something.

niksedk (Maintainer) · 2024-04-22T17:40:39Z

Beta updated with a few minor fixes - it should be possible to re-generate with ElevenLabs

https://github.com/SubtitleEdit/subtitleedit/releases/download/4.0.5/SubtitleEditBeta.zip

cvrle77 (Author) · 2024-04-22T18:45:47Z

There is no trigger to refresh eleven-labs-voices.json. Neither changing the API key nor opening the app refreshes the JSON file with updated voices. It's kinda important, but luckily there's a workaround, so it's not a matter of life and death atm.

EDIT: it is either not seeing the new API key, even though I can see it, or it is pulling the JSON with an old key that is stored somewhere and not overwritten. A voice ID is generated uniquely per user account, so if you have chosen the same voice as me, we have different IDs for the same voice in the JSON. I can see an error message, and inside it the string it is calling, which shows that it is trying to call an old voice ID that was used with a different API key. (I must test with multiple accounts, i.e. multiple API keys.)

EDIT2: ok, no idea what triggered pulling the JSON through the API, but it happened, I think, when I closed the app entirely.
I had issues because it would overwrite the JSON using a different API key, which means the API key doesn't change everywhere when I enter it in the app, and that causes waves of strange behavior. Maybe add a 'Save API key' button that would change the API key and refresh the JSON?

EDIT3: nope, if I remove the JSON file, it creates some default one without the extra voices that I have on my account. And that old/new file is dated 15.4.2024.

There's still no means of saving TTS as some sort of project and reusing the generated mp3 or wav files without having to regenerate everything and use up credits (this is a really important option).
Aaaand there's this. The weird part is, I can't close it, no button is working, and credit is being eaten (which is why saving is so important).
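To illustrate what 'saving TTS as a project' could mean, here is a minimal Python sketch under my own assumptions: a JSON manifest records, per line, the text, the voice and the audio file that was generated, and on reload only lines whose text or voice changed, or whose file is missing, are flagged for regeneration. The manifest layout and all names are hypothetical and do not describe an SE feature.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import List


@dataclass
class ClipEntry:
    index: int       # subtitle line number
    text: str        # text that was sent to the TTS engine
    voice: str       # voice used for this line
    audio_file: str  # path of the generated mp3/wav


def save_project(path: Path, entries: List[ClipEntry]) -> None:
    """Write the manifest as indented JSON."""
    payload = [asdict(entry) for entry in entries]
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")


def load_project(path: Path) -> List[ClipEntry]:
    """Read the manifest back into entries."""
    return [ClipEntry(**item) for item in json.loads(path.read_text(encoding="utf-8"))]


def lines_to_regenerate(entries: List[ClipEntry], current_lines: List[str], voice: str) -> List[int]:
    """Indexes of lines with no reusable clip (new, edited, other voice, or file gone)."""
    saved = {entry.index: entry for entry in entries}
    todo = []
    for index, text in enumerate(current_lines):
        entry = saved.get(index)
        if (
            entry is None
            or entry.text != text
            or entry.voice != voice
            or not Path(entry.audio_file).exists()
        ):
            todo.append(index)
    return todo
```

Everything not returned by `lines_to_regenerate` could be reused as-is the next day, so no credits are spent on lines that were already fine.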

niksedk (Maintainer) · 2024-04-23T18:13:56Z

> There's still no means of saving TTS as some sort of project and reusing generated mp3 or wav without the need to regenerate everything and use the credit (this is really important option)

Could be for version two... will take some time to develop.

> Aaaand there's this, weird part is, I can't close it, no button is working

Hm, I cannot re-create this.
What about latest beta: https://github.com/SubtitleEdit/subtitleedit/releases/download/4.0.5/SubtitleEditBeta.zip ?

cvrle77 (Author) · 2024-04-24T00:40:26Z

Yes, regenerating single lines with ElevenLabs and merging them together now works without errors, as far as I've tested.

> Could be for version two... will take some time to develop.

It would be a great help to see the 'regenerate' window opened as it is, but without any TTS generated, so we can pick whether we want a single line generated, several lines, or everything. This improves usability through the roof. We could play with different voices, do voiceovers for entire movies and do complex tasks, coming back to regenerate that single line of text that wasn't perfect...

niksedk (Maintainer) · 2024-04-26T03:42:03Z

If you use ASSA format with actors, you can map actors to voices:

cvrle77 (Author) · 2024-04-26T15:27:28Z

You misunderstood me. The point was to have the 'Review audio clips' window open with no audio generated yet.
Only from this window would we generate a single text line, multiple lines, or all of them, change voices, etc.
This would greatly reduce the cost of trial and error, because TTS is a game of trial and error.

At the moment, we can change voices per line even without ASSA, which is supercool.
But the TTS feature is good for play and for testing capabilities, not for starting projects.
