Problem with Italian dataset creation #256

diegobernagozzi · 2025-03-11T09:20:35Z

Hi everyone,

I'm trying to create an Italian dataset for MeloTTS Italian training, but when I start the training it produces only unintelligible sounds. I think the issue could be in files such as metadata.list.cleaned, train.list and val.list.

After running

python3 preprocess_text.py --metadata data/example/metadata.list

the phonemes of many phrases in the file are like this:

../../audio_1/Untitled_MIC_1_995.wav|Italian|IT|Ogni giorno mi avvicino sempre di più al mio sogno|_ ˈ o ɲ ɲ i ˈ d ͡ ʒ o r n o m i a v v i t ͡ ʃ i n o ˈ s ɛ m p r e ˈ d i p j u a l ˈ m i o s o ɲ o _|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0|1 5 8 2 5 5 7 3 3 2 4 4 1
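To rule out a simple length mismatch, a line like the one above can be checked with a small script. This is only a sketch: it assumes the seven pipe-separated fields are path, speaker, language, text, phonemes, tones and word2ph (as they appear in the line shown), and the helper names `parse_cleaned_line` and `check_entry` and the sample path are made up for illustration, not part of MeloTTS.

```python
import unicodedata

# Hypothetical sample in the same layout as the line above (path is made up).
SAMPLE = (
    "example.wav|Italian|IT|giorno|"
    "_ \u02c8 d \u0361 \u0292 o r n o _|"
    "0 0 0 0 0 0 0 0 0 0|"
    "1 8 1"
)


def parse_cleaned_line(line):
    """Split one line into its seven fields (assumes the text has no '|')."""
    path, speaker, language, text, phones, tones, word2ph = (
        line.rstrip("\n").split("|")
    )
    return {
        "path": path,
        "speaker": speaker,
        "language": language,
        "text": text,
        "phones": phones.split(),
        "tones": [int(t) for t in tones.split()],
        "word2ph": [int(n) for n in word2ph.split()],
    }


def check_entry(entry):
    """Return a list of human-readable problems found in one parsed entry."""
    problems = []
    n = len(entry["phones"])
    if len(entry["tones"]) != n:
        problems.append(f"{n} phonemes but {len(entry['tones'])} tones")
    if sum(entry["word2ph"]) != n:
        problems.append(f"{n} phonemes but word2ph sums to {sum(entry['word2ph'])}")
    for i, token in enumerate(entry["phones"]):
        # A token made only of combining marks cannot be a phoneme by itself.
        if all(unicodedata.combining(ch) for ch in token):
            problems.append(f"token {i} is a lone combining mark U+{ord(token[0]):04X}")
    return problems


entry = parse_cleaned_line(SAMPLE)
for problem in check_entry(entry):
    print(problem)
```

For the full line above, the three counts agree (50 phoneme tokens, 50 tones, and word2ph sums to 50), so the lengths are consistent and only the lone combining mark would be reported.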

gruut generates the symbol ͡, which the training code treats as a standalone symbol, and this may be causing the problem. If I run this gruut test code (from the gruut repo):

from gruut import sentences

ssml_text = """<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
    xml:lang="en-US">
<s>Today at 4pm, 2/1/2000.</s>
<s xml:lang="it">Ogni giorno mi avvicino sempre di più al mio sogno</s>
</speak>"""

for sent in sentences(ssml_text, ssml=True):
    for word in sent:
        if word.phonemes:
            print(sent.idx, word.lang, word.text, *word.phonemes)

the Italian output looks like this:

1 it Ogni ˈo ɲ ɲ i
1 it giorno ˈd͡ʒ o r n o
1 it mi m i
1 it avvicino a v v i t͡ʃ i n o
1 it sempre ˈs ɛ m p r e
1 it di ˈd i
1 it più ˈp j u
1 it al a l
1 it mio ˈm i o
1 it sogno s o ɲ o

It seems that ͡ is not meant to be a standalone symbol, so I'd like to know how to handle it in the MeloTTS code.
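One possible workaround, sketched below, is to rejoin each `X ͡ Y` triple into a single token after the phoneme string has been split on spaces, and to shrink the matching word2ph entry. This is an assumption-heavy sketch, not the MeloTTS method: `merge_tie_bars` is a hypothetical helper, and whether merged symbols such as d͡ʒ and t͡ʃ exist in the symbol table used for training is not confirmed here, so they may need to be added there.

```python
TIE_BAR = "\u0361"  # COMBINING DOUBLE INVERTED BREVE, used as the IPA tie bar


def merge_tie_bars(phones, word2ph):
    """Rejoin 'X', tie bar, 'Y' into one 'X͡Y' token and adjust word2ph.

    phones  -- list of phoneme tokens for one sentence
    word2ph -- number of phoneme tokens belonging to each word
    """
    if sum(word2ph) != len(phones):
        raise ValueError("word2ph does not cover the phoneme sequence")

    merged_phones, merged_word2ph = [], []
    pos = 0
    for count in word2ph:
        word = phones[pos:pos + count]
        pos += count
        out = []
        i = 0
        while i < len(word):
            # Merge only when the tie bar has a neighbour on both sides.
            if word[i] == TIE_BAR and out and i + 1 < len(word):
                out[-1] = out[-1] + TIE_BAR + word[i + 1]
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged_phones.extend(out)
        merged_word2ph.append(len(out))
    return merged_phones, merged_word2ph


# "giorno" as it appears in the cleaned file, with sentence padding tokens.
phones = ["_", "\u02c8", "d", TIE_BAR, "\u0292", "o", "r", "n", "o", "_"]
merged_phones, merged_word2ph = merge_tie_bars(phones, [1, 8, 1])
print(merged_phones)
print(merged_word2ph)  # [1, 6, 1]
```

The tones field would also have to be shortened to the new phoneme count; in the line shown all tones are 0, so `[0] * len(merged_phones)` would match that pattern.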

Thanks in advance,

Diego
