wav2vec features #1
Will work on it in the next few days. Will keep you posted.
Thanks.
Out of curiosity, what is the total range of languages wav2vec covers, and how does it compare to something like WavLM?
Both models, wav2vec and WavLM, are widely recognized for their capabilities in Automatic Speech Recognition (ASR) tasks, and both use the LibriSpeech dataset for training. However, for audio feature extraction, wav2vec stands out as the preferred choice: it is an end-to-end approach, it is flexible, and its architecture is designed to extract high-level speech features directly from raw audio waveforms. In contrast, WavLM is primarily oriented towards generating speech from text inputs.
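As a point of reference, here is a minimal sketch of extracting frame-level wav2vec 2.0 features with the HuggingFace `transformers` API. The checkpoint name, the 16 kHz mono input, and the random placeholder waveform are assumptions for illustration, not necessarily what this repo uses:

```python
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2Model

# Example checkpoint; the repo may use a different ASR model (e.g. a HuBERT variant).
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h").eval()

waveform = torch.randn(16000)  # placeholder: 1 second of 16 kHz mono audio
inputs = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    # last_hidden_state: (batch, num_frames, hidden_dim); hidden_dim = 768 for the base model
    features = model(inputs.input_values).last_hidden_state

print(features.shape)
```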
How was wav2vecDS.pt trained? I would like to train it on my own dataset. I tried using other pre-trained wav2vec.pt models, but they don't seem to work.
@qiu8888 To avoid any confusion, wav2vecDS.pt is a torch model which I trained using the class _Wav2vecDS to learn a mapping from wav2vec features to DeepSpeech features. This way I can use wav2vec ASR models (in this case HubertASR) with the trained DINet model without causing any issues, since DINet was trained on DeepSpeech v0.1.0, which is very slow. I will update the readme and add instructions for training the mapping model on your own dataset in the next few days if needed. Keep in mind that I am not retraining the wav2vec ASR model itself; only the mapping is trained here. For more information about the wav2vec ASR models and how they were trained, please refer to their documentation here.
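The actual mapping lives in the repo's _Wav2vecDS class; purely to illustrate the idea, a small frame-wise regression network trained to map wav2vec features onto DeepSpeech-sized features might look like the sketch below. The 768/29 feature dimensions, layer sizes, loss, and training data are assumptions, not the repo's exact setup:

```python
import torch
import torch.nn as nn

class Wav2vecToDeepSpeech(nn.Module):
    """Hypothetical frame-wise mapping network (not the repo's exact _Wav2vecDS class)."""
    def __init__(self, in_dim=768, out_dim=29, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):      # x: (num_frames, in_dim) wav2vec features
        return self.net(x)     # -> (num_frames, out_dim) DeepSpeech-like features

# Training sketch: regress mapped features onto DeepSpeech features
# extracted from the same audio clips.
model = Wav2vecToDeepSpeech()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.MSELoss()

wav2vec_feats = torch.randn(100, 768)    # placeholder paired features
deepspeech_feats = torch.randn(100, 29)

for _ in range(10):
    optimizer.zero_grad()
    loss = criterion(model(wav2vec_feats), deepspeech_feats)
    loss.backward()
    optimizer.step()
```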
Hi, has this work been released?
@lidachuan211 @QuantJia I found a solution for Chinese, but it doesn't work well for other languages. For the moment I am trying a different approach: using newer versions of DeepSpeech and converting the model to ONNX for inference optimization. Unfortunately I can only work on this in limited time, but I will hopefully share a solution soon.
@Elsaam2y did you solve the ONNX conversion for inference optimization?
I tried to export DINet to ONNX, but I couldn't. I finally used TorchScript instead; the performance is the same anyway. The only advantage is that I can load it into the NVIDIA Triton Inference Server. For inference you can swap in the scripted model. For ONNX I tried an export as well, but it didn't work.
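For anyone following the TorchScript route described above, here is a minimal sketch of tracing a model and saving it for serving. The module, input shapes, and file name are placeholders, not the actual DINet code:

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    """Stand-in for the trained DINet generator (hypothetical signature)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 3, kernel_size=3, padding=1)

    def forward(self, source, reference, audio):
        # The real model fuses reference frames and audio features;
        # this placeholder only touches the source frames.
        return self.conv(source)

model = TinyNet().eval()

# Example inputs with placeholder shapes.
source = torch.randn(1, 3, 104, 80)
reference = torch.randn(1, 15, 104, 80)
audio = torch.randn(1, 29, 5)

# torch.jit.trace records the ops executed for these example inputs; the
# saved .pt file can then be served, e.g. by Triton's PyTorch backend.
traced = torch.jit.trace(model, (source, reference, audio))
traced.save("dinet_traced.pt")
```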
@Luckygyana @davidmartinrius The ONNX conversion for this model is a bit tricky, since it contains some operations that are not supported by ONNX, and it requires some modifications to make it work. Furthermore, it won't boost the inference speed significantly; it will be almost the same, since native torch models are already fast, unless you are planning to integrate it with some other models and prefer ONNX for ease of development.
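For completeness, a sketch of what such an export attempt typically looks like; unsupported operators surface as errors at this step. The module, shapes, and file name are placeholders, not the real DINet:

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    """Stand-in for the DINet generator (hypothetical inputs)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 3, kernel_size=3, padding=1)

    def forward(self, source, audio):
        return self.conv(source)

model = TinyNet().eval()
dummy_source = torch.randn(1, 3, 104, 80)
dummy_audio = torch.randn(1, 29, 5)

try:
    torch.onnx.export(
        model,
        (dummy_source, dummy_audio),
        "dinet.onnx",
        opset_version=16,
        input_names=["source", "audio"],
        output_names=["output"],
    )
except Exception as err:
    # Operators without an ONNX mapping fail here and need to be rewritten
    # or given a custom symbolic function.
    print("Export failed:", err)
```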
I have retrained SyncNet by mapping the wav2vec features to DeepSpeech features using your wav2vecDS.pt model, and the synchronization performance has improved somewhat. However, I would like to try the latest DeepSpeech model, but it has significant differences in parameters and output structure compared to v0.1.0. Can you help with that?
@qiu8888 I am working at the moment with DeepSpeech 0.6.0. If my tests pass, I will prepare a new mapping and push it. Which version of DeepSpeech did you try, 0.9.1? And did you notice a significant difference in speed compared to 0.1.0?
Any luck with your tests?
Hi @Elsaam2y, have you tried completely replacing the DeepSpeech features with wav2vec2 features and retraining SyncNet and DINet with that?
@9bitss Unfortunately we need to replace and retrain DINet with the latest DeepSpeech model. The first try didn't work out and the model didn't perform well. One alternative is to learn a mapping from the latest DeepSpeech model's features to the currently used ones, to avoid retraining. I haven't had time to test this yet, but it should work theoretically.
@ketyi With wav2vec we would need one model per language, since the English one won't perform well on other languages. That would add more complexity to the pipeline, hence I tried retraining with the latest version of DeepSpeech instead, as it supports ONNX and GPU.
@Elsaam2y but you are already using wav2vec in the pipeline, so I don't get your point.
@ketyi Sorry for my late response. I mean that retraining SyncNet and the model on wav2vec features would still have some issues regarding generalization. When I used wav2vec I didn't realize at first its problems with some languages, and hence I have recently been focusing on updating the model to use the latest version of DeepSpeech instead.
Really looking forward to the latest DeepSpeech. What's the ETA on training the mapping to work with it?
At the moment I am quite busy with some other projects, so I would estimate a few weeks.
Any progress? When will ONNX be supported?
I'm curious why wav2vec + wav2vecDS features are used during inference, while DeepSpeech features are used during training. Shouldn't both use wav2vec + wav2vecDS? From the discussion above, it seems that using wav2vec + wav2vecDS during training improved support for Chinese but made other languages worse; I'm not sure if I understand that correctly. If only Chinese needs to be supported, wouldn't using wav2vec + wav2vecDS for both training and inference give better results?
I have tried the .pb graphs of several DeepSpeech versions and found that their output dimensions differ from v0.1; the change must have been introduced in some later version, and training throws an error. These are the DeepSpeech versions I tried:
Tool for converting pbmm to pb
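One way to see where the output dimensions changed is to load a frozen DeepSpeech .pb and print the shapes of its logits-like ops. A sketch assuming TensorFlow is installed; the file path and the "logits" name filter are assumptions that vary between DeepSpeech releases:

```python
import tensorflow as tf

# Load a frozen DeepSpeech graph (.pb) and list the shapes of candidate
# output nodes, which makes the dimension change between releases visible.
graph_def = tf.compat.v1.GraphDef()
with tf.io.gfile.GFile("deepspeech.pb", "rb") as f:  # placeholder path
    graph_def.ParseFromString(f.read())

graph = tf.Graph()
with graph.as_default():
    tf.compat.v1.import_graph_def(graph_def, name="")

for op in graph.get_operations():
    # The logits node name differs across versions; filter loosely.
    if "logits" in op.name.lower():
        for out in op.outputs:
            print(op.name, out.shape)
```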
May I ask which dataset you used to train wav2vecDS? If I want to train it on my own dataset, how should I prepare the dataset? Also, are there any language requirements for the dataset?
Audio wav2vec features: is there support for Chinese?