[Help]: MaskGCT's results were very strange #359

Open
WhiteNightMo opened this issue Nov 21, 2024 · 10 comments

WhiteNightMo commented Nov 21, 2024

Problem Overview

I modified models/tts/maskgct/maskgct_inference.py as follows:

    # inference
    prompt_wav_path = "./models/tts/maskgct/wav/5s.wav"
    save_path = "generated_audio7.wav"
    prompt_text = "想要交友吗?快来SOUL啊"
    target_text = "新用户真的可以享年化利率最低3.6%的优惠"
    # Specify the target duration (in seconds). If target_len = None, we use a simple rule to predict the target duration.
    target_len = None
    maskgct_inference_pipeline = MaskGCT_Inference_Pipeline(
        semantic_model,
        semantic_codec,
        codec_encoder,
        codec_decoder,
        t2s_model,
        s2a_model_1layer,
        s2a_model_full,
        semantic_mean,
        semantic_std,
        device,
    )

    recovered_audio = maskgct_inference_pipeline.maskgct_inference(
        prompt_wav_path, prompt_text, target_text, "zh", "zh", target_len=target_len
    )

    sf.write(save_path, recovered_audio, 24000)

Run command:

python -m models.tts.maskgct.maskgct_inference

The output did not meet my expectations.

My original file:
5s.zip

Output file:
generated_audio7.zip

@HeCheng0625
Collaborator

It seems the target length is too long. You can specify an appropriate target length yourself by setting target_len.
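
One way to pick a starting value for target_len is to scale the prompt's speaking rate to the target text. The sketch below is only an illustration and is not the rule MaskGCT applies when target_len = None. The helper names (count_spoken_units, estimate_target_len) are hypothetical, and counting letters and digits is a crude proxy for spoken length (for example, "3.6%" is read with more syllables than it has characters).

```python
import unicodedata


def count_spoken_units(text: str) -> int:
    """Count letters and digits, skipping punctuation, symbols, and spaces."""
    return sum(1 for ch in text if unicodedata.category(ch)[0] in ("L", "N"))


def estimate_target_len(prompt_duration_s: float, prompt_text: str,
                        target_text: str, min_s: float = 1.0) -> float:
    """Estimate a target duration in seconds from the prompt's speaking rate.

    Hypothetical heuristic: assumes the target is spoken at the same
    units-per-second rate as the prompt.
    """
    if prompt_duration_s <= 0:
        raise ValueError("prompt_duration_s must be positive")
    prompt_units = count_spoken_units(prompt_text)
    if prompt_units == 0:
        raise ValueError("prompt_text contains no letters or digits")
    rate = prompt_units / prompt_duration_s  # spoken units per second
    return max(min_s, count_spoken_units(target_text) / rate)


if __name__ == "__main__":
    # Assumption: the prompt clip lasts about 5 seconds (suggested only by its file name).
    estimate = estimate_target_len(
        5.0, "想要交友吗?快来SOUL啊", "新用户真的可以享年化利率最低3.6%的优惠"
    )
    print(f"suggested target_len: {estimate:.1f} s")
```

The result is only a starting point. If the prompt text does not describe exactly what is spoken in the prompt audio, the estimated rate will be off as well.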

@WhiteNightMo
Author

I tried setting target_len to 8, but the output audio was missing the first 3 words and was slow overall. When I set it to 10, the output audio read everything, but the speed was really slow.
10s.zip

@decajcd

decajcd commented Nov 22, 2024

Has this been solved?

@WhiteNightMo
Author

Not yet. I haven't been able to get it working.

@decajcd

decajcd commented Nov 22, 2024

I can't tune it either. The output is either too fast or gibberish.

@WhiteNightMo
Author

That's rough. For me the output is either too slow or gibberish.

@decajcd

decajcd commented Nov 22, 2024

I also get background noise.

@digitalboy

Has anyone fixed this?

@ruby11dog

Your prompt audio and prompt text do not match exactly.
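
A quick way to check this is to compare prompt_text against a transcript of the prompt audio. The sketch below assumes you already have a transcript (typed by hand or produced by any ASR tool). The function names and the comparison method are hypothetical and are not part of MaskGCT.

```python
import difflib
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, apply NFKC, and keep only letters and digits."""
    text = unicodedata.normalize("NFKC", text).lower()
    return "".join(ch for ch in text if unicodedata.category(ch)[0] in ("L", "N"))


def transcript_similarity(prompt_text: str, transcript: str) -> float:
    """Return a similarity ratio in [0, 1] between prompt_text and the transcript."""
    matcher = difflib.SequenceMatcher(None, normalize(prompt_text), normalize(transcript))
    return matcher.ratio()


if __name__ == "__main__":
    # Hypothetical transcript: replace it with what is actually spoken in the clip.
    transcript = "hello world"
    print(transcript_similarity("Hello, world!", transcript))
```

A ratio well below 1 suggests that the prompt text should be corrected to what the clip actually says before running inference.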

@wjinwei

wjinwei commented Dec 5, 2024

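The speech I synthesize has a lot of background noise; I can only roughly make out the content and the timbre. After a lot of checking, I found that using the semantic codes extracted from the wav has the same problem. The original poster's output is comparatively normal. Could you post your conda list output so we can compare version information?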
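
As an alternative to pasting the full conda list output, a short script can report only the packages of interest. This is a sketch. The package list below is an assumption about what might matter here, not a list confirmed by the maintainers.

```python
from importlib import metadata


def collect_versions(packages):
    """Map each package name to its installed version, or None if not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Assumed list of relevant packages; adjust it to your environment.
    names = ["torch", "torchaudio", "transformers", "numpy", "librosa", "soundfile"]
    for name, version in collect_versions(names).items():
        print(f"{name}: {version or 'not installed'}")
```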
