Voice and text cannot be absolutely synchronized. #732
Comments
Which frames are you using for sending text and voice? Example code would help in understanding the issue better.
@manish-baghel
As demonstrated in the code above, I want to return a … The code above was changed to:
In this part, I defined a new …
@manish-baghel btw, would you please take a look at this issue as well? The solution provided in issue #717 became unusable after version 0.0.48 because the handling of interruptions changed. However, the same problem still occurs after version 0.0.48. Specifically, if the user pauses for a slightly longer time while thinking, the subsequent sentences are completely lost (in fact, Azure does recognize the subsequent content, but since … One idea is to increase the silence duration setting at the end of VAD, but this does not completely prevent the issue, because when interruptions are allowed (set to True), an interruption can still occur before the AI responds. To the user, it seems as though the AI recognized the speech but did not handle the content correctly.
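One way to mitigate the lost-tail problem described above, complementary to raising the VAD silence duration, is to hold transcript fragments briefly and only finalize an utterance once the silence gap exceeds a threshold, so a thinking pause does not drop the rest of the sentence. A minimal stdlib-only sketch; the class name, `hold_secs` parameter, and overall design are hypothetical illustrations, not a pipecat API:

```python
class PauseTolerantAggregator:
    """Aggregate transcript fragments, finalizing an utterance only after
    a silence longer than `hold_secs`, so a mid-thought pause does not
    lose the sentences that follow it."""

    def __init__(self, hold_secs: float = 1.2):
        self.hold_secs = hold_secs
        self._parts = []
        self._last_ts = None

    def add(self, text: str, ts: float):
        """Add a recognized fragment arriving at time `ts` (seconds).

        Returns the previously buffered utterance if the silence gap
        before this fragment exceeded `hold_secs`, else None."""
        finalized = None
        if (self._last_ts is not None
                and ts - self._last_ts > self.hold_secs
                and self._parts):
            finalized = " ".join(self._parts)
            self._parts = []
        self._parts.append(text)
        self._last_ts = ts
        return finalized

    def flush(self):
        """Finalize and return whatever is still buffered, if anything."""
        out = " ".join(self._parts) if self._parts else None
        self._parts = []
        return out
```

With this approach the response is triggered only when a full utterance is finalized, rather than on the first VAD-detected silence.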
We'll have to think more about this. The limitation may be the TTS service... or at least that's something we'll have to work around. Two services, CartesiaTTSService and ElevenLabsTTSService, support word-level timestamps. Because of this, you can get text and audio synchronized at the word level. Without that word-level timestamp information from the TTS service, it's hard to do this precisely. If you're building a client/server app with RTVI, the word-level information is easy to get. You just add an … If you're not building with RTVI, then you can add a custom processor that emulates the same behavior to get access to the text as it's emitted. I know this isn't what you're asking for, but I'm sharing a solution that already works today, if you're willing to try a different TTS provider. We'll consider ways to make this work, but it's going to be difficult.
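Once a TTS service returns per-word start times (as the Cartesia and ElevenLabs services mentioned above can), aligning text with audio reduces to a lookup against the current playback offset. A minimal stdlib-only illustration of that idea; the class name and data shape are hypothetical, not part of pipecat or either TTS API:

```python
import bisect

class WordSync:
    """Align TTS text with audio playback using per-word start times
    (seconds from the beginning of the generated audio)."""

    def __init__(self, words, starts):
        # words:  ["Hello", "there", "friend"]
        # starts: [0.0, 0.4, 0.9]  -- must be sorted ascending
        assert len(words) == len(starts)
        self.words = words
        self.starts = starts

    def spoken_by(self, offset_secs: float) -> list:
        """Return the words whose start time is <= the playback offset,
        i.e. the text the listener has heard so far."""
        i = bisect.bisect_right(self.starts, offset_secs)
        return self.words[:i]
```

A frontend can call `spoken_by()` with its audio player's current position to display exactly the words that have been voiced, which is the word-level synchronization the timestamps make possible.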
That's cool, thank you for your comment! I'll think about how to use RTVI to enhance my client.
@fiyen I'm going to close this issue. @aconchillo and I discussed it, and the TTS service needs to provide word-level timestamps to support this. Both the Cartesia and ElevenLabs TTS services support word-level timestamps, so you can try one of those services.
Description
It's not a bug, but a much-needed feature.
Issue description
I am using pipecat-ai as the backend, with the frontend receiving both voice and text. For various reasons, the frontend cannot stream voice playback smoothly, so my strategy is to concatenate a portion of the streamed voice from the backend into a larger segment before playback (I am using the voice features provided by Azure). At the same time, to avoid disruptions during playback, I split each sentence into smaller units and, as the voice is transmitted back, send text to the frontend marking the start and end of the voice, so the frontend knows when to play it.

Everything worked well up through version 0.0.47, but after version 0.0.48 was introduced, the start of voice playback became very chaotic. Even though I confirmed that the text and audio were sent in the correct sequence, the frontend still couldn't receive them in the correct order at the right timing (the ordering within the audio stream is generally consistent, and likewise within the text stream, but the voice and text are mixed up relative to each other).
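One way to make the relative ordering robust regardless of transport timing is to tag every outgoing text and audio message with a single shared sequence number and reorder on the frontend. A minimal stdlib-only sketch of that idea; the helper and class names are hypothetical, not part of pipecat:

```python
import itertools

# One monotonically increasing counter shared by text and audio messages,
# so the receiver can restore a single total order across both streams.
_seq = itertools.count()

def tag(msg: dict) -> dict:
    """Attach the next sequence number to an outgoing text or audio message."""
    msg["seq"] = next(_seq)
    return msg

class OrderedReceiver:
    """Frontend-side buffer that releases messages strictly in sequence
    order, even if the transport interleaves text and audio out of order."""

    def __init__(self):
        self._next = 0
        self._pending = {}

    def receive(self, msg: dict) -> list:
        """Buffer msg; return every message now deliverable in order."""
        self._pending[msg["seq"]] = msg
        out = []
        while self._next in self._pending:
            out.append(self._pending.pop(self._next))
            self._next += 1
        return out
```

With this scheme the frontend never has to guess which text marker belongs to which audio segment: it simply plays messages in the order `OrderedReceiver` releases them.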
Repro steps
Send voice and text in a specific order, and at the frontend, receive the voice and text to determine whether the order of the voice and text matches the order of transmission.
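The repro check above can be expressed as a small invariant: every audio chunk for a sentence must arrive between that sentence's "start" and "end" text markers. A hypothetical stdlib-only checker (event shape is an illustration, not the project's actual message format):

```python
def audio_within_markers(events) -> bool:
    """Verify each audio chunk arrives while its sentence is 'open'.

    events: list of dicts, either
      {"type": "text", "kind": "start"|"end", "sentence": n}  (marker)
      {"type": "audio", "sentence": n}                        (chunk)
    Returns False as soon as an audio chunk arrives outside the
    start/end markers of its sentence."""
    open_sentences = set()
    for ev in events:
        if ev["type"] == "text":
            if ev["kind"] == "start":
                open_sentences.add(ev["sentence"])
            else:
                open_sentences.discard(ev["sentence"])
        elif ev["type"] == "audio":
            if ev["sentence"] not in open_sentences:
                return False
    return True
```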
Expected behavior
Voice and text can be absolutely synchronized.
Actual behavior
Voice and text cannot be absolutely synchronized.
Logs