
Voice and text cannot be absolutely synchronized. #732

Closed
fiyen opened this issue Nov 18, 2024 · 6 comments

Comments

@fiyen

fiyen commented Nov 18, 2024

Description

It's not a bug, but a much-needed feature.

If reporting a bug, please fill out the following:

Environment

  • pipecat-ai version: 0.0.48
  • python version: 3.11
  • OS: Windows

Issue description

I am using pipecat-ai as the backend, with the frontend receiving both voice and text. For various reasons, the frontend cannot play the streamed voice back smoothly, so my strategy is to concatenate a portion of the streamed audio from the backend into a larger segment before playback (I am using the voice features provided by Azure). To avoid disruptions during playback, I split each sentence into smaller units and, as the audio is transmitted back, send text to the frontend to indicate the start and end of the audio. This lets the frontend know when to play the audio.

Everything worked well up to version 0.0.47, but after version 0.0.48 was introduced, the start of voice playback became very chaotic. Even though I confirmed that the text and audio were sent in the correct sequence, the frontend still could not receive them in the correct order at the right timing (the ordering of the audio itself is generally consistent, and so is the ordering of the text, but the voice and text are mixed up relative to each other).

Repro steps

Send voice and text in a specific order, then receive them at the frontend and check whether the order in which the voice and text arrive matches the order in which they were sent.

Expected behavior

Voice and text can be absolutely synchronized.

Actual behavior

Voice and text cannot be absolutely synchronized.

Logs

@manish-baghel

Which frames are you using for sending text and voice? Some example code would help in understanding the issue better.

@fiyen
Author

fiyen commented Nov 20, 2024

@manish-baghel
In simple terms, I want certain text to be sent to the frontend in exactly the sequence I specify. For example, I want to send a marker indicating the end of an audio segment after each sentence, so my frontend can change its status display based on this marker, such as switching from "responding" to "listening." To achieve this, I added a piece of code to AzureTTSService:

# Inside a subclass of AzureTTSService
async def run_tts(self, text: str) -> AsyncGenerator[Frame, None]:
    # Yield the synthesized audio frames as usual.
    async for frame in super().run_tts(text):
        yield frame
        # logger.debug(f"azure frame: {frame}")
    # After the sentence has been synthesized, push a marker message that
    # carries the sentence text so the frontend knows this segment is done.
    voice_text_frame = TransportMessageFrame(f"###VOICEPARTEND###{text}")
    await self.push_frame(voice_text_frame)

As demonstrated in the code above, I want to emit a voice_text_frame after each sentence is processed, to mark the completion of that sentence and to carry its text as subtitles. However, since I can only mark the point where the AI's speech ends here, and the audio sent back to the frontend is further chunked into audio blocks in the output transport before transmission, I can't guarantee that the audio and my text are sent in the exact order I require. My current workaround disrupts the code significantly. If possible, I would appreciate some examples of more elegant implementations. Here is how I modified my implementation:

The code above was changed to:

async def run_tts(self, text: str) -> AsyncGenerator[Frame, None]:
    async for frame in super().run_tts(text):
        yield frame
        # logger.debug(f"azure frame: {frame}")
    # VoiceEndTextFrame is a custom frame (a subclass of AudioRawFrame):
    # empty audio payload, the TTS sample rate, one channel, and the marker text.
    voice_text_frame = VoiceEndTextFrame(b'', self._settings["sample_rate"], 1, f"###VOICEPARTEND###{text}")
    await self.push_frame(voice_text_frame)

In this part, I defined a new VoiceEndTextFrame to mark the end of a sentence. VoiceEndTextFrame inherits from AudioRawFrame, which allows it to flow through base_output unchanged. I added VoiceEndTextFrame to self._audio_out_queue in base_output, so that when self._audio_out_task_handler runs I can handle VoiceEndTextFrame separately. I also changed the call write_raw_audio_frames(frame.audio) to write_raw_audio_frames(frame), so that inside write_raw_audio_frames I can check whether a VoiceEndTextFrame was passed in; if so, write_raw_audio_frames sends the text instead of the audio stream.

These changes involve significant modifications to the code, and while they achieve the functionality I need, I am concerned that I will have to keep making extensive changes in the future to adapt to new code updates. So I would like to know whether there are more elegant solutions available.
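For instance, would something along these lines be the intended pattern? This is only a rough, untested sketch of a separate processor placed right after the TTS service; it assumes a TTSStoppedFrame is pushed downstream after each run_tts call, the class name and marker are illustrative, and it still would not control how the output transport interleaves buffered audio with transport messages:

from pipecat.frames.frames import Frame, TTSStoppedFrame, TransportMessageFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor

class SentenceEndMarker(FrameProcessor):
    # Passes every frame through unchanged and, when the TTS service signals
    # that it has finished speaking (TTSStoppedFrame), pushes a transport
    # message so the client can update its UI state.
    def __init__(self, marker: str = "###VOICEPARTEND###"):
        super().__init__()
        self._marker = marker

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        # Forward the original frame first so audio keeps flowing.
        await self.push_frame(frame, direction)
        if isinstance(frame, TTSStoppedFrame):
            # Then emit the end-of-sentence marker for the client.
            await self.push_frame(TransportMessageFrame(message=self._marker))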

@fiyen
Author

fiyen commented Nov 20, 2024

@manish-baghel By the way, would you please take a look at this issue as well? The solution provided in issue #717 became unusable after version 0.0.48 because the handling of interruptions changed, but the same problem still occurs after version 0.0.48. Specifically, if the user pauses for a slightly longer time while thinking, the subsequent sentences are completely lost (Azure does in fact recognize the subsequent content, but since self._send_aggregation has already been executed, the newly recognized content is discarded).

One idea is to increase the silence duration setting at the end of VAD, but this does not completely prevent the issue, because when interruptions are allowed (set to True), an interruption can still occur before the AI responds. To the user, it looks as though the AI recognized the speech but did not handle the content correctly.
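To be concrete, this is the kind of VAD adjustment I mean; just a minimal sketch following the SileroVADAnalyzer / VADParams configuration used in pipecat's transport examples (the stop_secs value is illustrative and the import paths have moved between releases):

from pipecat.audio.vad.silero import SileroVADAnalyzer  # older releases: pipecat.vad.silero
from pipecat.audio.vad.vad_analyzer import VADParams  # older releases: pipecat.vad.vad_analyzer

# Allow a longer silence (default is ~0.8s) before the turn is considered
# finished, so a short thinking pause is less likely to trigger aggregation.
vad_analyzer = SileroVADAnalyzer(params=VADParams(stop_secs=1.2))

# The analyzer is then passed in the transport params, e.g.
#   DailyParams(..., vad_enabled=True, vad_analyzer=vad_analyzer)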

@markbackman
Contributor

We'll have to think more about this. The limitation may be the TTS service...or at least that's something we'll have to work around. There are two services—CartesiaTTSService and ElevenLabsTTSService—that support word level timestamps. Because of this, you can get text and audio synchronized at the word level. Without this word level timestamp information from the TTS service, it's hard to do this precisely.

If you're building a client/server app with RTVI, the word level information is easy to get. You just add an RTVIBotTTSProcessor to your pipeline after the TTS service. Then, on the client-side, you handle BotTtsText messages.
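Roughly like this; a minimal sketch of the pipeline placement, assuming the RTVI processors live in pipecat.processors.frameworks.rtvi (import paths and constructor arguments may differ between releases), with the other processors passed in from your own setup:

from pipecat.pipeline.pipeline import Pipeline
from pipecat.processors.frameworks.rtvi import RTVIBotTTSProcessor

def build_pipeline(transport, stt, context_aggregator, llm, tts) -> Pipeline:
    # Place RTVIBotTTSProcessor right after the TTS service so the client
    # receives BotTtsText messages as the words are spoken.
    return Pipeline([
        transport.input(),
        stt,
        context_aggregator.user(),
        llm,
        tts,                    # e.g. CartesiaTTSService (word-level timestamps)
        RTVIBotTTSProcessor(),  # forwards the TTS text to the RTVI client
        transport.output(),
        context_aggregator.assistant(),
    ])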

If you're not building with RTVI, then you can add a custom processor that emulates the same behavior to get access to the text as it's emitted.

I know this isn't what you're asking for, but I'm sharing a solution that already works today, if you're willing to try a different TTS provider. We'll consider ways to make this work, but it's going to be difficult.

@fiyen
Author

fiyen commented Nov 21, 2024

> We'll have to think more about this. The limitation may be the TTS service...or at least that's something we'll have to work around. There are two services—CartesiaTTSService and ElevenLabsTTSService—that support word level timestamps. Because of this, you can get text and audio synchronized at the word level. Without this word level timestamp information from the TTS service, it's hard to do this precisely.
>
> If you're building a client/server app with RTVI, the word level information is easy to get. You just add an RTVIBotTTSProcessor to your pipeline after the TTS service. Then, on the client-side, you handle BotTtsText messages.
>
> If you're not building with RTVI, then you can add a custom processor that emulates the same behavior to get access to the text as it's emitted.
>
> I know this isn't what you're asking for, but I'm sharing a solution that already works today, if you're willing to try a different TTS provider. We'll consider ways to make this work, but it's going to be difficult.

It is cool, thank you for your comment! I will think about how to use RTVI to enhance my client.

@markbackman
Contributor

@fiyen I'm going to close this issue. @aconchillo and I discussed this, and the TTS service needs to provide word-level timestamps to make this possible. Both the Cartesia and ElevenLabs TTS services support word-level timestamps, so you can try out one of those services.
