
Voice and text cannot be absolutely synchronized. #732

Closed
fiyen opened this issue Nov 18, 2024 · 6 comments

Comments

@fiyen

fiyen commented Nov 18, 2024

Description

It's not a bug, but a much-needed feature.

If reporting a bug, please fill out the following:

Environment

  • pipecat-ai version: 0.0.48
  • python version: 3.11
  • OS: Windows

Issue description

I am using pipecat-ai as the backend, with the frontend receiving both voice and text. For various reasons, the frontend cannot play the streamed voice back smoothly, so my strategy is to concatenate a portion of the streamed audio from the backend into a larger segment before playback (I am using the voice features provided by Azure). To avoid disruptions during playback, I split each sentence into smaller units and, as the audio is transmitted back, send text to the frontend to indicate the start and end of the audio. This lets the frontend know when to play the audio.

Everything worked well up to version 0.0.47, but after version 0.0.48 was introduced, the start of voice playback became very chaotic. Even though I confirmed that the text and audio were sent in the correct sequence, the frontend still could not receive them in the correct order at the right timing (the ordering of the audio itself is generally consistent, and so is the ordering of the text, but the voice and text are mixed up relative to each other).

Repro steps

Send voice and text in a specific order, then receive them at the frontend and check whether the order in which the voice and text arrive matches the order in which they were sent.

Expected behavior

Voice and text can be absolutely synchronized.

Actual behavior

Voice and text cannot be absolutely synchronized.

Logs

@manish-baghel

Which frames are you using for sending text and voice? Some example code would help in understanding the issue better.

@fiyen
Author

fiyen commented Nov 20, 2024

@manish-baghel
In simple terms, I want certain text to be sent to the frontend in exactly the sequence I specify. For example, I want to send a marker indicating the end of an audio segment after each sentence, so my frontend can change its status display based on this marker, such as switching from "responding" to "listening." To achieve this, I added a piece of code to AzureTTSService:

# Inside a subclass of AzureTTSService
async def run_tts(self, text: str) -> AsyncGenerator[Frame, None]:
    # Yield the synthesized audio frames as usual.
    async for frame in super().run_tts(text):
        yield frame
        # logger.debug(f"azure frame: {frame}")
    # After the sentence has been synthesized, push a marker message that
    # carries the sentence text so the frontend knows this segment is done.
    voice_text_frame = TransportMessageFrame(f"###VOICEPARTEND###{text}")
    await self.push_frame(voice_text_frame)

As demonstrated in the code above, I want to emit a voice_text_frame after each sentence is processed, to mark the completion of that sentence and to carry its text as subtitles. However, since I can only mark the point where the AI's speech ends here, and the audio sent back to the frontend is further chunked into audio blocks in the output transport before transmission, I can't guarantee that the audio and my text are sent in the exact order I require. My current workaround disrupts the code significantly. If possible, I would appreciate some examples of more elegant implementations. Here is how I modified my implementation:

The code above was changed to:

async def run_tts(self, text: str) -> AsyncGenerator[Frame, None]:
    async for frame in super().run_tts(text):
        yield frame
        # logger.debug(f"azure frame: {frame}")
    # VoiceEndTextFrame is a custom frame (a subclass of AudioRawFrame):
    # empty audio payload, the TTS sample rate, one channel, and the marker text.
    voice_text_frame = VoiceEndTextFrame(b'', self._settings["sample_rate"], 1, f"###VOICEPARTEND###{text}")
    await self.push_frame(voice_text_frame)

In this part, I defined a new VoiceEndTextFrame to mark the end of a sentence. VoiceEndTextFrame inherits from AudioRawFrame, which allows it to flow through base_output unchanged. I added VoiceEndTextFrame to self._audio_out_queue in base_output, so that when self._audio_out_task_handler runs I can handle VoiceEndTextFrame separately. I also changed the call write_raw_audio_frames(frame.audio) to write_raw_audio_frames(frame), so that inside write_raw_audio_frames I can check whether a VoiceEndTextFrame was passed in; if so, write_raw_audio_frames sends the text instead of the audio stream.

These changes involve significant modifications to the code, and while they achieve the functionality I need, I am concerned that I will have to keep making extensive changes in the future to adapt to new code updates. So I would like to know whether there are more elegant solutions available.
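For instance, would something along these lines be the intended pattern? This is only a rough, untested sketch of a separate processor placed right after the TTS service; it assumes a TTSStoppedFrame is pushed downstream after each run_tts call, the class name and marker are illustrative, and it still would not control how the output transport interleaves buffered audio with transport messages:

from pipecat.frames.frames import Frame, TTSStoppedFrame, TransportMessageFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor

class SentenceEndMarker(FrameProcessor):
    # Passes every frame through unchanged and, when the TTS service signals
    # that it has finished speaking (TTSStoppedFrame), pushes a transport
    # message so the client can update its UI state.
    def __init__(self, marker: str = "###VOICEPARTEND###"):
        super().__init__()
        self._marker = marker

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        # Forward the original frame first so audio keeps flowing.
        await self.push_frame(frame, direction)
        if isinstance(frame, TTSStoppedFrame):
            # Then emit the end-of-sentence marker for the client.
            await self.push_frame(TransportMessageFrame(message=self._marker))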

@fiyen
Author

fiyen commented Nov 20, 2024

@manish-baghel By the way, would you please take a look at this issue as well? The solution provided in issue #717 became unusable after version 0.0.48 because the handling of interruptions changed, but the same problem still occurs after version 0.0.48. Specifically, if the user pauses for a slightly longer time while thinking, the subsequent sentences are completely lost (Azure does in fact recognize the subsequent content, but since self._send_aggregation has already been executed, the newly recognized content is discarded).

One idea is to increase the silence duration setting at the end of VAD, but this does not completely prevent the issue, because when interruptions are allowed (set to True), an interruption can still occur before the AI responds. To the user, it looks as though the AI recognized the speech but did not handle the content correctly.
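To be concrete, this is the kind of VAD adjustment I mean; just a minimal sketch following the SileroVADAnalyzer / VADParams configuration used in pipecat's transport examples (the stop_secs value is illustrative and the import paths have moved between releases):

from pipecat.audio.vad.silero import SileroVADAnalyzer  # older releases: pipecat.vad.silero
from pipecat.audio.vad.vad_analyzer import VADParams  # older releases: pipecat.vad.vad_analyzer

# Allow a longer silence (default is ~0.8s) before the turn is considered
# finished, so a short thinking pause is less likely to trigger aggregation.
vad_analyzer = SileroVADAnalyzer(params=VADParams(stop_secs=1.2))

# The analyzer is then passed in the transport params, e.g.
#   DailyParams(..., vad_enabled=True, vad_analyzer=vad_analyzer)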

@markbackman
Contributor

We'll have to think more about this. The limitation may be the TTS service...or at least that's something we'll have to work around. There are two services—CartesiaTTSService and ElevenLabsTTSService—that support word level timestamps. Because of this, you can get text and audio synchronized at the word level. Without this word level timestamp information from the TTS service, it's hard to do this precisely.

If you're building a client/server app with RTVI, the word level information is easy to get. You just add an RTVIBotTTSProcessor to your pipeline after the TTS service. Then, on the client-side, you handle BotTtsText messages.
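Roughly like this; a minimal sketch of the pipeline placement, assuming the RTVI processors live in pipecat.processors.frameworks.rtvi (import paths and constructor arguments may differ between releases), with the other processors passed in from your own setup:

from pipecat.pipeline.pipeline import Pipeline
from pipecat.processors.frameworks.rtvi import RTVIBotTTSProcessor

def build_pipeline(transport, stt, context_aggregator, llm, tts) -> Pipeline:
    # Place RTVIBotTTSProcessor right after the TTS service so the client
    # receives BotTtsText messages as the words are spoken.
    return Pipeline([
        transport.input(),
        stt,
        context_aggregator.user(),
        llm,
        tts,                    # e.g. CartesiaTTSService (word-level timestamps)
        RTVIBotTTSProcessor(),  # forwards the TTS text to the RTVI client
        transport.output(),
        context_aggregator.assistant(),
    ])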

If you're not building with RTVI, then you can add a custom processor that emulates the same behavior to get access to the text as it's emitted.

I know this isn't what you're asking for, but I'm sharing a solution that already works today, if you're willing to try a different TTS provider. We'll consider ways to make this work, but it's going to be difficult.

@fiyen
Author

fiyen commented Nov 21, 2024

> We'll have to think more about this. The limitation may be the TTS service...or at least that's something we'll have to work around. There are two services—CartesiaTTSService and ElevenLabsTTSService—that support word level timestamps. Because of this, you can get text and audio synchronized at the word level. Without this word level timestamp information from the TTS service, it's hard to do this precisely.
>
> If you're building a client/server app with RTVI, the word level information is easy to get. You just add an RTVIBotTTSProcessor to your pipeline after the TTS service. Then, on the client-side, you handle BotTtsText messages.
>
> If you're not building with RTVI, then you can add a custom processor that emulates the same behavior to get access to the text as it's emitted.
>
> I know this isn't what you're asking for, but I'm sharing a solution that already works today, if you're willing to try a different TTS provider. We'll consider ways to make this work, but it's going to be difficult.

It is cool, thank you for your comment! I will think about how to use RTVI to enhance my client.

@markbackman
Contributor

@fiyen I'm going to close this issue. @aconchillo and I discussed this, and the TTS service needs to provide word-level timestamps to make this possible. Both the Cartesia and ElevenLabs TTS services support word-level timestamps, so you can try out one of those services.
