When allow_interruption is set to True, the bot may emit some sentences this turn but say some sentences last turn. #754

fiyen · 2024-11-28T04:45:29Z

Description

Is this reporting a bug or feature request?
A BUG

If reporting a bug, please fill out the following:

Environment

pipecat-ai version: 0.0.49
python version: 3.11
OS: Ubuntu

Issue description

When allow_interruption is set to True, there is a chance of audio lag occurring during the conversation, manifested as follows: when the user interrupts the Bot while it is speaking, the Bot's unfinished speech may continue after the user has finished speaking. Then, the Bot's current turn may have a sentence omitted, which will only be spoken in the next turn. The mechanism causing this issue has not been clarified to date.

Repro steps

Set allow_interruption = True，use AzureTTSService as tts, and speak with bot freely, this will happen randomly. I can figure out the chance to happen, but it will surely happen when you have been talking for time long enough.

Expected behavior

The bot says what it should say this turn, does not say the sentence in last turn.

Actual behavior

The bot emit one or more than one last sentences this turn, and say some sentences being interrupted in last turn.

Logs

No logs, I am not sure the conditions when it happens.

fiyen · 2024-11-28T09:22:24Z

I have found the right chance to cause this bug. You talk and interrupt the process just when the tts module is processing the speech chunks, then the sentence just being processed will be the part to be played in the next turn. I think the reason may be the _handle_interruption function do not filter the speech chunks of this sentence produced by tts module, and let them sent to frontend. But the confusion is that why this can cause missing speech chunks of new sentences in next turn.

jamsea · 2024-12-02T09:02:06Z

This looks related to the work being done here: #721

fiyen · 2024-12-02T11:59:42Z

This looks related to the work being done here: #721

Thank you very much for your guidance. I tried updating the code and running my program, but it didn't work. Your insights are very valuable, and I will try to find a solution to the problem.

fiyen · 2024-12-02T12:14:36Z

I have found the right chance to cause this bug. You talk and interrupt the process just when the tts module is processing the speech chunks, then the sentence just being processed will be the part to be played in the next turn. I think the reason may be the _handle_interruption function do not filter the speech chunks of this sentence produced by tts module, and let them sent to frontend. But the confusion is that why this can cause missing speech chunks of new sentences in next turn.

I tried several methods, and my initial guess was that after an interruption, the TTS might not stop immediately. To address this, I added logs in the TTS code to print all frames returned by the TTS. However, things didn't turn out as I expected. After an interruption occurs, my logs stop printing, which means subsequent frames are not being pushed. Here's where I set up the logs:

class FilterAzureTTSService(AzureTTSService):
    async def run_tts(self, text: str) -> AsyncGenerator[Frame, None]:
        try:
            voice_text_frame = VoiceEndTextFrame(
                b'', self._settings["sample_rate"], 1, f"###VOICEPARTEND###{text}")
            await self.push_frame(voice_text_frame)

            async for frame in super().run_tts(text):
                # Yield each frame to the caller
                yield frame
                logger.debug(f"azure frame: {frame}, {frame.id}")

            voice_text_frame = VoiceEndTextFrame(b'', self._settings["sample_rate"], 1, f"###VOICEPARTEND###")
            await self.push_frame(voice_text_frame)
        except asyncio.CancelledError:
            # Handle cleanup if necessary
            logger.debug("run_tts was cancelled.")
            raise  # Re-raise to handle or terminate properly

This is quite puzzling because the audio being generated is indeed sent to the frontend in the next turn and played. Here's an example:

In the previous round, "I am happy to see you" is being generated, but I interrupted it during the TTS audio generation process. At this point, the logger.debug(f"azure frame: {frame}, {frame.id}") in my code stops producing output, and subsequent push_frame calls are not executed. No more audio is sent to the frontend, and the playback ends. After I finish speaking, the next response should be: "Nice to meet you too. What can I do for you." However, before saying this, the audio for the previous "I am happy to see you" is played again, and the current "What can I do for you" is not played. It will only be played in the next cycle. From then on, in each round, the unfinished sentence from the previous round is played first, followed by the new sentence, creating a vicious cycle.

markbackman · 2024-12-03T17:37:11Z

@fiyen what transport are you using?

fiyen · 2024-12-04T00:19:21Z

@fiyen what transport are you using?
I am using the FastAPIWebsocketTransport in pipecat.transports.network.fastapi_websocket.py. The followings are my codes:

async def run_pipeline(websocket: WebSocket):
    conn = aiohttp.TCPConnector(ssl=ssl_context)
    async with aiohttp.ClientSession(connector=conn) as session:
        transport = FastAPIWebsocketTransport(
            websocket=websocket,
            params=FastAPIWebsocketParams(
                audio_out_enabled=True,
                add_wav_header=False,
                vad_enabled=True,
                vad_analyzer=SileroVADAnalyzer(params=VADParams(
                    confidence = 0.8,
                    start_secs = 0.2,
                    stop_secs = 0.4,
                    min_volume = 0.7
                )),
                audio_out_sample_rate=16000,
                audio_in_sample_rate=16000,
                vad_audio_passthrough=True,
                serializer=ProtobufFrameSerializer(),
                audio_in_filter=NoisereduceFilter()
            )
        )

        llm = OpenAILLMService(
            api_key=os.getenv("OPENAI_APIKEY"),
            base_url=os.getenv("OPENAI_BASEURL"),
            model="fast-gpt-4o-mini",
            params=OpenAILLMService.InputParams(
                frequency_penalty=0.5, # 减少词汇重复
                presence_penalty=0, # 保持话题一致
                seed=31, # 固定seed避免不同会话之间的随机变化
                temperature=0.3, # 减少随机性
                top_p=0.8,
            ))
        
        tools = []

        for tool_name in tool_list:
            tool = tool_box.get_tool(tool_name)
            if tool_name == 'fetch_next_sentence':
                tool.add_tool_func(fetch_next_sentence_from_api)
            if tool is not None:
                llm.register_function(
                    None,
                    tool.tool_func,
                    start_callback=tool.tool_start_callback)
                tools.append(tool.tool_param)


        stt = AzureSTTService(
            api_key=os.getenv("AZURE_SPEECH_API"),
            region=os.getenv("AZURE_SPEECH_REGION"),
            sample_rate=16000
        )

        tts = FilterAzureTTSService(
            aiohttp_session=session,
            api_key=os.getenv("AZURE_SPEECH_API"),
            region=os.getenv("AZURE_SPEECH_REGION"),
            voice="en-US-EmmaMultilingualNeural",
            sample_rate=16000,
        )


        messages = []

        # avt = AudioVolumeTimer()
        # tl = TranscriptionTimingLogger(avt)

        context = OpenAILLMContext(messages, tools, tool_choice='auto')
        context_aggregator = llm.create_context_aggregator(context)


        pipeline = Pipeline([
            transport.input(),   # Websocket input from client
            # avt,                 # Audio volume timer
            stt,                 # Speech-To-Text
            # tl,                  # Transcription timing logger
            context_aggregator.user(),
            llm,                 # LLM
            tts,                 # Text-To-Speech
            context_aggregator.assistant(),
            transport.output(),  # Websocket output to client
        ])

        task = PipelineTask(
            pipeline,
            PipelineParams(
                allow_interruptions=True
            ))

markbackman · 2024-12-04T03:16:15Z

@aconchillo you might want to take a look.

fiyen · 2024-12-05T04:54:14Z

By changing azuretts to other kind of tts, like elevenlabs or deepgram, this problem can be solved. So this problem may be caused by the azure tts module.

markbackman assigned aconchillo Dec 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

When allow_interruption is set to True, the bot may emit some sentences this turn but say some sentences last turn. #754

When allow_interruption is set to True, the bot may emit some sentences this turn but say some sentences last turn. #754

fiyen commented Nov 28, 2024

fiyen commented Nov 28, 2024 •

edited

Loading

jamsea commented Dec 2, 2024

fiyen commented Dec 2, 2024

fiyen commented Dec 2, 2024

markbackman commented Dec 3, 2024

fiyen commented Dec 4, 2024

markbackman commented Dec 4, 2024

fiyen commented Dec 5, 2024

When allow_interruption is set to True, the bot may emit some sentences this turn but say some sentences last turn. #754

When allow_interruption is set to True, the bot may emit some sentences this turn but say some sentences last turn. #754

Comments

fiyen commented Nov 28, 2024

Description

Environment

Issue description

Repro steps

Expected behavior

Actual behavior

Logs

fiyen commented Nov 28, 2024 • edited Loading

jamsea commented Dec 2, 2024

fiyen commented Dec 2, 2024

fiyen commented Dec 2, 2024

markbackman commented Dec 3, 2024

fiyen commented Dec 4, 2024

markbackman commented Dec 4, 2024

fiyen commented Dec 5, 2024

fiyen commented Nov 28, 2024 •

edited

Loading