When allow_interruptions is set to True, the bot may omit some sentences this turn and instead say sentences from the last turn #754

Open
fiyen opened this issue Nov 28, 2024 · 8 comments

fiyen commented Nov 28, 2024

Description

Is this reporting a bug or feature request?
A BUG

If reporting a bug, please fill out the following:

Environment

  • pipecat-ai version: 0.0.49
  • python version: 3.11
  • OS: Ubuntu

Issue description

When allow_interruptions is set to True, there is a chance of audio lag during the conversation. It manifests as follows: when the user interrupts the bot while it is speaking, the bot's unfinished speech may resume after the user has finished speaking. The bot's current turn may then have a sentence omitted, which is only spoken in the next turn. The mechanism causing this issue is not yet clear.

Repro steps

Set allow_interruptions = True, use AzureTTSService as the TTS, and talk with the bot freely; the problem happens randomly. I cannot pin down the exact conditions, but it will surely happen once you have been talking long enough.

Expected behavior

The bot says everything it should say this turn and does not replay any sentence from the last turn.

Actual behavior

The bot omits one or more of its final sentences this turn, and instead says sentences that were interrupted in the last turn.

Logs

No logs; I am not sure under what conditions it happens.

fiyen (Author) commented Nov 28, 2024

I have found the exact conditions that trigger this bug. If you speak and interrupt the bot just while the TTS module is processing a sentence's speech chunks, that sentence ends up being played in the next turn. I think the reason may be that the _handle_interruption function does not filter out the speech chunks of the sentence currently being produced by the TTS module and lets them be sent to the frontend. What still confuses me is why this causes speech chunks of the new sentences to go missing in the next turn.
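
To illustrate the kind of filtering I mean, here is a minimal sketch of a processor that drops stale TTS audio after an interruption. This is only an illustration, not the real _handle_interruption code, and it assumes pipecat's FrameProcessor API and the StartInterruptionFrame / TTSStartedFrame / TTSAudioRawFrame frame names:

# Sketch only: assumes pipecat's FrameProcessor API and these frame names;
# this is NOT the actual _handle_interruption implementation.
from pipecat.frames.frames import (
    Frame,
    StartInterruptionFrame,
    TTSAudioRawFrame,
    TTSStartedFrame,
)
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor


class DropStaleTTSAudio(FrameProcessor):
    """Drop TTS audio produced before the latest interruption until the next
    TTS generation starts."""

    def __init__(self):
        super().__init__()
        self._dropping = False

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)

        if isinstance(frame, StartInterruptionFrame):
            # The user started talking: discard whatever is still buffered
            # from the old sentence instead of replaying it next turn.
            self._dropping = True
        elif isinstance(frame, TTSStartedFrame):
            # A new TTS generation began, so the audio is fresh again.
            self._dropping = False

        if self._dropping and isinstance(frame, TTSAudioRawFrame):
            return  # swallow stale audio chunks

        await self.push_frame(frame, direction)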

jamsea (Contributor) commented Dec 2, 2024

This looks related to the work being done here: #721

fiyen (Author) commented Dec 2, 2024

This looks related to the work being done here: #721

Thank you very much for your guidance. I tried updating the code and running my program, but it didn't work. Your insights are very valuable, and I will try to find a solution to the problem.

fiyen (Author) commented Dec 2, 2024

I tried several methods. My initial guess was that after an interruption the TTS might not stop immediately, so I added logging in the TTS code to print every frame the TTS returns. However, things didn't turn out as I expected: after an interruption occurs, my logs stop printing, which means subsequent frames are not being pushed. Here's where I set up the logging:

import asyncio
from typing import AsyncGenerator

from loguru import logger

from pipecat.frames.frames import Frame
from pipecat.services.azure import AzureTTSService

# VoiceEndTextFrame is a custom frame type defined elsewhere in my project;
# it carries a text marker alongside empty audio data.


class FilterAzureTTSService(AzureTTSService):
    async def run_tts(self, text: str) -> AsyncGenerator[Frame, None]:
        try:
            # Push a custom marker frame carrying the sentence text before
            # synthesis starts.
            voice_text_frame = VoiceEndTextFrame(
                b'', self._settings["sample_rate"], 1, f"###VOICEPARTEND###{text}")
            await self.push_frame(voice_text_frame)

            async for frame in super().run_tts(text):
                # Yield each frame to the caller and log it.
                yield frame
                logger.debug(f"azure frame: {frame}, {frame.id}")

            # Push a custom marker frame after the sentence has been fully
            # synthesized.
            voice_text_frame = VoiceEndTextFrame(
                b'', self._settings["sample_rate"], 1, "###VOICEPARTEND###")
            await self.push_frame(voice_text_frame)
        except asyncio.CancelledError:
            # Handle cleanup if necessary
            logger.debug("run_tts was cancelled.")
            raise  # Re-raise to handle or terminate properly

This is quite puzzling because the audio being generated is indeed sent to the frontend in the next turn and played. Here's an example:

In the previous round, "I am happy to see you" is being generated, but I interrupted it during the TTS audio generation process. At this point, the logger.debug(f"azure frame: {frame}, {frame.id}") in my code stops producing output, and subsequent push_frame calls are not executed. No more audio is sent to the frontend, and the playback ends. After I finish speaking, the next response should be: "Nice to meet you too. What can I do for you." However, before saying this, the audio for the previous "I am happy to see you" is played again, and the current "What can I do for you" is not played. It will only be played in the next cycle. From then on, in each round, the unfinished sentence from the previous round is played first, followed by the new sentence, creating a vicious cycle.

markbackman (Contributor) commented

@fiyen what transport are you using?

fiyen (Author) commented Dec 4, 2024

@fiyen what transport are you using?
I am using the FastAPIWebsocketTransport from pipecat.transports.network.fastapi_websocket. The following is my code:

async def run_pipeline(websocket: WebSocket):
    conn = aiohttp.TCPConnector(ssl=ssl_context)
    async with aiohttp.ClientSession(connector=conn) as session:
        transport = FastAPIWebsocketTransport(
            websocket=websocket,
            params=FastAPIWebsocketParams(
                audio_out_enabled=True,
                add_wav_header=False,
                vad_enabled=True,
                vad_analyzer=SileroVADAnalyzer(params=VADParams(
                    confidence = 0.8,
                    start_secs = 0.2,
                    stop_secs = 0.4,
                    min_volume = 0.7
                )),
                audio_out_sample_rate=16000,
                audio_in_sample_rate=16000,
                vad_audio_passthrough=True,
                serializer=ProtobufFrameSerializer(),
                audio_in_filter=NoisereduceFilter()
            )
        )

        llm = OpenAILLMService(
            api_key=os.getenv("OPENAI_APIKEY"),
            base_url=os.getenv("OPENAI_BASEURL"),
            model="fast-gpt-4o-mini",
            params=OpenAILLMService.InputParams(
                frequency_penalty=0.5,  # reduce word repetition
                presence_penalty=0,  # keep the topic consistent
                seed=31,  # fix the seed to avoid random variation between sessions
                temperature=0.3,  # reduce randomness
                top_p=0.8,
            ))
        
        tools = []

        for tool_name in tool_list:
            tool = tool_box.get_tool(tool_name)
            if tool_name == 'fetch_next_sentence':
                tool.add_tool_func(fetch_next_sentence_from_api)
            if tool is not None:
                llm.register_function(
                    None,
                    tool.tool_func,
                    start_callback=tool.tool_start_callback)
                tools.append(tool.tool_param)


        stt = AzureSTTService(
            api_key=os.getenv("AZURE_SPEECH_API"),
            region=os.getenv("AZURE_SPEECH_REGION"),
            sample_rate=16000
        )

        tts = FilterAzureTTSService(
            aiohttp_session=session,
            api_key=os.getenv("AZURE_SPEECH_API"),
            region=os.getenv("AZURE_SPEECH_REGION"),
            voice="en-US-EmmaMultilingualNeural",
            sample_rate=16000,
        )


        messages = []

        # avt = AudioVolumeTimer()
        # tl = TranscriptionTimingLogger(avt)

        context = OpenAILLMContext(messages, tools, tool_choice='auto')
        context_aggregator = llm.create_context_aggregator(context)


        pipeline = Pipeline([
            transport.input(),   # Websocket input from client
            # avt,                 # Audio volume timer
            stt,                 # Speech-To-Text
            # tl,                  # Transcription timing logger
            context_aggregator.user(),
            llm,                 # LLM
            tts,                 # Text-To-Speech
            context_aggregator.assistant(),
            transport.output(),  # Websocket output to client
        ])

        task = PipelineTask(
            pipeline,
            PipelineParams(
                allow_interruptions=True
            ))
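
The snippet stops after creating the PipelineTask. For completeness, a minimal sketch of how the task would typically be run, assuming pipecat's standard PipelineRunner (this part is omitted from the code above):

# Sketch only: how the PipelineTask is typically run with pipecat's
# PipelineRunner (omitted from the snippet above).
from pipecat.pipeline.runner import PipelineRunner

runner = PipelineRunner()
await runner.run(task)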

markbackman (Contributor) commented

@aconchillo you might want to take a look.

fiyen (Author) commented Dec 5, 2024

Switching from Azure TTS to another TTS service, such as ElevenLabs or Deepgram, makes the problem go away. So the issue is probably caused by the Azure TTS module.
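
For reference, the workaround is just swapping the TTS service in the pipeline setup. A minimal sketch, assuming pipecat's ElevenLabsTTSService; the constructor arguments shown here are illustrative and may differ by version:

# Sketch of the workaround: use a different TTS service in place of
# FilterAzureTTSService / AzureTTSService. Assumes pipecat's
# ElevenLabsTTSService; exact constructor arguments may differ by version.
import os

from pipecat.services.elevenlabs import ElevenLabsTTSService

tts = ElevenLabsTTSService(
    api_key=os.getenv("ELEVENLABS_API_KEY"),
    voice_id=os.getenv("ELEVENLABS_VOICE_ID"),
    sample_rate=16000,
)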
