Yesterday, I experimented with Whisper to understand what’s needed to develop a voice chat bot. The biggest challenge is getting the bot to decide when to start talking: it often leaves long, awkward pauses, and sometimes it interrupts me before I’ve finished speaking.

I believe one potential solution is to utilize a Large Language Model (LLM) to analyze speech-to-text transcripts for cues that indicate when the AI should respond. In everyday conversations, we rely on various non-verbal cues like tone changes, posture shifts, gestures, and eye or eyebrow movements. Currently, AI can only recognize verbal cues, but we could also examine the speaker’s pitch to detect rising or falling intonations as additional indicators.
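Here’s a rough sketch of what that could look like in Python, combining both ideas: an LLM judges whether the transcript reads like a finished thought, and a pitch check looks for falling intonation at the end of the utterance. The model name, prompt, and thresholds are placeholders I haven’t tuned, and the helper names are mine, not from any library; it assumes the `openai` and `librosa` packages are installed.

```python
# Sketch: decide whether the bot should respond yet, based on
# (1) an LLM's judgement of the transcript and (2) end-of-utterance pitch.
import numpy as np
import librosa
from openai import OpenAI

client = OpenAI()


def transcript_looks_finished(transcript: str) -> bool:
    """Ask an LLM for a yes/no judgement on whether the speaker has finished their turn."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {
                "role": "system",
                "content": "You judge turn-taking. Reply with exactly 'yes' if the "
                           "speaker appears to have finished their turn, otherwise 'no'.",
            },
            {"role": "user", "content": transcript},
        ],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")


def pitch_is_falling(audio: np.ndarray, sr: int, tail_seconds: float = 1.0) -> bool:
    """Compare median pitch in the final second against the rest of the utterance."""
    f0, voiced, _ = librosa.pyin(
        audio, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
    )
    times = librosa.times_like(f0, sr=sr)
    tail = times >= times[-1] - tail_seconds
    head_f0 = f0[voiced & ~tail]
    tail_f0 = f0[voiced & tail]
    if head_f0.size == 0 or tail_f0.size == 0:
        return False  # not enough voiced audio to tell
    return np.median(tail_f0) < np.median(head_f0)


def should_respond(transcript: str, audio: np.ndarray, sr: int) -> bool:
    """Respond only when the words look complete and the intonation is falling."""
    return transcript_looks_finished(transcript) and pitch_is_falling(audio, sr)
```

In practice the two signals would probably need to be weighted rather than simply AND-ed together, since plenty of finished sentences end on a rising pitch (questions, for example).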

For now, I plan to use a keyboard key like Ctrl to signal to the bot when I want to speak, rather than relying on continuous voice activity detection. A sketch of that setup follows below.
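This is a minimal push-to-talk sketch: record from the microphone while Ctrl is held, then hand the captured audio to Whisper. It assumes the `sounddevice`, `pynput`, and `openai-whisper` packages are installed; the function name and sample rate handling are just one way to wire it up.

```python
# Sketch: hold Ctrl to record, release to transcribe with Whisper.
import numpy as np
import sounddevice as sd
import whisper
from pynput import keyboard

SAMPLE_RATE = 16_000  # Whisper expects 16 kHz mono audio
CTRL_KEYS = {keyboard.Key.ctrl, keyboard.Key.ctrl_l, keyboard.Key.ctrl_r}

model = whisper.load_model("base")


def record_while_held() -> np.ndarray:
    """Capture microphone audio for as long as a Ctrl key is pressed."""
    chunks = []
    recording = False

    def callback(indata, frames, time, status):
        if recording:
            chunks.append(indata.copy())

    def on_press(key):
        nonlocal recording
        if key in CTRL_KEYS:
            recording = True

    def on_release(key):
        nonlocal recording
        if key in CTRL_KEYS and recording:
            recording = False
            return False  # stop the listener once Ctrl is released

    with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, callback=callback):
        with keyboard.Listener(on_press=on_press, on_release=on_release) as listener:
            listener.join()

    if not chunks:
        return np.zeros(0, dtype=np.float32)
    return np.concatenate(chunks).flatten()


if __name__ == "__main__":
    print("Hold Ctrl and speak; release to transcribe.")
    audio = record_while_held()
    if audio.size:
        result = model.transcribe(audio.astype(np.float32))
        print("You said:", result["text"])
```

The nice thing about push-to-talk is that the end of the key press is an unambiguous end-of-turn signal, so none of the turn-taking heuristics above are needed while I iterate on the rest of the bot.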
