Supports continuous speech recognition and barge-in #5426

Merged 44 commits on Feb 13, 2025

@compulim (Contributor) commented Feb 12, 2025

Fixes #2661. Fixes #5352.

Initial work done in #5397.

Changelog Entry

Added

  • Resolved #2661 and #5352. Added speech recognition continuous mode with barge-in support, in PR #5426, by @RushikeshGavali and @compulim
    • Set styleOptions.speechRecognitionContinuous to true with a Web Speech API provider with continuous mode support

Changed

Description

Continuous mode is designed for hands-off/kiosk scenarios. End-users can hold a speech-primary conversation with the bot and occasionally interact through gestures (e.g. tapping on a card). Speech recognition is kept active for as long as possible, until the end-user turns it off.

Added a new styleOptions.speechRecognitionContinuous option to enable continuous mode for speech recognition.
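
A minimal sketch of how the option might be passed to Web Chat. Only styleOptions.speechRecognitionContinuous comes from this PR; buildRenderOptions is a hypothetical helper, and directLine and webSpeechPonyfillFactory are assumed to be created elsewhere, with the ponyfill coming from a Web Speech API provider that supports continuous mode.

```javascript
// Hypothetical helper: assembles the options object handed to renderWebChat.
// Only `styleOptions.speechRecognitionContinuous` is introduced by this PR.
function buildRenderOptions({ directLine, webSpeechPonyfillFactory, continuous = true }) {
  return {
    directLine,
    webSpeechPonyfillFactory,
    styleOptions: {
      // true: continuous mode; false: interactive mode.
      speechRecognitionContinuous: continuous
    }
  };
}

// In a page that has loaded the Web Chat bundle, usage would look roughly like:
// window.WebChat.renderWebChat(
//   buildRenderOptions({ directLine, webSpeechPonyfillFactory }),
//   document.getElementById('webchat')
// );
```

Keeping the flag in styleOptions means a host page can switch between interactive and continuous mode without touching the speech provider setup.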

Design

  • Interactive mode: speech recognition is active only for the minimal time needed; the focus is on privacy
  • Continuous mode: speech recognition stays active for as long as possible and survives non-speech interactions; barge-in is supported; the focus is on a hands-off experience

Behavioral differences

  • Continuous mode will not turn off the microphone after speech is recognized
    • This behavior is driven by the Web Speech API provider
      • Technically, Web Chat keeps the microphone on until the end event is received; receiving a result event does not turn it off
    • If the Web Speech API provider does not support continuous mode, it should send the end event after speech is recognized
  • While the bot response is synthesizing and input mode is "expecting input":
    • Interactive mode:
      • While synthesis is ongoing, speech recognition is paused
      • After synthesis has completed, speech recognition will be resumed
    • Continuous mode:
      • While synthesis is ongoing, speech recognition continues to be active
      • When an interim result is recognized, synthesis will be interrupted (a.k.a. barge-in)
      • Logically, "expecting input" is ignored (speech recognition is always active and not paused)
  • While speech recognition is active and the end-user taps on a card action:
    • Interactive mode: will stop speech recognition, will not speak bot response
    • Continuous mode: will not stop speech recognition, will speak bot response
  • While speech recognition is active and a proactive bot message is received:
    • Interactive mode: will not synthesize the bot message
    • Continuous mode: will synthesize the bot message
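
The behavioral differences above can be summarized as a small lookup. This is only a sketch for illustration: speechBehavior and its property names are hypothetical and are not taken from the Web Chat source.

```javascript
// Hypothetical summary of the list above; `continuous` mirrors
// styleOptions.speechRecognitionContinuous.
function speechBehavior(continuous) {
  return {
    // During synthesis with input mode "expecting input":
    // interactive pauses recognition, continuous keeps it active.
    recognizesDuringSynthesis: continuous,
    // An interim result interrupts synthesis (barge-in) in continuous mode only.
    bargeInOnInterim: continuous,
    // Tapping a card action while recognition is active.
    stopsRecognitionOnCardAction: !continuous,
    speaksResponseAfterCardAction: continuous,
    // Proactive bot message received while recognition is active.
    synthesizesProactiveMessage: continuous
  };
}
```

For example, speechBehavior(false).stopsRecognitionOnCardAction is true, matching the interactive mode row for card actions.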

Technical details

  • Web Chat relies on the Web Speech API provider behaving correctly, including:
    • Web Chat assumes the microphone is on when the start event is received
    • Web Chat assumes the microphone is off when the end event is received
  • Web Chat does not check the SpeechRecognition.continuous property; it depends only on the events dispatched by the Web Speech API provider
    • The microphone will be turned off when the end event is received
    • The message will be sent when a result event is received whose resultIndex points to a result with isFinal set to true
      • event.results[event.resultIndex].isFinal === true
  • If an interim result is received, Web Chat will stop speech synthesis
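
The event-driven rules above could be sketched roughly as follows. This is not the Web Chat implementation: attachRecognitionHandlers, onSend and onStopSynthesis are hypothetical names, and recognition is assumed to be any EventTarget that dispatches Web Speech API-style start, result and end events.

```javascript
// Hypothetical sketch: reacts only to events, never to `recognition.continuous`.
function attachRecognitionHandlers(recognition, { onSend, onStopSynthesis }) {
  const state = { microphoneOn: false };

  // The microphone is assumed to be on once "start" is received.
  recognition.addEventListener('start', () => {
    state.microphoneOn = true;
  });

  recognition.addEventListener('result', event => {
    const result = event.results[event.resultIndex];

    if (!result) {
      return;
    }

    if (result.isFinal) {
      // Final result: send the recognized text as a message.
      // The microphone state is left untouched here.
      onSend(result[0].transcript);
    } else {
      // Interim result: stop speech synthesis (barge-in).
      onStopSynthesis();
    }
  });

  // The microphone is assumed to be off only when "end" is received.
  recognition.addEventListener('end', () => {
    state.microphoneOn = false;
  });

  return state;
}
```

With this shape, a provider without continuous mode support simply dispatches end right after its final result, and the same handlers turn the microphone off without any mode-specific branch.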

Specific Changes

  • Added new styleOptions.speechRecognitionContinuous
  • I have added tests and executed them locally
  • I have updated CHANGELOG.md
  • I have updated documentation

Review Checklist

This section is for contributors to review your work.

  • Accessibility reviewed (tab order, content readability, alt text, color contrast)
  • Browser and platform compatibilities reviewed
  • CSS styles reviewed (minimal rules, no z-index)
  • Documents reviewed (docs, samples, live demo)
  • Internationalization reviewed (strings, unit formatting)
  • package.json and package-lock.json reviewed
  • Security reviewed (no data URIs, check for nonce leak)
  • Tests reviewed (coverage, legitimacy)

@compulim changed the title from "DRAFT: Supports barge-in for speech recognition" to "Supports continuous speech recognition and barge-in" on Feb 13, 2025
@compulim compulim marked this pull request as ready for review February 13, 2025 11:06
@OEvgeny (Collaborator) previously approved these changes on Feb 13, 2025 and left a comment:

Looks solid, couple of nits and questions

@compulim compulim merged commit c8c5744 into microsoft:main Feb 13, 2025
25 checks passed
@compulim compulim deleted the feat-speech-barge-in branch February 13, 2025 19:21
Development

Successfully merging this pull request may close these issues.

  • Speech to text behavior
  • [Tracking] DLS: Use continuous mode when Speech SDK support it
2 participants