Skip to content

feat: streaming in speech to text transcription #168

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: v0.4.0-rc1
Choose a base branch
from

Conversation

mdydek
Copy link

@mdydek mdydek commented Apr 1, 2025

Description

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update (improves or adds clarity to existing documentation)

Tested on

  • iOS
  • Android

Testing instructions

Screenshots

Related issues

Checklist

  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have updated the documentation accordingly
  • My changes generate no new warnings

Additional notes

@mdydek mdydek requested a review from mkopcins April 1, 2025 08:32
@mkopcins mkopcins changed the title @md/s2t streaming feat: @md/s2t streaming Apr 1, 2025
@mdydek mdydek changed the title feat: @md/s2t streaming feat: streaming in speech to text transcription Apr 1, 2025
@mdydek mdydek force-pushed the @md/s2t_streaming branch 4 times, most recently from 72b2505 to 0f270a7 Compare April 7, 2025 14:17
Copy link
Collaborator

@mkopcins mkopcins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolve conflicts, I will test it tomorrow, but looks ok in general

@mkopcins mkopcins self-assigned this May 7, 2025
<!-- Provide a concise and descriptive summary of the changes
implemented in this PR. -->

- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing
functionality to not work as expected)
- [ ] Documentation update (improves or adds clarity to existing
documentation)

- [x] iOS
- [x] Android

<!-- Provide step-by-step instructions on how to test your changes.
Include setup details if necessary. -->

<!-- Add screenshots here, if applicable -->

<!-- Link related issues here using #issue-number -->

- [x] I have performed a self-review of my code
- [ ] I have commented my code, particularly in hard-to-understand areas
- [ ] I have updated the documentation accordingly
- [ ] My changes generate no new warnings

<!-- Include any additional information, assumptions, or context that
reviewers might need to understand this PR. -->
@mkopcins mkopcins force-pushed the @md/s2t_streaming branch 2 times, most recently from 86cde69 to f6f8f31 Compare May 15, 2025 12:13
improved error handling, minor refactoring
@mkopcins mkopcins force-pushed the @md/s2t_streaming branch from f6f8f31 to a4c262a Compare May 15, 2025 12:15
@mkopcins mkopcins force-pushed the @md/s2t_streaming branch from ddec73b to 135eb72 Compare May 15, 2025 12:46
Copy link
Collaborator

@pweglik pweglik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some questions, but my main issue is including stuff from speech to text in llm example

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need audio streaming in LLM example app? Maybe it would be more suitable in speech-to-text app? Right now, over half of the file has nothing to do with our LLM feature that it should showcase

return '';
await this.encode(chunk);
} catch (error) {
this.onErrorCallback?.(new Error(getError(error) + ' encoding error'));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This regards all this.onErrorCallback calls in this file. What if dev doesn't pass onErrorCallback in constructor? It would make all those errors disappear into thin air. We may use something like suggested here:
#183
Maybe even define this.errorCallback in constructor like:

this.errorCallback = () => {
if (this.errorCallback) {
    this.errorCallback(getError(e));
  } else {
    throw new Error(getError(e));
  }

or some variation. We may also require to pass errorCallback or ignore this issue altogheter but I think it may result in poor DX

Comment on lines +258 to +277
private trimLeft(numOfTokensToSlice: number) {
for (let idx = 1; idx < this.seqs.length; idx++) {
if (this.seqs[idx]![0] === this.config.tokenizer.bos)
this.seqs[idx] = this.seqs[idx]!.slice(numOfTokensToSlice);
}
}

private trimRight(numOfTokensToSlice: number) {
for (let idx = 0; idx < this.seqs.length - 1; idx++) {
if (this.seqs[idx]!.at(-1) === this.config.tokenizer.eos)
this.seqs[idx] = this.seqs[idx]!.slice(0, -numOfTokensToSlice);
}
}

private async trimSequences(audioLanguage?: string) {
const numSpecialTokens = (await this.getStartingTokenIds(audioLanguage))
.length;
this.trimLeft(numSpecialTokens + NUM_TOKENS_TO_SLICE);
this.trimRight(numSpecialTokens + NUM_TOKENS_TO_SLICE);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We call trimSeqence in loop where we push new seq. Wouldn't it be enough then:

private trimLeft(newSeq, numOfTokensToSlice: number) {
      if (newSeq[0] === this.config.tokenizer.bos)
        newSeq = newSeq.slice(numOfTokensToSlice);
  }

  private trimRight(newSeq, numOfTokensToSlice: number) {
      if (newSeq.at(-1) === this.config.tokenizer.eos)
        newSeq = newSeq.slice(0, -numOfTokensToSlice);
  }

  private async trimSequences(newSeq: number[], audioLanguage?: string) {
    const numSpecialTokens = (await this.getStartingTokenIds(audioLanguage))
      .length;
    this.trimLeft(newSeq, numSpecialTokens + NUM_TOKENS_TO_SLICE);
    this.trimRight(newSeq, numSpecialTokens + NUM_TOKENS_TO_SLICE);
  }

Almost feel like you can do without trimLeft and trimRight - just put it all in one function

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this code is not strictly correct since we trimRight the second to last seq and trim left the last seq, but yes, I can simplify this

this.waveform = this.waveform.slice(
this.windowSize - this.overlapSeconds * Number(this.seqs.length === 0)
);
const seq = await this.decodeChunk(chunk, audioLanguage);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We decode it online. What happens if decoding is slower than new samples coming in? Is it realistic case or it never happens?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants