feat: streaming in speech to text transcription (#168)
## Description
<!-- Provide a concise and descriptive summary of the changes
implemented in this PR. -->
### Type of change
- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing
functionality to not work as expected)
- [ ] Documentation update (improves or adds clarity to existing
documentation)
### Tested on
- [x] iOS
- [ ] Android
### Testing instructions
<!-- Provide step-by-step instructions on how to test your changes.
Include setup details if necessary. -->
### Screenshots
<!-- Add screenshots here, if applicable -->
### Related issues
<!-- Link related issues here using #issue-number -->
### Checklist
- [x] I have performed a self-review of my code
- [x] I have commented my code, particularly in hard-to-understand areas
- [ ] I have updated the documentation accordingly
- [ ] My changes generate no new warnings
### Additional notes
<!-- Include any additional information, assumptions, or context that
reviewers might need to understand this PR. -->
---------
Co-authored-by: Mateusz Kopciński <[email protected]>
Co-authored-by: Mateusz Kopcinski <[email protected]>
The updated speech-to-text API documentation:

|Function|Type|Description|
|---|---|---|
|`load`| <code>(modelName: 'whisper' \| 'moonshine' \| 'whisperMultilingual', transcribeCallback?: (sequence: string) => void, modelDownloadProgressCallback?: (downloadProgress: number) => void, encoderSource?: ResourceSource, decoderSource?: ResourceSource, tokenizerSource?: ResourceSource)</code> | Loads the model specified by `modelName`, where `encoderSource`, `decoderSource`, and `tokenizerSource` are strings specifying the location of the model binaries. `modelDownloadProgressCallback` lets you monitor the progress of the model download, while `transcribeCallback` is invoked with each generated token. |
|`transcribe`|`(waveform: number[], audioLanguage?: SpeechToTextLanguage): Promise<string>`| Starts a transcription process for the given input array, which should be a waveform sampled at 16 kHz. Resolves with the output transcription when the model is finished. For multilingual models you have to specify `audioLanguage`, the language spoken in the audio. |
|`streamingTranscribe`|`(streamingAction: STREAMING_ACTION, waveform?: number[], audioLanguage?: SpeechToTextLanguage) => Promise<string>`| Runs the transcription online, i.e. when the whole audio is not known beforehand, such as when transcribing a live microphone feed. `streamingAction` defines the type of packet sent to the model: <ul><li>`START` - initializes the process; accepts optional `waveform` data</li><li>`DATA` - a packet containing the next chunk of audio, sampled at 16 kHz</li><li>`STOP` - the last data chunk for this transcription; ends the process and flushes internal buffers</li></ul> Each call returns the most recent transcription. Returns an error when called while the module is in use (i.e. processing a `transcribe` call). |
|`encode`|`(waveform: number[]) => Promise<number[]>`| Runs the encoder part of the model. Returns a float array representing the output of the encoder. |
|`decode`|`(tokens: number[], encodings?: number[]) => Promise<number[]>`| Runs the decoder of the model. Returns a single token, the next token in the output sequence. If `encodings` are provided they are used in the decoding process; otherwise the cached encodings from the most recent `encode` call are used. The cached option is much faster due to the large overhead of communication between the native and React layers. |
|`configureStreaming`| <code>(overlapSeconds?: number, windowSize?: number, streamingConfig?: 'fast' \| 'balanced' \| 'quality') => void</code> | Configures the streaming algorithm: <ul><li>`overlapSeconds` determines how much adjacent audio chunks overlap (increasing it slows down transcription and decreases the probability of garbled wording at chunk boundaries; values larger than 3 seconds are generally discouraged),</li><li>`windowSize` sets the size of the audio chunks (increasing it speeds up end-to-end transcription but increases the latency until the first token is returned),</li><li>`streamingConfig` selects predefined values for `windowSize` and `overlapSeconds`.</li></ul> Keep `windowSize + 2 * overlapSeconds <= 30`. |
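As a usage illustration, here is a minimal sketch of the streaming flow. The `SpeechToTextModule` object and its import path are assumptions made for the example (the table above only pins down the function signatures), and the microphone feed is stubbed as an async iterable of 16 kHz sample chunks:

```typescript
// Sketch only: `SpeechToTextModule` and the import path are assumed names,
// not confirmed by this PR; the call signatures follow the table above.
import { SpeechToTextModule, STREAMING_ACTION } from 'react-native-executorch';

async function streamFromMic(chunks: AsyncIterable<number[]>): Promise<string> {
  // Load a model first; the callback receives the sequence as tokens are generated.
  await SpeechToTextModule.load('moonshine', (sequence) =>
    console.log('partial:', sequence)
  );

  // Optional tuning; keep windowSize + 2 * overlapSeconds <= 30.
  SpeechToTextModule.configureStreaming(2, 12, 'balanced');

  // START initializes the session (an initial waveform is optional).
  await SpeechToTextModule.streamingTranscribe(STREAMING_ACTION.START);

  // Each DATA call pushes the next 16 kHz chunk and resolves with the
  // most recent transcription so far.
  for await (const chunk of chunks) {
    const partial = await SpeechToTextModule.streamingTranscribe(
      STREAMING_ACTION.DATA,
      chunk
    );
    console.log('so far:', partial);
  }

  // STOP flushes internal buffers and resolves with the final text.
  return SpeechToTextModule.streamingTranscribe(STREAMING_ACTION.STOP);
}
```

Per the table, `streamingTranscribe` returns an error if the module is already busy (e.g. processing a plain `transcribe` call), so the two entry points should not be interleaved.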
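The lower-level `encode`/`decode` pair can also be driven by hand for a custom decoding loop. A rough sketch under the same naming assumption; the start and end-of-sequence token IDs are model-specific placeholders, and omitting the `encodings` argument relies on the documented caching of the most recent `encode` result:

```typescript
// Hypothetical greedy decoding loop on top of encode()/decode().
async function greedyDecode(waveform: number[]): Promise<number[]> {
  // Encode once; the result is cached on the native side.
  await SpeechToTextModule.encode(waveform);

  const START_TOKEN = 50258; // placeholder, depends on the tokenizer
  const EOS_TOKEN = 50257;   // placeholder, depends on the tokenizer
  const tokens: number[] = [START_TOKEN];

  for (let step = 0; step < 512; step++) {
    // No `encodings` argument: reuse the cached encoder output instead of
    // shipping it across the native <-> React bridge on every step.
    const [next] = await SpeechToTextModule.decode(tokens);
    if (next === EOS_TOKEN) break;
    tokens.push(next);
  }
  return tokens;
}
```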