
Commit d93c08b

mdydek and Mateusz Kopciński (mkopcins) authored
feat: streaming in speech to text transcription (#168)
## Description

### Type of change

- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] Documentation update (improves or adds clarity to existing documentation)

### Tested on

- [x] iOS
- [ ] Android

### Checklist

- [x] I have performed a self-review of my code
- [x] I have commented my code, particularly in hard-to-understand areas
- [ ] I have updated the documentation accordingly
- [ ] My changes generate no new warnings

Co-authored-by: Mateusz Kopciński <[email protected]>
Co-authored-by: Mateusz Kopcinski <[email protected]>
1 parent 39ac729 · commit d93c08b

File tree

18 files changed: +1720 -2954 lines


docs/docs/natural-language-processing/useSpeechToText.md

Lines changed: 103 additions & 10 deletions
Large diffs are not rendered by default.
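
Since this hook's diff is not rendered, here is a minimal sketch of how the updated streaming-capable API might be consumed through the hook. It assumes `useSpeechToText` is exported from `react-native-executorch` and mirrors the module-level API documented below; the option and return names (`modelName`, `transcribe`, `sequence`, `isReady`) are assumptions, not confirmed by this diff.

```tsx
import React from 'react';
import { Button, Text, View } from 'react-native';
import { useSpeechToText } from 'react-native-executorch';

// Sketch only: the hook's exact option and return names are assumptions,
// since this file's diff is not rendered above.
export function TranscriptionView({ waveform }: { waveform: number[] }) {
  const { transcribe, sequence, isReady } = useSpeechToText({
    modelName: 'moonshine', // assumed to accept the same model names as `load`
  });

  return (
    <View>
      {/* `sequence` is assumed to hold the transcription generated so far */}
      <Text>{sequence}</Text>
      <Button
        title="Transcribe"
        disabled={!isReady}
        onPress={() => transcribe(waveform)}
      />
    </View>
  );
}
```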

docs/docs/typescript-api/SpeechToTextModule.md

Lines changed: 14 additions & 7 deletions
@@ -37,20 +37,27 @@ const transcribedText = await SpeechToTextModule.transcribe(waveform);
 
 ### Methods
 
-| Method | Type | Description |
-| ------ | ---- | ----------- |
-| `load` | <code>(modelName: 'whisper' &#124 'moonshine' &#124 'whisperMultilingual', transcribeCallback?: (sequence: string) => void, modelDownloadProgressCallback?: (downloadProgress: number) => void, encoderSource?: ResourceSource, decoderSource?: ResourceSource, tokenizerSource?: ResourceSource)</code> | Loads the model specified with `modelName`, where `encoderSource`, `decoderSource`, `tokenizerSource` are strings specifying the location of the binaries for the models. `modelDownloadProgressCallback` allows you to monitor the current progress of the model download, while `transcribeCallback` is invoked with each generated token |
-| `transcribe` | `(waveform: number[], audioLanguage?: SpeechToTextLanguage): Promise<string>` | Starts a transcription process for a given input array, which should be a waveform at 16kHz. Resolves a promise with the output transcription when the model is finished. For multilingual models, you have to specify the audioLanguage flag, which is the language of the spoken language in the audio. |
-| `encode` | `(waveform: number[]) => Promise<number[]>` | Runs the encoding part of the model. Returns a float array representing the output of the encoder. |
-| `decode` | `(tokens: number[], encodings?: number[]) => Promise<number[]>` | Runs the decoder of the model. Returns a single token representing a next token in the output sequence. If `encodings` are provided then they are used for decoding process, if not then the cached encodings from most recent `encode` call are used. The cached option is much faster due to very large overhead for communication between native and react layers. |
-| `configureStreaming` | <code>(overlapSeconds?: number, windowSize?: number, streamingConfig?: 'fast' &#124; 'balanced' &#124; 'quality') => void</code> | Configures options for the streaming algorithm: <ul><li>`overlapSeconds` determines how much adjacent audio chunks overlap (increasing it slows down transcription, decreases probability of weird wording at the chunks intersection, setting it larger than 3 seconds generally is discouraged),</li><li>`windowSize` describes size of the audio chunks (increasing it speeds up the end to end transcription time, but increases latency for the first token to be returned),</li><li>`streamingConfig` predefined configs for `windowSize` and `overlapSeconds` values.</li></ul> Keep `windowSize + 2 * overlapSeconds <= 30`. |
+| Method | Type | Description |
+| ------ | ---- | ----------- |
+| `load` | <code>(modelName: 'whisper' &#124; 'moonshine' &#124; 'whisperMultilingual', transcribeCallback?: (sequence: string) => void, modelDownloadProgressCallback?: (downloadProgress: number) => void, encoderSource?: ResourceSource, decoderSource?: ResourceSource, tokenizerSource?: ResourceSource)</code> | Loads the model specified with `modelName`, where `encoderSource`, `decoderSource`, and `tokenizerSource` are strings specifying the location of the binaries for the models. `modelDownloadProgressCallback` lets you monitor the current progress of the model download, while `transcribeCallback` is invoked with each generated token. |
+| `transcribe` | `(waveform: number[], audioLanguage?: SpeechToTextLanguage): Promise<string>` | Starts a transcription process for a given input array, which should be a waveform at 16 kHz. Resolves a promise with the output transcription when the model is finished. For multilingual models, you have to specify the `audioLanguage` flag, which is the language spoken in the audio. |
+| `streamingTranscribe` | `(streamingAction: STREAMING_ACTION, waveform?: number[], audioLanguage?: SpeechToTextLanguage) => Promise<string>` | Runs the transcription process online, i.e. when the whole audio is not known beforehand, such as when transcribing a live microphone feed. `streamingAction` defines the type of packet sent to the model: <ul><li>`START` initializes the process and accepts optional `waveform` data,</li><li>`DATA` packets should contain consecutive audio data chunks sampled at 16 kHz,</li><li>`STOP` carries the last data chunk for this transcription, ends the transcription process, and flushes internal buffers.</li></ul> Each call returns the most recent transcription. Returns an error when called while the module is in use (i.e. processing a `transcribe` call). |
+| `encode` | `(waveform: number[]) => Promise<number[]>` | Runs the encoding part of the model. Returns a float array representing the output of the encoder. |
+| `decode` | `(tokens: number[], encodings?: number[]) => Promise<number[]>` | Runs the decoder of the model. Returns a single token representing the next token in the output sequence. If `encodings` are provided, they are used for the decoding process; if not, the cached encodings from the most recent `encode` call are used. The cached option is much faster due to the very large overhead of communication between the native and React layers. |
+| `configureStreaming` | <code>(overlapSeconds?: number, windowSize?: number, streamingConfig?: 'fast' &#124; 'balanced' &#124; 'quality') => void</code> | Configures options for the streaming algorithm: <ul><li>`overlapSeconds` determines how much adjacent audio chunks overlap (increasing it slows down transcription and decreases the probability of garbled wording at chunk boundaries; setting it larger than 3 seconds is generally discouraged),</li><li>`windowSize` describes the size of the audio chunks (increasing it speeds up end-to-end transcription time but increases the latency before the first token is returned),</li><li>`streamingConfig` selects predefined values for `windowSize` and `overlapSeconds`.</li></ul> Keep `windowSize + 2 * overlapSeconds <= 30`. |
 
 <details>
 <summary>Type definitions</summary>
 
 ```typescript
 type ResourceSource = string | number | object;
 
+enum STREAMING_ACTION {
+  START,
+  DATA,
+  STOP,
+}
+
 enum SpeechToTextLanguage {
   Afrikaans = 'af',
   Albanian = 'sq',
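
Taken together, the new `streamingTranscribe` row and the `STREAMING_ACTION` enum describe an online transcription loop. Below is a minimal sketch of how this API might be driven, assuming both `SpeechToTextModule` and `STREAMING_ACTION` are exported from `react-native-executorch`; `getNextAudioChunk` is a hypothetical stand-in for a real 16 kHz microphone feed.

```typescript
import { SpeechToTextModule, STREAMING_ACTION } from 'react-native-executorch';

// Hypothetical audio source: any feed yielding consecutive chunks of 16 kHz
// mono samples would work here; `null` signals the end of input.
declare function getNextAudioChunk(): Promise<number[] | null>;

async function streamTranscription(): Promise<string> {
  // Load the model once; `transcribeCallback` fires with each generated token.
  await SpeechToTextModule.load('moonshine', (sequence) => {
    console.log('partial:', sequence);
  });

  // Optional tuning: keep windowSize + 2 * overlapSeconds <= 30.
  SpeechToTextModule.configureStreaming(2, 7);

  // START initializes the streaming session (waveform data is optional here).
  await SpeechToTextModule.streamingTranscribe(STREAMING_ACTION.START);

  // DATA packets carry consecutive 16 kHz chunks; each call resolves with the
  // most recent transcription.
  let chunk = await getNextAudioChunk();
  while (chunk !== null) {
    const partial = await SpeechToTextModule.streamingTranscribe(
      STREAMING_ACTION.DATA,
      chunk
    );
    console.log('so far:', partial);
    chunk = await getNextAudioChunk();
  }

  // STOP ends the session, flushes the internal buffers, and resolves with
  // the final transcription.
  return SpeechToTextModule.streamingTranscribe(STREAMING_ACTION.STOP);
}
```

Instead of explicit `overlapSeconds`/`windowSize` values, one of the predefined `'fast' | 'balanced' | 'quality'` configs can be passed as the third argument of `configureStreaming`.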
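The `encode`/`decode` rows also permit a manual generation loop. The sketch below is only an illustration of the cached-encodings behaviour: the start and end-of-sequence token ids are placeholders (the real values depend on the model's tokenizer), and it assumes `decode` resolves with a single-element array holding the next token id.

```typescript
import { SpeechToTextModule } from 'react-native-executorch';

// Placeholder token ids: real values depend on the model's tokenizer.
const START_TOKEN = 0;
const EOS_TOKEN = 1;
const MAX_TOKENS = 256; // safety cap for this sketch

async function manualTranscribe(waveform: number[]): Promise<number[]> {
  // Encode once; the encoder output is cached on the native side.
  await SpeechToTextModule.encode(waveform);

  const tokens: number[] = [START_TOKEN];
  while (tokens.length < MAX_TOKENS) {
    // Omitting `encodings` reuses the cached encoder output, avoiding the
    // overhead of shipping a large float array between the native and React
    // layers on every step.
    const [next] = await SpeechToTextModule.decode(tokens);
    if (next === EOS_TOKEN) {
      break;
    }
    tokens.push(next);
  }
  return tokens;
}
```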

examples/llm/App.tsx

Lines changed: 6 additions & 1 deletion
@@ -14,9 +14,11 @@ import {
 } from 'react-native';
 import LLMScreen from './screens/LLMScreen';
 import LLMToolCallingScreen from './screens/LLMToolCallingScreen';
+import VoiceChatScreen from './screens/VocieChatScreen';
 
 enum Mode {
   LLM,
+  LLM_VOICE_CHAT,
   LLM_TOOL_CALLING,
 }
 
@@ -39,6 +41,9 @@ export default function App() {
       case Mode.LLM:
         return <LLMScreen setIsGenerating={setIsGenerating} />;
 
+      case Mode.LLM_VOICE_CHAT:
+        return <VoiceChatScreen setIsGenerating={setIsGenerating} />;
+
       case Mode.LLM_TOOL_CALLING:
         return <LLMToolCallingScreen setIsGenerating={setIsGenerating} />;
 
@@ -61,7 +66,7 @@ export default function App() {
       {!isGenerating ? (
         <View style={styles.wheelPickerContainer}>
           <ScrollPicker
-            dataSource={['Chat with LLM', 'Tool calling']}
+            dataSource={['Chat with LLM', 'Talk to LLM', 'Tool calling']}
             onValueChange={(_, selectedIndex) => {
               handleModeChange(selectedIndex);
             }}
(Two further file diffs did not load: 4 additions & 0 deletions each.)

0 commit comments
