This script modifies OpenAI's Whisper with more robust decoding logic to produce more accurate segment-level timestamps and obtain word-level timestamps without extra inference.
- Adds a function to stabilize timestamps with multiple inferences
- Adds word-level timestamps (previously only token-level)
pip install stable-ts
To install the latest version:
pip install git+https://github.com/jianfch/stable-ts.git
Transcribe an audio file, then save the result as a JSON file.
stable-ts audio.mp3 -o audio.json
Process the resulting JSON file into an ASS subtitle file.
stable-ts audio.json -o audio.ass
Transcribe multiple audio files, then process the results directly into SRT files.
stable-ts audio1.mp3 audio2.mp3 audio3.mp3 -o audio1.srt audio2.srt audio3.srt
Show all available arguments and help.
stable-ts -h
import stable_whisper
model = stable_whisper.load_model('base')
# the modified model runs just like the regular model but accepts additional parameters
results = model.transcribe('audio.mp3')
[demo video: jfk_segment.mp4, generated with default settings on version 1.1 with the large model]
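Since the modified model runs just like the regular model, the result keeps the standard Whisper output structure, and segment-level timestamps can be read straight off the returned dict. A minimal sketch (the 'segments', 'start', 'end', and 'text' keys are part of regular Whisper output):

# print each segment with its start/end times; these keys come from
# the standard Whisper result structure
for segment in results['segments']:
    print(f"[{segment['start']:.2f} -> {segment['end']:.2f}] {segment['text'].strip()}")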
# sentence/phrase-level
stable_whisper.results_to_sentence_srt(results, 'audio.srt')
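The multi-file workflow from the command-line section can also be scripted in Python. A short sketch reusing the functions above (the file names are illustrative):

import stable_whisper

model = stable_whisper.load_model('base')
# transcribe each file and write a matching sentence-level SRT
for name in ['audio1', 'audio2', 'audio3']:
    results = model.transcribe(f'{name}.mp3')
    stable_whisper.results_to_sentence_srt(results, f'{name}.srt')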
[demo video: jfk_word_segments.mp4, generated with default settings on version 1.1 with the large model]
# sentence/phrase-level & word-level
stable_whisper.results_to_sentence_word_ass(results, 'audio.ass')
- Although timestamps are chronological, they can still be very inaccurate depending on the model, audio, and parameters; a quick way to spot-check them is shown in the sketch below.
- To produce production-ready word-level results, the model needs to be fine-tuned on a high-quality dataset of audio with word-level timestamps.
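For the spot-check mentioned above, the word timestamps can be printed for manual review. A rough sketch, assuming each segment carries a 'whole_word_timestamps' list of {'word', 'timestamp'} entries as in version 1.x output (these key names are an assumption and may differ across versions):

# print every word with its timestamp for manual spot-checking;
# 'whole_word_timestamps' and its fields are assumed from version 1.x output
for segment in results['segments']:
    for word in segment.get('whole_word_timestamps', []):
        print(f"{word['timestamp']:.2f}s {word['word']}")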
This project is licensed under the MIT License - see the LICENSE file for details.
Includes a slight modification of the original work: Whisper