
'waveform' must be provided as a (channel, time) torch Tensor. #363

Open
xiaobai40009 opened this issue Jan 12, 2025 · 1 comment

@xiaobai40009

```
note: You will see Progress if working correctly
WhisperX processing error: 'waveform' must be provided as a (channel, time) torch Tensor.
```

@bmjlgenhao2

This might be because VideoLingo splits the input video into smaller segments, each no longer than 30 minutes (1800 seconds), and then runs Whisper on each segment. When your video's duration is only slightly over 1800 seconds, the second segment produced by the split may contain almost no audio, causing WhisperX to fail on it. Here is the error message I encountered:

```
▶️ Starting WhisperX for segment 1821.55s to 1821.64s...
📥 Using WHISPER model from HuggingFace: large-v3 ...
**You can ignore warning of `Model was trained with torch 1.10.0+cu102, yours is 2.0.0+cu118...`**
Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.3.3. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../../.local/lib/python3.10/site-packages/whisperx/assets/pytorch_model.bin`
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.1.2. Bad things might happen unless you revert torch to 1.x.
note: You will see Progress if working correctly
WhisperX processing error: 'waveform' must be provided as a (channel, time) torch Tensor.
2025-02-07 12:30:07.880 Uncaught app execution
Traceback (most recent call last):
  File "/opt/anaconda3/envs/videolingo/lib/python3.10/site-packages/streamlit/runtime/scriptrunner/exec_code.py", line 121, in exec_func_with_error_handling
    result = func()
  File "/opt/anaconda3/envs/videolingo/lib/python3.10/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 591, in code_to_exec
    exec(code, module.__dict__)
  File "/Users/derickqin/Projects/VideoLingo/st.py", line 124, in <module>
    main()
  File "/Users/derickqin/Projects/VideoLingo/st.py", line 120, in main
    text_processing_section()
  File "/Users/derickqin/Projects/VideoLingo/st.py", line 33, in text_processing_section
    process_text()
  File "/Users/derickqin/Projects/VideoLingo/st.py", line 47, in process_text
    step2_whisperX.transcribe()
  File "/Users/derickqin/Projects/VideoLingo/core/step2_whisperX.py", line 64, in transcribe
    result = ts(whisper_audio, start, end)
  File "/Users/derickqin/Projects/VideoLingo/core/all_whisper_methods/whisperX_local.py", line 103, in transcribe_audio
    result = model.transcribe(audio_segment, batch_size=batch_size, print_progress=True)
  File "/Users/derickqin/.local/lib/python3.10/site-packages/whisperx/asr.py", line 186, in transcribe
    vad_segments = self.vad_model({"waveform": torch.from_numpy(audio).unsqueeze(0), "sample_rate": SAMPLE_RATE})
  File "/Users/derickqin/.local/lib/python3.10/site-packages/pyannote/audio/core/pipeline.py", line 320, in __call__
    file = Audio.validate_file(file)
  File "/Users/derickqin/.local/lib/python3.10/site-packages/pyannote/audio/core/io.py", line 155, in validate_file
    raise ValueError(
ValueError: 'waveform' must be provided as a (channel, time) torch Tensor.
```
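Note that the failing segment runs from 1821.55s to 1821.64s, i.e. roughly 0.09 seconds of audio. One defensive option, independent of any split-length change, is to skip segments that are too short before handing them to WhisperX. This is only a sketch, not VideoLingo's actual code; the function name and threshold are assumptions:

```python
import numpy as np

SAMPLE_RATE = 16000       # WhisperX's default sample rate
MIN_SEGMENT_SEC = 0.5     # hypothetical threshold; tune as needed

def is_transcribable(audio_segment: np.ndarray,
                     sample_rate: int = SAMPLE_RATE,
                     min_sec: float = MIN_SEGMENT_SEC) -> bool:
    """Return False for empty or near-empty mono segments, which
    would otherwise reach the VAD stage and trigger the
    '(channel, time) torch Tensor' ValueError."""
    return audio_segment.ndim == 1 and len(audio_segment) >= int(min_sec * sample_rate)
```

A guard like `if not is_transcribable(audio_segment): continue` placed before the `model.transcribe(...)` call in `whisperX_local.py` would then silently drop the problematic tail segment instead of crashing.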

Until an official fix is released, here is a temporary workaround. In `core/all_whisper_methods/audio_preprocess.py`, change

```python
def split_audio(audio_file: str, target_len: int = 30*60, win: int = 60) -> List[Tuple[float, float]]:
    # 30 min 16000 Hz 96kbps ~ 22MB < 25MB required by whisper
    print("[bold blue]🔪 Starting audio segmentation...[/]")
```

to

```python
def split_audio(audio_file: str, target_len: int = 20*60, win: int = 60) -> List[Tuple[float, float]]:
    # 20 min 16000 Hz 96kbps ~ 15MB < 25MB required by whisper
    print("[bold blue]🔪 Starting audio segmentation...[/]")
```

This reduces `target_len` from `30*60` to `20*60`, splitting the audio into 20-minute segments. Note, however, that if your input video is only slightly longer than 20 minutes, the same error can still occur, since the split would again produce a near-empty trailing segment.
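A more robust alternative to shrinking `target_len` is to make the splitter merge a tiny trailing chunk into the previous one, so no near-empty segment ever reaches WhisperX regardless of the video's length. A minimal sketch of that idea (`split_ranges` and `min_tail` are hypothetical names, not part of VideoLingo, and this ignores the silence-search window the real `split_audio` uses):

```python
from typing import List, Tuple

def split_ranges(total_len: float, target_len: float = 30 * 60,
                 min_tail: float = 1.0) -> List[Tuple[float, float]]:
    """Split [0, total_len) into chunks of at most target_len seconds.
    A trailing chunk shorter than min_tail is merged into its
    predecessor so WhisperX never receives a near-empty waveform."""
    ranges: List[Tuple[float, float]] = []
    start = 0.0
    while start < total_len:
        end = min(start + target_len, total_len)
        ranges.append((start, end))
        start = end
    # Merge a tiny tail segment into the previous one.
    if len(ranges) > 1 and ranges[-1][1] - ranges[-1][0] < min_tail:
        _, last_end = ranges.pop()
        prev_start, _ = ranges.pop()
        ranges.append((prev_start, last_end))
    return ranges
```

With this approach a 1821.64-second video yields one chunk of 1800s and one of 21.64s as usual, while a 1800.05-second video yields a single 1800.05s chunk instead of a 0.05s fragment that would crash the VAD model.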
