
'waveform' must be provided as a (channel, time) torch Tensor. #363

Open
xiaobai40009 opened this issue Jan 12, 2025 · 1 comment

@xiaobai40009

```
note: You will see Progress if working correctly
WhisperX processing error: 'waveform' must be provided as a (channel, time) torch Tensor.
```

@bmjlgenhao2

This might be because VideoLingo splits the input video into smaller segments, each no longer than 30 minutes (1800 seconds), and then runs Whisper on each segment. When your video's duration is only slightly over 1800 seconds, the second segment produced by the split may contain almost no audio, causing WhisperX to fail on it. Here is the error message I encountered:

```
▶️ Starting WhisperX for segment 1821.55s to 1821.64s...
📥 Using WHISPER model from HuggingFace: large-v3 ...
**You can ignore warning of `Model was trained with torch 1.10.0+cu102, yours is 2.0.0+cu118...`**
Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.3.3. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../../.local/lib/python3.10/site-packages/whisperx/assets/pytorch_model.bin`
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.1.2. Bad things might happen unless you revert torch to 1.x.
note: You will see Progress if working correctly
WhisperX processing error: 'waveform' must be provided as a (channel, time) torch Tensor.
2025-02-07 12:30:07.880 Uncaught app execution
Traceback (most recent call last):
  File "/opt/anaconda3/envs/videolingo/lib/python3.10/site-packages/streamlit/runtime/scriptrunner/exec_code.py", line 121, in exec_func_with_error_handling
    result = func()
  File "/opt/anaconda3/envs/videolingo/lib/python3.10/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 591, in code_to_exec
    exec(code, module.__dict__)
  File "/Users/derickqin/Projects/VideoLingo/st.py", line 124, in <module>
    main()
  File "/Users/derickqin/Projects/VideoLingo/st.py", line 120, in main
    text_processing_section()
  File "/Users/derickqin/Projects/VideoLingo/st.py", line 33, in text_processing_section
    process_text()
  File "/Users/derickqin/Projects/VideoLingo/st.py", line 47, in process_text
    step2_whisperX.transcribe()
  File "/Users/derickqin/Projects/VideoLingo/core/step2_whisperX.py", line 64, in transcribe
    result = ts(whisper_audio, start, end)
  File "/Users/derickqin/Projects/VideoLingo/core/all_whisper_methods/whisperX_local.py", line 103, in transcribe_audio
    result = model.transcribe(audio_segment, batch_size=batch_size, print_progress=True)
  File "/Users/derickqin/.local/lib/python3.10/site-packages/whisperx/asr.py", line 186, in transcribe
    vad_segments = self.vad_model({"waveform": torch.from_numpy(audio).unsqueeze(0), "sample_rate": SAMPLE_RATE})
  File "/Users/derickqin/.local/lib/python3.10/site-packages/pyannote/audio/core/pipeline.py", line 320, in __call__
    file = Audio.validate_file(file)
  File "/Users/derickqin/.local/lib/python3.10/site-packages/pyannote/audio/core/io.py", line 155, in validate_file
    raise ValueError(
ValueError: 'waveform' must be provided as a (channel, time) torch Tensor.
```
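Note that the failing segment runs from 1821.55s to 1821.64s, i.e. roughly 0.09 seconds of audio. One defensive option, independent of any split-length change, is to skip segments that are too short before handing them to WhisperX. This is only a sketch, not VideoLingo's actual code; the function name and threshold are assumptions:

```python
import numpy as np

SAMPLE_RATE = 16000       # WhisperX's default sample rate
MIN_SEGMENT_SEC = 0.5     # hypothetical threshold; tune as needed

def is_transcribable(audio_segment: np.ndarray,
                     sample_rate: int = SAMPLE_RATE,
                     min_sec: float = MIN_SEGMENT_SEC) -> bool:
    """Return False for empty or near-empty mono segments, which
    would otherwise reach the VAD stage and trigger the
    '(channel, time) torch Tensor' ValueError."""
    return audio_segment.ndim == 1 and len(audio_segment) >= int(min_sec * sample_rate)
```

A guard like `if not is_transcribable(audio_segment): continue` placed before the `model.transcribe(...)` call in `whisperX_local.py` would then silently drop the problematic tail segment instead of crashing.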

Until an official fix is released, here is a temporary workaround. In `core/all_whisper_methods/audio_preprocess.py`, change

```python
def split_audio(audio_file: str, target_len: int = 30*60, win: int = 60) -> List[Tuple[float, float]]:
    # 30 min 16000 Hz 96kbps ~ 22MB < 25MB required by whisper
    print("[bold blue]🔪 Starting audio segmentation...[/]")
```

to

```python
def split_audio(audio_file: str, target_len: int = 20*60, win: int = 60) -> List[Tuple[float, float]]:
    # 20 min 16000 Hz 96kbps ~ 15MB < 25MB required by whisper
    print("[bold blue]🔪 Starting audio segmentation...[/]")
```

This reduces `target_len` from `30*60` to `20*60`, splitting the audio into 20-minute segments. Note, however, that if your input video is only slightly longer than 20 minutes, the same error can still occur, since the split would again produce a near-empty trailing segment.
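A more robust alternative to shrinking `target_len` is to make the splitter merge a tiny trailing chunk into the previous one, so no near-empty segment ever reaches WhisperX regardless of the video's length. A minimal sketch of that idea (`split_ranges` and `min_tail` are hypothetical names, not part of VideoLingo, and this ignores the silence-search window the real `split_audio` uses):

```python
from typing import List, Tuple

def split_ranges(total_len: float, target_len: float = 30 * 60,
                 min_tail: float = 1.0) -> List[Tuple[float, float]]:
    """Split [0, total_len) into chunks of at most target_len seconds.
    A trailing chunk shorter than min_tail is merged into its
    predecessor so WhisperX never receives a near-empty waveform."""
    ranges: List[Tuple[float, float]] = []
    start = 0.0
    while start < total_len:
        end = min(start + target_len, total_len)
        ranges.append((start, end))
        start = end
    # Merge a tiny tail segment into the previous one.
    if len(ranges) > 1 and ranges[-1][1] - ranges[-1][0] < min_tail:
        _, last_end = ranges.pop()
        prev_start, _ = ranges.pop()
        ranges.append((prev_start, last_end))
    return ranges
```

With this approach a 1821.64-second video yields one chunk of 1800s and one of 21.64s as usual, while a 1800.05-second video yields a single 1800.05s chunk instead of a 0.05s fragment that would crash the VAD model.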
