
IndexError in Whisper model: Index out of bounds during token timestamp extraction #12

Open · GrahLnn opened this issue Nov 21, 2024 · 4 comments

GrahLnn commented Nov 21, 2024

I tried to transcribe an hour-long audio file, but I got the error below. A two-minute test worked well, so I wanted to try longer audio. Is there any way to fix this? Thank you.

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline


def transcribe_audio(file_path):
    # Run on GPU in float16 when available, otherwise CPU in float32.
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    print(f"{device=}")
    torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

    model_id = "nyrahealth/CrisperWhisper"

    model = AutoModelForSpeechSeq2Seq.from_pretrained(
        model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
    )
    model.to(device)

    processor = AutoProcessor.from_pretrained(model_id)

    # Chunked long-form transcription with word-level timestamps.
    pipe = pipeline(
        "automatic-speech-recognition",
        model=model,
        tokenizer=processor.tokenizer,
        feature_extractor=processor.feature_extractor,
        chunk_length_s=30,
        stride_length_s=4,
        batch_size=1,
        return_timestamps="word",
        torch_dtype=torch_dtype,
        device=device,
    )

    result = pipe(file_path)
    return result
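
For context, I call it roughly like this; with return_timestamps="word" the result should contain the full text plus per-word chunks (the path is just an example):

res = transcribe_audio("audio.mp3")  # example path; any local audio file
print(res["text"])
for chunk in res["chunks"]:
    # each chunk pairs a word with its (start, end) timestamp
    print(chunk["timestamp"], chunk["text"])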

and the error:

Traceback (most recent call last):
  File "C:\Users\grahlnn\test\CrisperWhisper.py", line 71, in <module>
    res = transcribe_audio(
          ^^^^^^^^^^^^^^^^^
  File "C:\Users\grahlnn\test\CrisperWhisper.py", line 66, in transcribe_audio
    result = pipe(file_path)
             ^^^^^^^^^^^^^^^
  File "C:\Users\grahlnn\test\.venv\Lib\site-packages\transformers\pipelines\automatic_speech_recognition.py", line 283, in __call__
    return super().__call__(inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\grahlnn\test\.venv\Lib\site-packages\transformers\pipelines\base.py", line 1294, in __call__
    return next(
           ^^^^^
  File "C:\Users\grahlnn\test\.venv\Lib\site-packages\transformers\pipelines\pt_utils.py", line 124, in __next__
    item = next(self.iterator)
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\grahlnn\test\.venv\Lib\site-packages\transformers\pipelines\pt_utils.py", line 269, in __next__
    processed = self.infer(next(self.iterator), **self.params)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\grahlnn\test\.venv\Lib\site-packages\transformers\pipelines\base.py", line 1209, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\grahlnn\test\.venv\Lib\site-packages\transformers\pipelines\automatic_speech_recognition.py", line 515, in _forward
    tokens = self.model.generate(
             ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\grahlnn\test\.venv\Lib\site-packages\transformers\models\whisper\generation_whisper.py", line 684, in generate
    ) = self.generate_with_fallback(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\grahlnn\test\.venv\Lib\site-packages\transformers\models\whisper\generation_whisper.py", line 862, in generate_with_fallback
    seek_sequences, seek_outputs = self._postprocess_outputs(
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\grahlnn\test\.venv\Lib\site-packages\transformers\models\whisper\generation_whisper.py", line 963, in _postprocess_outputs
    seek_outputs["token_timestamps"] = self._extract_token_timestamps(
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\grahlnn\test\.venv\Lib\site-packages\transformers\models\whisper\generation_whisper.py", line 221, in _extract_token_timestamps
    [
  File "C:\Users\grahlnn\test\.venv\Lib\site-packages\transformers\models\whisper\generation_whisper.py", line 222, in <listcomp>
    torch.index_select(weights[:, :, i, :], dim=0, index=beam_indices[:, i])
                       ~~~~~~~^^^^^^^^^^^^
IndexError: index 447 is out of bounds for dimension 2 with size 447
LaurinmyReha (Contributor) commented Nov 21, 2024

Hey,

The long-form logic is something we will work on next, since the transformers implementation is not ideal for our model.

In the meantime, as a quick fix, you can try installing our custom fork and see if that fixes your problem:
pip install git+https://github.com/nyrahealth/transformers.git@crisper_whisper
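
Afterwards you can check which copy of transformers is actually being imported (a quick sketch):

import transformers
print(transformers.__version__)  # version string of the installed package
print(transformers.__file__)     # path of the installation in use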

If this does not do it, let me know and we will look into it further together.

Best,

Laurin

GrahLnn (Author) commented Nov 22, 2024

Thank you for your help. Now there is a new error:

Traceback (most recent call last):
  File "C:\Users\grahl\criwhisper\test.py", line 76, in <module>
    res = transcribe_audio(
          ^^^^^^^^^^^^^^^^^
  File "C:\Users\grahl\criwhisper\test.py", line 71, in transcribe_audio
    result = pipe(file_path)
             ^^^^^^^^^^^^^^^
  File "C:\Users\grahl\criwhisper\.venv\Lib\site-packages\transformers\pipelines\automatic_speech_recognition.py", line 292, in __call__
    return super().__call__(inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\grahl\criwhisper\.venv\Lib\site-packages\transformers\pipelines\base.py", line 1154, in __call__
    return next(
           ^^^^^
  File "C:\Users\grahl\criwhisper\.venv\Lib\site-packages\transformers\pipelines\pt_utils.py", line 124, in __next__
    item = next(self.iterator)
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\grahl\criwhisper\.venv\Lib\site-packages\transformers\pipelines\pt_utils.py", line 266, in __next__
    processed = self.infer(next(self.iterator), **self.params)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\grahl\criwhisper\.venv\Lib\site-packages\transformers\pipelines\base.py", line 1068, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\grahl\criwhisper\.venv\Lib\site-packages\transformers\pipelines\automatic_speech_recognition.py", line 507, in _forward
    tokens = self.model.generate(
             ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\grahl\criwhisper\.venv\Lib\site-packages\transformers\models\whisper\generation_whisper.py", line 624, in generate
    outputs["token_timestamps"] = self._extract_token_timestamps(
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\grahl\criwhisper\.venv\Lib\site-packages\transformers\models\whisper\generation_whisper.py", line 316, in _extract_token_timestamps
    timestamps[batch_idx, 1:] = torch.tensor(jump_times)
    ~~~~~~~~~~^^^^^^^^^^^^^^^
RuntimeError: The expanded size of the tensor (4) must match the existing size (5) at non-singleton dimension 0.  Target sizes: [4].  Tensor sizes: [5]

david-gimeno commented Nov 24, 2024

Hi @GrahLnn,

I experienced the same problem some time ago. After diving into the code, I found a solution. Perhaps it is not the best, but it works :') Step by step:

  1. Find out where your transformers package is located:

import transformers
print(transformers.__file__)

  2. Edit the script ${package_rootdir}/models/whisper/generation_whisper.py, replacing line 316 with:

# print("SURPRISE:", timestamps.shape, batch_idx, torch.tensor(jump_times).shape)
if timestamps.shape[-1] == len(jump_times):
    # the jump times already fill every slot, so write from index 0
    timestamps[batch_idx, 0:] = torch.tensor(jump_times)
else:
    # original behavior: leave the reserved first slot untouched
    timestamps[batch_idx, 1:] = torch.tensor(jump_times)
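
For intuition, here is a minimal sketch of the mismatch this check works around (the shapes are illustrative, matching the error message above):

import torch

timestamps = torch.zeros(1, 5)          # one slot per token; slot 0 normally reserved
jump_times = [0.0, 0.4, 1.1, 1.9, 2.6]  # illustrative DTW jump times, same length as the row

# the stock line assumes one fewer jump time than slots:
# timestamps[0, 1:] = torch.tensor(jump_times)  # RuntimeError: target [4] vs tensor [5]

# checking the lengths first handles both cases:
if timestamps.shape[-1] == len(jump_times):
    timestamps[0, 0:] = torch.tensor(jump_times)
else:
    timestamps[0, 1:] = torch.tensor(jump_times)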

Depending on the length of your audio file, for some reason I don't remember (or perhaps never understood), the code fails. I hope this helps you :)

GrahLnn (Author) commented Nov 26, 2024

Thank you for your help, @david-gimeno. It's working.
