Evaluation details about Qwen2-VL-72B #7
Thank you for your interesting work!
For the Qwen2-VL-72B model, are you using the whole video or just sampling 48 frames per video? If you used the whole video, what is the fps?

Comments
We use 48 frames for each video.
Thank you for your response. Are you uniformly sampling the 48 frames? For tasks like temporal grounding, I am wondering whether 48 frames are enough for precise localization.
Yes, we uniformly sample 48 frames. We used 48 because we found that 48 frames is the maximum we can feed into the model; going beyond that easily runs out of memory.
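For readers who want to reproduce this setup, below is a minimal sketch of uniform sampling of 48 frames from a video. It assumes the decord library for decoding, and the helper name is hypothetical; this is not the authors' actual evaluation code.

```python
# Minimal sketch of uniform frame sampling (assumption: decord for decoding;
# the helper name and the 48-frame budget mirror the discussion, not the authors' code).
import numpy as np
from decord import VideoReader


def sample_uniform_frames(video_path: str, num_frames: int = 48) -> np.ndarray:
    vr = VideoReader(video_path)
    total = len(vr)  # total decoded frames in the video
    # Evenly spaced indices spanning the whole video, clipped to the valid range.
    indices = np.linspace(0, total - 1, num=min(num_frames, total)).round().astype(int)
    # Returns an array of shape (num_frames, H, W, 3) in RGB.
    return vr.get_batch(indices).asnumpy()
```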
Just to double-check: are you sampling 48 frames from the hour-long video, or did I misunderstand the answer? That means only one frame every 1-2 minutes of video, which sounds confusing to me, given that the temporal grounding task has second-level annotations.
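To make that interval concrete, here is the quick arithmetic behind the estimate (the one-hour duration is illustrative):

```python
# Effective temporal resolution of 48 uniformly sampled frames (illustrative duration).
duration_s = 3600          # a one-hour video
num_frames = 48
interval_s = duration_s / num_frames
print(f"one frame every {interval_s:.0f} s (~{interval_s / 60:.2f} min)")
# -> one frame every 75 s (~1.25 min), far coarser than second-level annotations
```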
Yes, the Qwen2-VL-72B model samples an average of 48 frames per video. The main reason is that we cannot feed more frames into the deployed model. If you can evaluate Qwen2-VL-72B with more frames, you are welcome to submit the results to us; after reviewing them, we can update the leaderboard.
Thanks for the quick response!
So the Qwen2-VL-72B results on these second-level tasks seem quite doubtful. If most of the movie content is dropped after downsampling, I cannot see a better explanation than “hallucination” for model outputs that match the ground truth.