
Evaluation details about Qwen2-VL-72B #7

Open

lan-lw opened this issue Oct 16, 2024 · 7 comments

@lan-lw

lan-lw commented Oct 16, 2024

Thank you for your interesting work!

For the Qwen2-VL-72B model, are you using the whole video or just sampling 48 frames per video? If you used the whole video, what fps did you use?

@huangshiyu13
Member

We use 48 frames for each video.

@lan-lw
Author

lan-lw commented Oct 17, 2024

Thank you for your response.

Are you uniformly sampling the 48 frames? I am wondering whether, for tasks like temporal grounding, 48 frames are enough for precise localization.

@huangshiyu13
Member

> Thank you for your response.
>
> Are you uniformly sampling the 48 frames? I am wondering whether, for tasks like temporal grounding, 48 frames are enough for precise localization.

Yes, we sample the 48 frames uniformly. We use 48 frames because that is the maximum number of frames we can feed into the model; exceeding 48 frames can easily cause out-of-memory errors.
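
For anyone wondering what "sample 48 frames uniformly" looks like in code, here is a minimal sketch. It is not the actual evaluation harness; the use of OpenCV and the function name below are just for illustration:

```python
# Illustrative only: take evenly spaced frame indices over the whole video.
import cv2
import numpy as np

def sample_frames_uniformly(video_path: str, num_frames: int = 48):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # num_frames evenly spaced indices from the first to the last frame.
    indices = np.linspace(0, total - 1, num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```

For an hour-long video this works out to roughly one sampled frame every 75 seconds, which is what the rest of this thread discusses.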

@IssacCyj

> Thank you for your response.
> Are you uniformly sampling the 48 frames? I am wondering whether, for tasks like temporal grounding, 48 frames are enough for precise localization.

> Yes, we sample the 48 frames uniformly. We use 48 frames because that is the maximum number of frames we can feed into the model; exceeding 48 frames can easily cause out-of-memory errors.

Just to double-check: are you sampling 48 frames from the hour-long video, or did I misunderstand the answer? That means you only use about one frame per 1-2 minutes of video, which sounds confusing to me, given that the temporal grounding task has second-level annotations.

@huangshiyu13
Member

> Thank you for your response.
> Are you uniformly sampling the 48 frames? I am wondering whether, for tasks like temporal grounding, 48 frames are enough for precise localization.

> Yes, we sample the 48 frames uniformly. We use 48 frames because that is the maximum number of frames we can feed into the model; exceeding 48 frames can easily cause out-of-memory errors.

> Just to double-check: are you sampling 48 frames from the hour-long video, or did I misunderstand the answer? That means you only use about one frame per 1-2 minutes of video, which sounds confusing to me, given that the temporal grounding task has second-level annotations.

Yes, the Qwen2-VL-72B model samples an average of 48 frames per video. The main reason is that we cannot feed more frames into the deployed model. If you can evaluate Qwen2-VL-72B with more frames, you are welcome to submit the evaluation results to us; after reviewing them, we can update the leaderboard with the new results.
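
For anyone who wants to try a larger frame budget, the standard Hugging Face recipe for Qwen2-VL lets you request a frame count when preprocessing the video. This is only a sketch under the assumption that the public Qwen/Qwen2-VL-72B-Instruct checkpoint and qwen-vl-utils are used; it is not the deployed evaluation setup, the prompt and the nframes value are illustrative (check the qwen-vl-utils docs for the exact nframes/fps options), and 72B needs multi-GPU inference:

```python
# Sketch: querying Qwen2-VL with a chosen number of sampled frames.
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2-VL-72B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        # "nframes" asks qwen-vl-utils to sample that many frames from the video.
        {"type": "video", "video": "file:///path/to/video.mp4", "nframes": 96},
        {"type": "text", "text": "At what time does the described event happen?"},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

out = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
```

Whether more than 48 frames fits in memory depends on the per-frame resolution cap (max_pixels) and the available GPUs.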

@IssacCyj

Thanks for the quick response!

@IceFlameWorm

> Thank you for your response.
> Are you uniformly sampling the 48 frames? I am wondering whether, for tasks like temporal grounding, 48 frames are enough for precise localization.

> Yes, we sample the 48 frames uniformly. We use 48 frames because that is the maximum number of frames we can feed into the model; exceeding 48 frames can easily cause out-of-memory errors.

> Just to double-check: are you sampling 48 frames from the hour-long video, or did I misunderstand the answer? That means you only use about one frame per 1-2 minutes of video, which sounds confusing to me, given that the temporal grounding task has second-level annotations.

> Yes, the Qwen2-VL-72B model samples an average of 48 frames per video. The main reason is that we cannot feed more frames into the deployed model. If you can evaluate Qwen2-VL-72B with more frames, you are welcome to submit the evaluation results to us; after reviewing them, we can update the leaderboard with the new results.

So the Qwen2-VL-72B results on these second-level tasks seem quite doubtful. If most of the movie content is dropped by the downsampling, I can't see a better explanation than "hallucination" for the model producing outputs that match the ground truth.
