
Evaluation details about Qwen2-VL-72B #7

Open

lan-lw opened this issue Oct 16, 2024 · 7 comments

@lan-lw

lan-lw commented Oct 16, 2024

Thank you for your interesting work!

For the Qwen2-VL-72B model, are you using the whole video or just sampling 48 frames per video? If you used the whole video, what fps did you use?

@huangshiyu13
Member

We use 48 frames for each video.

@lan-lw
Author

lan-lw commented Oct 17, 2024

Thank you for your response.

Are you uniformly sampling the 48 frames? I am wondering whether, for tasks like temporal grounding, 48 frames are enough for precise localization.

@huangshiyu13
Member

> Thank you for your response.
>
> Are you uniformly sampling the 48 frames? I am wondering whether, for tasks like temporal grounding, 48 frames are enough for precise localization.

Yes, we sample the 48 frames uniformly. We use 48 frames because that is the maximum number of frames we can feed into the model; exceeding 48 frames can easily cause out-of-memory errors.
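
For anyone wondering what "sample 48 frames uniformly" looks like in code, here is a minimal sketch. It is not the actual evaluation harness; the use of OpenCV and the function name below are just for illustration:

```python
# Illustrative only: take evenly spaced frame indices over the whole video.
import cv2
import numpy as np

def sample_frames_uniformly(video_path: str, num_frames: int = 48):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # num_frames evenly spaced indices from the first to the last frame.
    indices = np.linspace(0, total - 1, num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```

For an hour-long video this works out to roughly one sampled frame every 75 seconds, which is what the rest of this thread discusses.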

@IssacCyj

> Thank you for your response.
> Are you uniformly sampling the 48 frames? I am wondering whether, for tasks like temporal grounding, 48 frames are enough for precise localization.

> Yes, we sample the 48 frames uniformly. We use 48 frames because that is the maximum number of frames we can feed into the model; exceeding 48 frames can easily cause out-of-memory errors.

Just to double-check: are you sampling 48 frames from the hour-long video, or did I misunderstand the answer? That means you only use about one frame per 1-2 minutes of video, which sounds confusing to me, given that the temporal grounding task has second-level annotations.

@huangshiyu13
Member

> Thank you for your response.
> Are you uniformly sampling the 48 frames? I am wondering whether, for tasks like temporal grounding, 48 frames are enough for precise localization.

> Yes, we sample the 48 frames uniformly. We use 48 frames because that is the maximum number of frames we can feed into the model; exceeding 48 frames can easily cause out-of-memory errors.

> Just to double-check: are you sampling 48 frames from the hour-long video, or did I misunderstand the answer? That means you only use about one frame per 1-2 minutes of video, which sounds confusing to me, given that the temporal grounding task has second-level annotations.

Yes, the Qwen2-VL-72B model samples an average of 48 frames per video. The main reason is that we cannot feed more frames into the deployed model. If you can evaluate Qwen2-VL-72B with more frames, you are welcome to submit the evaluation results to us; after reviewing them, we can update the leaderboard with the new results.
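
For anyone who wants to try a larger frame budget, the standard Hugging Face recipe for Qwen2-VL lets you request a frame count when preprocessing the video. This is only a sketch under the assumption that the public Qwen/Qwen2-VL-72B-Instruct checkpoint and qwen-vl-utils are used; it is not the deployed evaluation setup, the prompt and the nframes value are illustrative (check the qwen-vl-utils docs for the exact nframes/fps options), and 72B needs multi-GPU inference:

```python
# Sketch: querying Qwen2-VL with a chosen number of sampled frames.
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2-VL-72B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        # "nframes" asks qwen-vl-utils to sample that many frames from the video.
        {"type": "video", "video": "file:///path/to/video.mp4", "nframes": 96},
        {"type": "text", "text": "At what time does the described event happen?"},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

out = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
```

Whether more than 48 frames fits in memory depends on the per-frame resolution cap (max_pixels) and the available GPUs.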

@IssacCyj

Thanks for the quick response!

@IceFlameWorm

> Thank you for your response.
> Are you uniformly sampling the 48 frames? I am wondering whether, for tasks like temporal grounding, 48 frames are enough for precise localization.

> Yes, we sample the 48 frames uniformly. We use 48 frames because that is the maximum number of frames we can feed into the model; exceeding 48 frames can easily cause out-of-memory errors.

> Just to double-check: are you sampling 48 frames from the hour-long video, or did I misunderstand the answer? That means you only use about one frame per 1-2 minutes of video, which sounds confusing to me, given that the temporal grounding task has second-level annotations.

> Yes, the Qwen2-VL-72B model samples an average of 48 frames per video. The main reason is that we cannot feed more frames into the deployed model. If you can evaluate Qwen2-VL-72B with more frames, you are welcome to submit the evaluation results to us; after reviewing them, we can update the leaderboard with the new results.

So the Qwen2-VL-72B results on these second-level tasks seem quite doubtful. If most of the movie content is dropped by the downsampling, I can't see a better explanation than "hallucination" for the model producing outputs that match the ground truth.
