Job memory usage calculation -- (PSS calculation issue for new kernel) #11687

Closed
z4027163 opened this issue Aug 10, 2023 · 1 comment

Comments

@z4027163

There is a large number of failures in Run3 Rereco WFs running at some sites with newer kernel versions. Details: Failures in Run 3 data reprocessing

The detailed reason is in that ticket as well. In short, Linux v6.0+ includes an additional field in smaps, Pss_Dirty. This field gets added into our PSS calculation on top of Pss, so the computed PSS is inflated and jobs are killed early for exceeding the memory threshold.
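To illustrate the effect described above, here is a minimal sketch. It assumes (this is not confirmed in this thread) that the inflated value comes from summing every smaps line whose name starts with `Pss`. The function names and the sample values are hypothetical and only serve to show the difference between prefix matching and exact matching on the `Pss:` field.

```python
# Hypothetical smaps excerpt in the layout used by Linux v6.0+; the values are made up.
SAMPLE = """\
Rss:                 120 kB
Pss:                 100 kB
Pss_Dirty:            40 kB
Shared_Clean:         20 kB
"""


def sum_pss_prefix_kb(smaps_text):
    """Assumed buggy behaviour: sum every field whose name starts with 'Pss'."""
    total = 0
    for line in smaps_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0].startswith("Pss"):
            total += int(parts[1])
    return total


def sum_pss_exact_kb(smaps_text):
    """Sum only the 'Pss:' field, ignoring 'Pss_Dirty:' and similar fields."""
    total = 0
    for line in smaps_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "Pss:":
            total += int(parts[1])
    return total


print(sum_pss_prefix_kb(SAMPLE))  # 140: Pss_Dirty is counted on top of Pss
print(sum_pss_exact_kb(SAMPLE))   # 100: only the Pss field is counted
```

On a kernel older than v6.0 the two functions would agree, since there is no Pss_Dirty line to pick up.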

We would like to find a solution, such as using RSS instead as the memory metric for killing jobs. Otherwise, sites with new kernels will keep overestimating memory usage and ending jobs early.

@todor-ivanov
Contributor

Thanks for creating this issue, @z4027163.
There is already another WMCore issue related to this problem, #11667, and we have already provided two fixes for it:

The long-term solution requires changing the mechanism by which we distribute runtime code. So even though this fix, which uses the more robust psutil module, is ready, it will not go into production until we converge on the best way to distribute the library at runtime.
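For readers unfamiliar with psutil, the following is a rough sketch of how PSS can be read through it. This is not the actual WMCore fix; the helper name `pss_bytes` is hypothetical. It relies on `psutil.Process.memory_full_info()`, which reports a `pss` field on Linux only, and returns `None` when psutil is not installed or the field is unavailable.

```python
import os


def pss_bytes(pid=None):
    """Return the PSS of a process in bytes, or None if it cannot be determined."""
    try:
        import psutil  # third-party; must be available at runtime
    except ImportError:
        return None
    proc = psutil.Process(pid if pid is not None else os.getpid())
    try:
        info = proc.memory_full_info()
    except (psutil.AccessDenied, psutil.NoSuchProcess):
        return None
    # 'pss' is only present in the result on Linux.
    return getattr(info, "pss", None)


if __name__ == "__main__":
    print(pss_bytes())
```

Because psutil reads named fields rather than matching text patterns, a new kernel field such as Pss_Dirty should not change the reported PSS.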

I am closing this issue now and we should follow up on the original one.

@amaltaro closed this as not planned on Aug 11, 2023