Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parse /proc/<pid>/smaps_rollup if present && Reduce string concatenaion operations #11676

Merged

Conversation

todor-ivanov
Copy link
Contributor

@todor-ivanov todor-ivanov commented Jul 25, 2023

Fixes #11667

Status

ready

Description

The issue #11667 can be solved in 3 different ways. This is the second out of 3 suggested fixes.

  • The current one parses /proc/<pid>/smaps_rollup file if present, (we need a kernel version 4.+ for this), and falls back to parsing /proc/<pid>/smaps file if the accumulated statistics file is missing.

According to the Linux kernel documentation page about /proc file system:

 smaps_rollup	Accumulated smaps stats for all mappings of the process.  This
		can be derived from smaps, but is faster and more convenient

Since this is a direct command execution at runtime, I was able to check the result directly line by line:
Here follows the full test I did manually for this change:

$ ipython 
Python 3.9.14 (main, Jan  9 2023, 00:00:00) 
Type 'copyright', 'credits' or 'license' for more information
IPython 8.13.2 -- An enhanced Interactive Python. Type '?' for help.
%reload_ext autoreload
%autoreload 2

In [1]: import os

In [2]: import WMCore.Algorithms.SubprocessAlgos as subprocessAlgos

In [3]: monitorBase = "pid=%i; ps -p $pid -o pid,ppid,rss,pcpu,pmem,cmd -ww | grep $pid"

In [4]: pssMemoryCommand = "pid=%i; awk '/^Pss:/ {print $2}' /proc/$pid/smaps_rollup 2>/dev/null || awk '/^Pss:/ {pss += $2} END {print pss}' /proc/$pid/smaps"

In [5]: subprocessAlgos.runCommand(monitorBase % os.getpid())
Out[5]: 
('   4044    2855 58376  1.4  3.6 /data/WMCore.venv3/bin/python /data/WMCore.venv3/bin/ipython\n',
 '',
 0)

In [6]: subprocessAlgos.runCommand(pssMemoryCommand % os.getpid())
Out[6]: ('51534\n', '', 0)

As one can see the so modified monitorBase and pssMemoryCommand lines, return exactly the same tuples as output, just like the original code, but are using the accumulated process statistics file instead. I have checked the values returned multiple time - they were the same regardless of which file we use.

Is it backward compatible (if not, which system it affects?)

YES

Related PRs

None

External dependencies / deployment changes

No

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 6 warnings and errors that must be fixed
    • 2 warnings
    • 28 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 3 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14373/artifact/artifacts/PullRequestReport.html

Copy link
Contributor

@amaltaro amaltaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good to me Todor and I think we should actually deploy this fix in the production agents while we keep considering the psutils approach.
I am going to patch vocms0192 to test this PR.

@todor-ivanov
Copy link
Contributor Author

todor-ivanov commented Jul 26, 2023

Thanks for the review and test @amaltaro
Please let me now about the results from the tests in vocms0192 and how we should proceed on patching the production agents.

@amaltaro
Copy link
Contributor

The 5 production agents running on 2.2.3.1 (RelVal + production) are getting this patch today.
In addition, we also need to backport it to the branch 2.2.3_wmagent, such that we can have this in in the next WMAgent patch release (likely 2.2.3.2)

@VinInn
Copy link

VinInn commented Jul 26, 2023

If the job is killed because it exceeds the memory (or time) limit it would be useful to dump as much info as possible in particular
/proc/<pid>/smaps_rollup and global info as well such as /proc/meminfo or vmstat -s

@todor-ivanov
Copy link
Contributor Author

Hi @VinInn
This is just a quick fix, intended to be used only for patching the currently running agents and to resolve the issues with what we already have in the production system queue (but not in condor unfortunately).

There is another PR #11677 (comment) which is supposed to give more detailed and better resource view based on psutil as you suggested. But we need to overcome the package distribution problem at runtime, because if possible, we'd prefer to avoid relying on the one provided with CMSSW due to the extra validation steps it would require. We are already looking into a better way for this and hopefully improving the WMCore runtime environment in near future as well. This all will be discussed during our next meeting.

FYI: @amaltaro

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

read smaps_rollup instead of parsing smaps using "awk"
4 participants