[WQ] enforce memory limits #3905

svandenhaute · 2024-07-23T09:15:54Z

svandenhaute
Jul 23, 2024

I'm dealing with a situation whereby a specific task requires (somewhat unexpectedly) much more memory than it should. Because of this, it triggers an OOM which completely kills the work_queue_worker process.
Is there a way to make this proceed more gracefully? Ideally, the OOM in one task does not kill all other tasks which are running on that particular worker...

svandenhaute · 2024-07-23T10:44:56Z

svandenhaute
Jul 23, 2024
Author

I've figured out a solution: cgroups.

In my particular case, it was especially easy. All tasks are executed within a container (apptainer or singularity), and the runtimes come with a --memory flag which determines the maximum amount of memory the container can use. If it gets exceeded, it will be killed 'gracefully' without affecting other running processes on the node.

I would close the issue, but perhaps this should be transformed into a discussion in order to keep it around.

0 replies

dthain · 2024-07-23T14:20:08Z

dthain
Jul 23, 2024
Maintainer

If a container solves your problem nicely, that's great.

For situations where a container cannot be deployed, we also have the Resource Monitor that can be enabled in WQ to observe, enforce, and report resource utilization.

0 replies

svandenhaute · 2024-07-23T14:29:33Z

svandenhaute
Jul 23, 2024
Author

The container was supposed to solve the problem, but only very recent versions of apptainer / singularity are apparently capable of working with cgroups. The one on my laptop works, and does what it's supposed to do, but the two clusters that I tested it with both didn't support it.

The next thing I tried was setting ulimit -v 1000 to e.g. restrict memory usage per process to 1MB. Unfortunately, this still leads to OOMs completely killing the work_queue_worker.

When enabling the resource monitor, the task still crashes the worker, instead of the worker killing the task...
See below for the debug log. There is 1 worker, which should execute four identical tasks, each of which will go OOM.

2024/07/23 17:26:07.58 work_queue_python[178502] tcp: listening on port 1026
2024/07/23 17:26:07.58 work_queue_python[178502] wq: Using default fast abort multiplier for 'default'.
2024/07/23 17:26:07.58 work_queue_python[178502] wq: workers connections -- known: 0, connecting: 0, available: 0.
2024/07/23 17:26:07.58 work_queue_python[178502] wq: Work Queue is listening on port 1026.
2024/07/23 17:26:07.58 work_queue_python[178502] rmonitor: locating resource monitor executable...
2024/07/23 17:26:07.58 work_queue_python[178502] rmonitor: trying executable at local directory.
2024/07/23 17:26:07.58 work_queue_python[178502] rmonitor: trying executable at PATH.
2024/07/23 17:26:07.58 work_queue_python[178502] rmonitor: monitoring process: 178502
2024/07/23 17:26:07.58 work_queue_python[178502] rmonitor: locating resource monitor executable...
2024/07/23 17:26:07.58 work_queue_python[178502] rmonitor: trying executable at local directory.
2024/07/23 17:26:07.58 work_queue_python[178502] rmonitor: trying executable at PATH.
2024/07/23 17:26:07.58 work_queue_python[178502] rmonitor: monitoring process: 178502
2024/07/23 17:26:07.58 work_queue_python[178502] wq: workers connections -- known: 0, connecting: 0, available: 0.
2024/07/23 17:26:07.58 work_queue_python[178502] wq: log enabled and is being written to /pfs/lustrep4/scratch/project_465000847/svandenhaute/test_psiflow/test_memory/psiflow_internal/000/CP2K/master_log
2024/07/23 17:26:07.58 work_queue_python[178502] wq: transactions log enabled and is being written to /pfs/lustrep4/scratch/project_465000847/svandenhaute/test_psiflow/test_memory/psiflow_internal/000/CP2K/transaction_log
2024/07/23 17:26:09.97 work_queue_python[178502] wq: Task 1 state change: UNKNOWN (0) to WAITING (1)
2024/07/23 17:26:09.97 work_queue_python[178502] wq: Task 2 state change: UNKNOWN (0) to WAITING (1)
2024/07/23 17:26:09.97 work_queue_python[178502] wq: Task 3 state change: UNKNOWN (0) to WAITING (1)
2024/07/23 17:26:09.97 work_queue_python[178502] wq: Task 4 state change: UNKNOWN (0) to WAITING (1)
2024/07/23 17:26:09.97 work_queue_python[178502] rmonitor: monitoring process: 178502
2024/07/23 17:26:14.45 work_queue_python[178502] tcp: accepted connection from 10.253.8.226 port 37638
2024/07/23 17:26:14.45 work_queue_python[178502] wq: worker 10.253.8.226:37638 connected
2024/07/23 17:26:14.45 work_queue_python[178502] wq: workers connections -- known: 0, connecting: 0, available: 0.
2024/07/23 17:26:14.45 work_queue_python[178502] wq: rx from unknown (10.253.8.226:37638): workqueue 11 nid002504 Linux x86_64 7.11.0
2024/07/23 17:26:14.45 work_queue_python[178502] wq: 1 workers are connected in total now
2024/07/23 17:26:14.45 work_queue_python[178502] wq: nid002504 (10.253.8.226:37638) running CCTools version 7.11.0 on Linux (operating system) with architecture x86_64 is ready
2024/07/23 17:26:14.45 work_queue_python[178502] wq: rx from nid002504 (10.253.8.226:37638): info worker-id worker-46a6c8f2e62174fa8f06722a16e5b9cd
2024/07/23 17:26:14.45 work_queue_python[178502] wq: rx from nid002504 (10.253.8.226:37638): alive
2024/07/23 17:26:14.45 work_queue_python[178502] wq: rx from nid002504 (10.253.8.226:37638): resource workers 1 1 1
2024/07/23 17:26:14.45 work_queue_python[178502] wq: rx from nid002504 (10.253.8.226:37638): resource disk 218602 218602 218602
2024/07/23 17:26:14.45 work_queue_python[178502] wq: rx from nid002504 (10.253.8.226:37638): resource memory 257207 257207 257207
2024/07/23 17:26:14.45 work_queue_python[178502] wq: rx from nid002504 (10.253.8.226:37638): resource gpus 0 0 0
2024/07/23 17:26:14.45 work_queue_python[178502] wq: rx from nid002504 (10.253.8.226:37638): resource cores 128 128 128
2024/07/23 17:26:14.45 work_queue_python[178502] wq: rx from nid002504 (10.253.8.226:37638): resource tag 0
2024/07/23 17:26:14.46 work_queue_python[178502] wq: rx from nid002504 (10.253.8.226:37638): info end_of_resource_update 0
2024/07/23 17:26:14.46 work_queue_python[178502] wq: rx from nid002504 (10.253.8.226:37638): info tasks_running 0
2024/07/23 17:26:14.46 work_queue_python[178502] wq: rx from nid002504 (10.253.8.226:37638): info worker-end-time 1721746575
2024/07/23 17:26:14.46 work_queue_python[178502] wq: nid002504 (10.253.8.226:37638) needs file /pfs/lustrep4/scratch/project_465000847/svandenhaute/test_psiflow/envs/_psiflow_env/lib/python3.10/site-packages/parsl/executors/workqueue/exec_parsl_function.py (offset 0 length 161) as 'file-0-d62753fb543c3d48a419cb609a4e16bc-exec_parsl_function.py'
2024/07/23 17:26:14.46 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): put file-0-d62753fb543c3d48a419cb609a4e16bc-exec_parsl_function.py 7802 0660
2024/07/23 17:26:14.47 work_queue_python[178502] wq: nid002504 (10.253.8.226:37638) received 0.01 MB in 0.01s (0.99s MB/s) average 0.99s MB/s
2024/07/23 17:26:14.47 work_queue_python[178502] wq: nid002504 (10.253.8.226:37638) needs file /pfs/lustrep4/scratch/project_465000847/svandenhaute/test_psiflow/test_memory/psiflow_internal/000/CP2K/function_data/0001/function (offset 0 length 131) as 'file-10-0c5986f34b862cab43c0141a2acf0dfa-function'
2024/07/23 17:26:14.47 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): put file-10-0c5986f34b862cab43c0141a2acf0dfa-function 9096 0660
2024/07/23 17:26:14.47 work_queue_python[178502] wq: nid002504 (10.253.8.226:37638) received 0.01 MB in 0.00s (10.47s MB/s) average 1.94s MB/s
2024/07/23 17:26:14.47 work_queue_python[178502] wq: nid002504 (10.253.8.226:37638) needs file /pfs/lustrep4/scratch/project_465000847/svandenhaute/test_psiflow/test_memory/psiflow_internal/000/CP2K/function_data/0001/map (offset 0 length 126) as 'file-11-e0cd25a8610c6ef5ce743f16fe215957-map'
2024/07/23 17:26:14.47 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): put file-11-e0cd25a8610c6ef5ce743f16fe215957-map 632 0660
2024/07/23 17:26:14.47 work_queue_python[178502] wq: nid002504 (10.253.8.226:37638) received 0.00 MB in 0.00s (0.93s MB/s) average 1.87s MB/s
2024/07/23 17:26:14.47 work_queue_python[178502] wq: nid002504 (10.253.8.226:37638) needs file /pfs/lustrep4/scratch/project_465000847/svandenhaute/test_psiflow/envs/_psiflow_env/bin/resource_monitor (offset 0 length 104) as 'file-0-e8888e56f9280855b99cae6b997db10e-resource_monitor'
2024/07/23 17:26:14.47 work_queue_python[178502] wq: nid002504 (10.253.8.226:37638) using conservative default average transfer rate of 1.00 MB/s
2024/07/23 17:26:14.47 work_queue_python[178502] wq: nid002504 (10.253.8.226:37638) will try up to 64 seconds to transfer this 6.77 MB file.
2024/07/23 17:26:14.47 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): put file-0-e8888e56f9280855b99cae6b997db10e-resource_monitor 6773376 0775
2024/07/23 17:26:14.54 work_queue_python[178502] wq: nid002504 (10.253.8.226:37638) received 6.77 MB in 0.07s (99.22s MB/s) average 87.44s MB/s
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): task 2
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): cmd 286
2024/07/23 17:26:14.54 work_queue_python[178502] wq: ./cctools-monitor --no-pprint --with-output-files=cctools-monitor -dall -o cctools-monitor.debug --with-time-series -L 'memory: 64301' -L 'disk: 54650' -L 'gpus: 0' -L 'cores: 32.000' -V 'task_id: 2' -V 'category: parsl-default' --sh "python3 exec_parsl_function.py map function result"
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): category parsl-default
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): cores 32.000
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): gpus 0
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): memory 64301
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): disk 54650
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): infile file-0-d62753fb543c3d48a419cb609a4e16bc-exec_parsl_function.py exec_parsl_function.py 1
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): infile file-10-0c5986f34b862cab43c0141a2acf0dfa-function function 0
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): infile file-11-e0cd25a8610c6ef5ce743f16fe215957-map map 0
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): infile file-0-e8888e56f9280855b99cae6b997db10e-resource_monitor cctools-monitor 1
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): outfile file-12-dfa66d583d38c6fa2be8fdd6474c2536-result result 0
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): outfile file-14-ea5c99dd4193378b5b93f3b63cc70231-wq-178502-task-2.summary cctools-monitor.summary 0
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): outfile file-15-5133d53425083442b11281c36a0711af-wq-178502-task-2.debug cctools-monitor.debug 0
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): outfile file-16-293a983f7e504d4f90208eb00a05039e-wq-178502-task-2.series cctools-monitor.series 0
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): end
2024/07/23 17:26:14.54 work_queue_python[178502] wq: nid002504 (10.253.8.226:37638) busy on 'python3 exec_parsl_function.py map function result'
2024/07/23 17:26:14.54 work_queue_python[178502] wq: Task 2 state change: WAITING (1) to RUNNING (2)
2024/07/23 17:26:14.54 work_queue_python[178502] wq: nid002504 (10.253.8.226:37638) needs file /pfs/lustrep4/scratch/project_465000847/svandenhaute/test_psiflow/test_memory/psiflow_internal/000/CP2K/function_data/0000/function (offset 0 length 131) as 'file-2-095cc85af2d95d8f5616a6fc3fd4fdc4-function'
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): put file-2-095cc85af2d95d8f5616a6fc3fd4fdc4-function 9096 0660
2024/07/23 17:26:14.54 work_queue_python[178502] wq: nid002504 (10.253.8.226:37638) received 0.01 MB in 0.00s (18.05s MB/s) average 86.98s MB/s
2024/07/23 17:26:14.54 work_queue_python[178502] wq: nid002504 (10.253.8.226:37638) needs file /pfs/lustrep4/scratch/project_465000847/svandenhaute/test_psiflow/test_memory/psiflow_internal/000/CP2K/function_data/0000/map (offset 0 length 126) as 'file-3-1d29dcd5beb008eaec885ab73cdeec47-map'
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): put file-3-1d29dcd5beb008eaec885ab73cdeec47-map 632 0660
2024/07/23 17:26:14.54 work_queue_python[178502] wq: nid002504 (10.253.8.226:37638) received 0.00 MB in 0.00s (1.42s MB/s) average 86.49s MB/s
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): task 1
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): cmd 286
2024/07/23 17:26:14.54 work_queue_python[178502] wq: ./cctools-monitor --no-pprint --with-output-files=cctools-monitor -dall -o cctools-monitor.debug --with-time-series -L 'memory: 64301' -L 'disk: 54650' -L 'gpus: 0' -L 'cores: 32.000' -V 'task_id: 1' -V 'category: parsl-default' --sh "python3 exec_parsl_function.py map function result"
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): category parsl-default
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): cores 32.000
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): gpus 0
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): memory 64301
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): disk 54650
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): infile file-0-d62753fb543c3d48a419cb609a4e16bc-exec_parsl_function.py exec_parsl_function.py 1
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): infile file-2-095cc85af2d95d8f5616a6fc3fd4fdc4-function function 0
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): infile file-3-1d29dcd5beb008eaec885ab73cdeec47-map map 0
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): infile file-0-e8888e56f9280855b99cae6b997db10e-resource_monitor cctools-monitor 1
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): outfile file-4-1032308906bb3cc0cd92b8cbbabc9c77-result result 0
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): outfile file-6-3df1478119fa57dab607924c29885f71-wq-178502-task-1.summary cctools-monitor.summary 0
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): outfile file-7-1b00a2da53ee6c62ddefcbd9c79f7609-wq-178502-task-1.debug cctools-monitor.debug 0
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): outfile file-8-81656ce91968bd982cca614ca186e13c-wq-178502-task-1.series cctools-monitor.series 0
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): end
2024/07/23 17:26:14.54 work_queue_python[178502] wq: nid002504 (10.253.8.226:37638) busy on 'python3 exec_parsl_function.py map function result'
2024/07/23 17:26:14.54 work_queue_python[178502] wq: Task 1 state change: WAITING (1) to RUNNING (2)
2024/07/23 17:26:14.54 work_queue_python[178502] wq: rx from nid002504 (10.253.8.226:37638): info tasks_running 1
2024/07/23 17:26:14.54 work_queue_python[178502] wq: nid002504 (10.253.8.226:37638) needs file /pfs/lustrep4/scratch/project_465000847/svandenhaute/test_psiflow/test_memory/psiflow_internal/000/CP2K/function_data/0003/function (offset 0 length 131) as 'file-26-125d8a81db4f8eda7857532f22a8127c-function'
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): put file-26-125d8a81db4f8eda7857532f22a8127c-function 9096 0660
2024/07/23 17:26:14.54 work_queue_python[178502] wq: nid002504 (10.253.8.226:37638) received 0.01 MB in 0.00s (20.53s MB/s) average 86.08s MB/s
2024/07/23 17:26:14.54 work_queue_python[178502] wq: nid002504 (10.253.8.226:37638) needs file /pfs/lustrep4/scratch/project_465000847/svandenhaute/test_psiflow/test_memory/psiflow_internal/000/CP2K/function_data/0003/map (offset 0 length 126) as 'file-27-8789ede351dacceb5a004b09ed17dae8-map'
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): put file-27-8789ede351dacceb5a004b09ed17dae8-map 632 0660
2024/07/23 17:26:14.54 work_queue_python[178502] wq: nid002504 (10.253.8.226:37638) received 0.00 MB in 0.00s (1.50s MB/s) average 85.63s MB/s
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): task 4
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): cmd 286
2024/07/23 17:26:14.54 work_queue_python[178502] wq: ./cctools-monitor --no-pprint --with-output-files=cctools-monitor -dall -o cctools-monitor.debug --with-time-series -L 'memory: 64301' -L 'disk: 54650' -L 'gpus: 0' -L 'cores: 32.000' -V 'task_id: 4' -V 'category: parsl-default' --sh "python3 exec_parsl_function.py map function result"
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): category parsl-default
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): cores 32.000
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): gpus 0
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): memory 64301
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): disk 54650
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): infile file-0-d62753fb543c3d48a419cb609a4e16bc-exec_parsl_function.py exec_parsl_function.py 1
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): infile file-26-125d8a81db4f8eda7857532f22a8127c-function function 0
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): infile file-27-8789ede351dacceb5a004b09ed17dae8-map map 0
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): infile file-0-e8888e56f9280855b99cae6b997db10e-resource_monitor cctools-monitor 1
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): outfile file-28-7aa0c8834c6d2a29e265ded0ab463a22-result result 0
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): outfile file-30-53354f8f55c2bbea0fd04bb2cc5f6142-wq-178502-task-4.summary cctools-monitor.summary 0
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): outfile file-31-e30d78557ab0c06ac59906b471cc7c4a-wq-178502-task-4.debug cctools-monitor.debug 0
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): outfile file-32-0bebf9f8832cb2065e004f58c43f3662-wq-178502-task-4.series cctools-monitor.series 0
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): end
2024/07/23 17:26:14.54 work_queue_python[178502] wq: nid002504 (10.253.8.226:37638) busy on 'python3 exec_parsl_function.py map function result'
2024/07/23 17:26:14.54 work_queue_python[178502] wq: Task 4 state change: WAITING (1) to RUNNING (2)
2024/07/23 17:26:14.54 work_queue_python[178502] wq: rx from nid002504 (10.253.8.226:37638): info tasks_running 2
2024/07/23 17:26:14.54 work_queue_python[178502] wq: nid002504 (10.253.8.226:37638) needs file /pfs/lustrep4/scratch/project_465000847/svandenhaute/test_psiflow/test_memory/psiflow_internal/000/CP2K/function_data/0002/function (offset 0 length 131) as 'file-18-f248422e1675eb1c38ec29268d177658-function'
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): put file-18-f248422e1675eb1c38ec29268d177658-function 9096 0660
2024/07/23 17:26:14.54 work_queue_python[178502] wq: nid002504 (10.253.8.226:37638) received 0.01 MB in 0.00s (20.86s MB/s) average 85.25s MB/s
2024/07/23 17:26:14.54 work_queue_python[178502] wq: nid002504 (10.253.8.226:37638) needs file /pfs/lustrep4/scratch/project_465000847/svandenhaute/test_psiflow/test_memory/psiflow_internal/000/CP2K/function_data/0002/map (offset 0 length 126) as 'file-19-e1b318496c60cb42a4767a0d0b3bc285-map'
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): put file-19-e1b318496c60cb42a4767a0d0b3bc285-map 632 0660
2024/07/23 17:26:14.54 work_queue_python[178502] wq: nid002504 (10.253.8.226:37638) received 0.00 MB in 0.00s (1.61s MB/s) average 84.84s MB/s
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): task 3
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): cmd 286
2024/07/23 17:26:14.54 work_queue_python[178502] wq: ./cctools-monitor --no-pprint --with-output-files=cctools-monitor -dall -o cctools-monitor.debug --with-time-series -L 'memory: 64301' -L 'disk: 54650' -L 'gpus: 0' -L 'cores: 32.000' -V 'task_id: 3' -V 'category: parsl-default' --sh "python3 exec_parsl_function.py map function result"
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): category parsl-default
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): cores 32.000
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): gpus 0
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): memory 64301
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): disk 54650
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): infile file-0-d62753fb543c3d48a419cb609a4e16bc-exec_parsl_function.py exec_parsl_function.py 1
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): infile file-18-f248422e1675eb1c38ec29268d177658-function function 0
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): infile file-19-e1b318496c60cb42a4767a0d0b3bc285-map map 0
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): infile file-0-e8888e56f9280855b99cae6b997db10e-resource_monitor cctools-monitor 1
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): outfile file-20-e31e5ee411be512d229b47357b02189a-result result 0
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): outfile file-22-9b599bc38f13b7ded3844d9703da283e-wq-178502-task-3.summary cctools-monitor.summary 0
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): outfile file-23-89606485e8f633cf54d1324696d04bd9-wq-178502-task-3.debug cctools-monitor.debug 0
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): outfile file-24-429b0fd8ebe9585377105919af5d849f-wq-178502-task-3.series cctools-monitor.series 0
2024/07/23 17:26:14.54 work_queue_python[178502] wq: tx to nid002504 (10.253.8.226:37638): end
2024/07/23 17:26:14.54 work_queue_python[178502] wq: nid002504 (10.253.8.226:37638) busy on 'python3 exec_parsl_function.py map function result'
2024/07/23 17:26:14.54 work_queue_python[178502] wq: Task 3 state change: WAITING (1) to RUNNING (2)
2024/07/23 17:26:14.54 work_queue_python[178502] wq: rx from nid002504 (10.253.8.226:37638): info tasks_running 3
2024/07/23 17:26:14.54 work_queue_python[178502] wq: rx from nid002504 (10.253.8.226:37638): info tasks_running 4
2024/07/23 17:26:30.08 work_queue_python[178502] wq: Failed to read from worker nid002504 (10.253.8.226:37638)
2024/07/23 17:26:30.08 work_queue_python[178502] wq: worker nid002504 (10.253.8.226:37638) removed
2024/07/23 17:26:30.08 work_queue_python[178502] wq: Task 1 state change: RUNNING (2) to WAITING (1)
2024/07/23 17:26:30.08 work_queue_python[178502] wq: workers connections -- known: 1, connecting: 0, available: 1.
2024/07/23 17:26:30.08 work_queue_python[178502] wq: Task 2 state change: RUNNING (2) to WAITING (1)
2024/07/23 17:26:30.08 work_queue_python[178502] wq: Task 3 state change: RUNNING (2) to WAITING (1)
2024/07/23 17:26:30.08 work_queue_python[178502] wq: Task 4 state change: RUNNING (2) to WAITING (1)
2024/07/23 17:26:30.08 work_queue_python[178502] tcp: disconnected from 10.253.8.226 port 37638
2024/07/23 17:26:30.08 work_queue_python[178502] wq: 0 workers connected in total now
2024/07/23 17:26:39.09 work_queue_python[178502] rmonitor: monitoring process: 178502
2024/07/23 17:27:09.12 work_queue_python[178502] rmonitor: monitoring process: 178502
2024/07/23 17:27:13.13 work_queue_python[178502] wq: Task 5 state change: UNKNOWN (0) to WAITING (1)
2024/07/23 17:27:13.13 work_queue_python[178502] wq: workers connections -- known: 0, connecting: 0, available: 0.
2024/07/23 17:27:13.13 work_queue_python[178502] wq: Task 6 state change: UNKNOWN (0) to WAITING (1)
2024/07/23 17:27:13.13 work_queue_python[178502] wq: Task 7 state change: UNKNOWN (0) to WAITING (1)

0 replies

svandenhaute · 2024-07-23T15:21:11Z

svandenhaute
Jul 23, 2024
Author

udpate: the ulimit attempt does appear to work now, I think.

0 replies

dthain · 2024-07-23T16:33:31Z

dthain
Jul 23, 2024
Maintainer

Hmm, that is interesting. In the manager log that you shared, it appears that the worker crashes in about 15 seconds. The default measurement interval is every 5 seconds, I believe. It may be that the application "escapes" between the measurement interval.

@btovar I don't recall whether the RM limits virtual memory, resident memory, or something else, can you fill us in?

If I recall, we didn't use setrlimit because it only applies to a single process, not a process tree. Hence the need for the RM to measure every N seconds. That said, it might still make sense for the RM (or the worker) to apply setrlimit since that would be effective for a single process and would provide an outer envelope in the case of multiple processes...

0 replies

svandenhaute · 2024-07-23T16:37:08Z

svandenhaute
Jul 23, 2024
Author

In this particular case, the task is in fact mpirun -np 32 something so if you want me to test whether some limits actually do apply as envelope to the entire task, let me know.

0 replies

btovar · 2024-07-23T16:39:39Z

btovar
Jul 23, 2024
Maintainer

The limits for memory in taskvine and wq are for resident memory. The RM can limit virtual memory, but it cannot be activated through the python API (only cores, resident memory, disk, and gpus).

0 replies

dthain · 2024-07-31T16:08:33Z

dthain
Jul 31, 2024
Maintainer

Coming back to this after a brief break and remembering a few things...

As I recall, the ulimit -v command corresponds to setrlimit(RLIMIT_AS,...) which is capable of enforcing limits on total virtual memory size. However, it does not place any limits on the memory resident set size (RSS). I think this is primarily because the user can control how the program allocates virtual address space, but it is the kernel's choice how much physical memory to give the process.

Now, generally, in the HPC/HTC world, we care about actual physical memory used, and don't care about virtual address space. This is because (almost always) we want to give processes enough physical memory to run at full speed, and we don't want them swapping to virtual memory. If a process is swapping, we (almost always) want to move it to another machine where it has enough memory. Also, a wide variety of programs tend to allocate far more virtual address space than they actually used, due to garbage collection, memory mapped files, and other typical Unix tricks.

Finally, it is quite rare for a machine to reach a true kernel OOM condition, because that requires first exhausting all of physical memory and then allocating all of the disk blocks in swap. The machine typically becomes unusably slow long before reaching that condition.

So: WorkQueue/TaskVine are focused on managing resident physical memory, not virtual address space. And so that's what the resource monitor is intended to measure. (And for that matter, there is no ulimit capability that limits resident physical memory.)

This is all a roundabout way to ask @svandenhaute:

What is it that you actually want to limit? Is your concern about the physical
memory allocated to each task, or the total virtual memory space?

1 reply

svandenhaute Aug 7, 2024
Author

I honestly can't tell. Whichever CP2K was wrongly consuming too much, which I think is virtual memory, because, in the end, I did manage to solve it by executing ulimit -v ... within the container, right before launching mpirun.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WQ] enforce memory limits #3905

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 8 comments 1 reply

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

[WQ] enforce memory limits #3905

svandenhaute Jul 23, 2024

Replies: 8 comments · 1 reply

svandenhaute Jul 23, 2024 Author

dthain Jul 23, 2024 Maintainer

svandenhaute Jul 23, 2024 Author

svandenhaute Jul 23, 2024 Author

dthain Jul 23, 2024 Maintainer

svandenhaute Jul 23, 2024 Author

btovar Jul 23, 2024 Maintainer

dthain Jul 31, 2024 Maintainer

svandenhaute Aug 7, 2024 Author

svandenhaute
Jul 23, 2024

Replies: 8 comments 1 reply

svandenhaute
Jul 23, 2024
Author

dthain
Jul 23, 2024
Maintainer

svandenhaute
Jul 23, 2024
Author

svandenhaute
Jul 23, 2024
Author

dthain
Jul 23, 2024
Maintainer

svandenhaute
Jul 23, 2024
Author

btovar
Jul 23, 2024
Maintainer

dthain
Jul 31, 2024
Maintainer

svandenhaute Aug 7, 2024
Author