[WQ] enforce memory limits #3905
-
I've figured out a solution: cgroups. In my particular case, it was especially easy. All tasks are executed within a container (apptainer or singularity), and the runtimes come with a built-in option for imposing a memory limit on the container. I would close the issue, but perhaps this should be transformed into a discussion in order to keep it around.
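For reference, a rough sketch of the same idea with raw cgroup v2, for setups without a container runtime. This assumes root access and a cgroup2 mount at /sys/fs/cgroup; the cgroup name, the ./my_task command, and the 512 MiB cap are placeholders, not anything from this thread:

```python
import os
import subprocess

CG = "/sys/fs/cgroup/wq_task_demo"   # hypothetical cgroup name

os.makedirs(CG, exist_ok=True)
with open(os.path.join(CG, "memory.max"), "w") as f:
    f.write(str(512 * 1024 * 1024))  # hard resident-memory cap: 512 MiB
with open(os.path.join(CG, "memory.swap.max"), "w") as f:
    f.write("0")                     # forbid spilling into swap

def enter_cgroup():
    # runs in the forked child before exec; writing "0" moves the
    # writing process itself into the cgroup
    with open(os.path.join(CG, "cgroup.procs"), "w") as f:
        f.write("0")

proc = subprocess.Popen(["./my_task"], preexec_fn=enter_cgroup)
proc.wait()  # an OOM kill here is confined to this cgroup, not the worker
```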
-
If a container solves your problem nicely, that's great. For situations where a container cannot be deployed, we also have the Resource Monitor that can be enabled in WQ to observe, enforce, and report resource utilization.
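A minimal sketch of what that looks like with the work_queue Python bindings (module name and exact signatures vary across cctools releases; the port, command, and 2 GB limit here are placeholders):

```python
import work_queue as wq   # newer cctools: from ndcctools import work_queue as wq

q = wq.WorkQueue(port=9123)
q.enable_monitoring()              # watchdog mode: the worker kills tasks
                                   # that exceed their declared resources

t = wq.Task("./my_app input.dat")  # hypothetical command
t.specify_cores(1)
t.specify_memory(2048)             # resident-memory limit, in MB

q.submit(t)
while not q.empty():
    t = q.wait(5)
    if t:
        # RESOURCE_EXHAUSTION means the monitor killed the task,
        # leaving the worker and its other tasks running
        print(t.result == wq.WORK_QUEUE_RESULT_RESOURCE_EXHAUSTION,
              t.resources_measured.memory)
```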
-
The container was supposed to solve the problem, but only very recent versions of the runtime support that memory option. The next thing I tried was setting a memory limit on the task itself. When enabling the resource monitor, the task still crashes the worker, instead of the worker killing the task...
-
Update: the …
-
Hmm, that is interesting. In the manager log that you shared, it appears that the worker crashes in about 15 seconds. The default measurement interval is every 5 seconds, I believe. It may be that the application "escapes" between measurements. @btovar I don't recall whether the RM limits virtual memory, resident memory, or something else; can you fill us in? If I recall, we didn't use …
-
In this particular case, the task is in fact …
-
The limits for memory in taskvine and wq are for resident memory. The RM can limit virtual memory, but it cannot be activated through the python API (only cores, resident memory, disk, and gpus). |
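Since the Python API does not expose a virtual-memory limit, one hypothetical workaround is to wrap the task's command with the resource_monitor executable and pass an explicit limits string. The -L flag and the "virtual_memory" key (in MB) follow my reading of the resource_monitor manual and should be verified against your cctools version:

```python
import work_queue as wq

q = wq.WorkQueue(port=9123)

# Wrap the (placeholder) command so the monitor itself enforces the
# virtual-memory limit, independently of what the Python API exposes.
cmd = 'resource_monitor -O rm-log -L "virtual_memory: 4096" -- ./my_app'
t = wq.Task(cmd)
q.submit(t)
```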
-
Coming back to this after a brief break and remembering a few things... As I recall, the …

Now, generally, in the HPC/HTC world, we care about actual physical memory used, and don't care about virtual address space. This is because (almost always) we want to give processes enough physical memory to run at full speed, and we don't want them swapping to virtual memory. If a process is swapping, we (almost always) want to move it to another machine where it has enough memory. Also, a wide variety of programs tend to allocate far more virtual address space than they actually use, due to garbage collection, memory-mapped files, and other typical Unix tricks. Finally, it is quite rare for a machine to reach a true kernel OOM condition, because that requires first exhausting all of physical memory and then allocating all of the disk blocks in swap. The machine typically becomes unusably slow long before reaching that condition.

So: WorkQueue/TaskVine are focused on managing resident physical memory, not virtual address space. And so that's what the resource monitor is intended to measure. (And for that matter, there is no standard setrlimit-style limit that enforces resident memory; those limits apply to virtual address space.)

This is all a roundabout way to ask @svandenhaute: What is it that you actually want to limit? Is your concern about the physical memory consumed by the task?
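A small experiment in plain Python (independent of WQ) makes the resident-vs-virtual distinction concrete: RLIMIT_AS caps virtual address space, so reserving a large sparse mapping fails even though almost no physical pages would be touched:

```python
import mmap
import resource

GiB = 1024 ** 3
# Cap this process's virtual address space at 1 GiB.
resource.setrlimit(resource.RLIMIT_AS, (1 * GiB, 1 * GiB))

try:
    # 4 GiB of address space, essentially zero resident memory.
    region = mmap.mmap(-1, 4 * GiB)
except OSError as e:
    print("refused by RLIMIT_AS:", e)
```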
-
I'm dealing with a situation whereby a specific task requires (somewhat unexpectedly) much more memory than it should. Because of this, it triggers an OOM which completely kills the work_queue_worker process. Is there a way to make this proceed more gracefully? Ideally, the OOM in one task does not kill all other tasks which are running on that particular worker...