[WQ] how does a worker determine what cores are used for which tasks? #3886
Replies: 18 comments
-
Work Queue (and TaskVine) do not currently bind tasks to specific cores. As you note, they simply ensure that the total number of cores is respected, and let the OS take care of the assignments in the usual way. This is in part because tasks are often not single processes but complex process trees. GPUs are different because they are not really managed in any effective way by the OS: unless otherwise instructed, applications tend to use whatever GPUs they can get their hands on, and so we have to make specific assignments to avoid chaos. Is there a specific reason that you would like a specific binding to be done?
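For reference, this is easy to confirm from inside a running task: with no explicit binding, the process is typically allowed on every core of the node. A minimal check (Linux-only, since `os.sched_getaffinity` is Linux-specific):

```python
# Run inside a task: with no core binding, the set below normally contains
# every core on the node, and the kernel migrates threads among them freely.
import os

allowed = os.sched_getaffinity(0)   # cores this process may run on (Linux)
print(f"allowed cores: {sorted(allowed)} ({len(allowed)} of {os.cpu_count()})")
```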
-
I'm in the situation where a single task consists of running a number of separate Python processes in parallel in the background, each of which loads PyTorch and performs CPU and/or GPU calculations:

python client.py --device=0 &
python client.py --device=1 &
python client.py --device=2 &
...

I've noticed that the CPU performance of each client degrades severely the more clients are present, even though I make sure that the ratio (ncores / nclients) is constant. Something inefficient is therefore going on, but there could be a number of reasons. PyTorch's performance tuning guide mentions that for efficient multithreading it's important to set certain environment variables (e.g. `OMP_NUM_THREADS`). Since the GPU performance is fine, I suspect that it has to do with processes/threads interfering with each other. Do you have suggestions on how to tackle this?

At any given moment, it is possible that a single worker needs to execute multiple tasks. For example, if the worker is spawned on a compute node with 64 cores and 8 GPUs, it may need to execute two tasks, each of which requires four clients to be spawned.
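As an illustration of the kind of per-client thread limiting the tuning guide recommends, here is a minimal sketch; the value of `threads_per_client` and the placement at the top of `client.py` are assumptions, not part of the actual setup:

```python
# Hypothetical start of client.py: cap the thread pools *before* the heavy
# libraries initialize them, so that N clients on one node do not each
# create one worker thread per physical core.
import os

threads_per_client = 4                          # assumed share: ncores / nclients
os.environ["OMP_NUM_THREADS"] = str(threads_per_client)
os.environ["MKL_NUM_THREADS"] = str(threads_per_client)

import torch                                    # import after the env vars are set

torch.set_num_threads(threads_per_client)       # intra-op CPU thread pool
print("torch threads:", torch.get_num_threads())
```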
-
Hmm, that's a tricky one. One possible problem is this: if each task is attempting to use all the available cores on the machine, then running more clients will just slow things down. In general, we try to force/encourage tasks to only use the resources that they have been assigned. For example, WQ automatically sets environment variables such as `OMP_NUM_THREADS` to match the cores assigned to each task. But there are a number of ways that apps and libraries may escape or ignore those controls, so it may be necessary to "try harder" to get them to respect the assigned resources. @btovar, could you give us some suggestions here on how to instrument the tasks and determine whether they are using the assigned number of cores?
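One cheap way to start instrumenting a task is to have it print what it was handed. A sketch; the variable names below are common examples, not an exhaustive or WQ-specific list:

```python
# Print the thread-related settings visible inside the task, to check what
# the worker (or the batch system) actually exported.
import os

for var in ("CORES", "OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
    print(f"{var} = {os.environ.get(var, '<unset>')}")
```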
-
As a comparison, Parsl's high throughput executor can pin specific workers to specific cores. The use case there is not about oversubscribing the CPU / controlling the number of concurrent processes/threads, but about pinning specific tasks to specific cores to avoid migrating processes between cores: the user's work processes are pinned to a set of cores chosen by the Parsl high throughput executor worker pool. Having Parsl's worker pool choose the pins means that the pins don't overlap with other work, which Parsl pins to different cores. (That's basically the same as GPU handling, which is always this explicit, I think.) In Parsl land, that's especially something that people running on ALCF's Polaris seem interested in (to do with locality of things that I don't know much about), but I've also personally wanted to do it on NERSC machines.
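To make the mechanism concrete, here is a rough sketch of that model (not Parsl's actual code): a pool that gives each worker process a disjoint block of cores and pins it there.

```python
# Illustration only: each worker process gets its own block of cores and is
# pinned there with sched_setaffinity, so its work never migrates into a
# block owned by another worker.
import os
import multiprocessing as mp

def worker(worker_id, cores):
    os.sched_setaffinity(0, cores)          # pin this process (and its children)
    print(f"worker {worker_id} pinned to {sorted(cores)}")
    # ... pull tasks and run them here ...

if __name__ == "__main__":
    n_workers, cores_per_worker = 4, 2
    for i in range(n_workers):
        block = set(range(i * cores_per_worker, (i + 1) * cores_per_worker))
        mp.Process(target=worker, args=(i, block)).start()
```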
-
(More generally, in my head there are fungible and non-fungible things to allocate. The fungible ones we can just count (the "don't overload the CPU" approach to CPU allocation). The non-fungible ones we have to name and allocate: GPUs, for example, and cores-that-we-want-to-pin in this issue; in Parsl we also do some MPI rank allocation where particular MPI ranks need to be assigned to different tasks.)
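A toy sketch of the distinction (all names here are illustrative, not from any of the codebases mentioned):

```python
# Fungible resources are just a counter; non-fungible ones are named slots
# that must be handed out and returned individually.
class NodeResources:
    def __init__(self, cores, gpu_ids):
        self.free_cores = cores             # fungible: only the count matters
        self.free_gpus = set(gpu_ids)       # non-fungible: identity matters

    def allocate(self, ncores, ngpus):
        assert self.free_cores >= ncores and len(self.free_gpus) >= ngpus
        self.free_cores -= ncores
        gpus = {self.free_gpus.pop() for _ in range(ngpus)}
        return gpus                         # the task must be told *which* GPUs

    def release(self, ncores, gpus):
        self.free_cores += ncores
        self.free_gpus |= set(gpus)
```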
-
Since cores are not oversubscribed, I wonder if we can do this entirely from the worker's side, at the point where the worker launches each task.
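A sketch of what that could look like (the function name and the hard-coded core set are made up for illustration; this is not how the WQ worker is implemented): the worker picks a free block of cores and pins the task's whole process tree to it at launch.

```python
# The affinity set in preexec_fn (which runs in the child between fork and
# exec) is inherited by everything the task subsequently starts.
import os
import subprocess

def launch_pinned(command, cores):
    return subprocess.Popen(
        command,
        shell=True,
        preexec_fn=lambda: os.sched_setaffinity(0, cores),
    )

proc = launch_pinned("python client.py --device=0", cores={0, 1, 2, 3})
proc.wait()
```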
-
@btovar in parsl this happens entirely in the worker pool, so that means "yes"?
-
Hold on, I think we are moving toward a solution without first fully understanding the problem. Sander is telling us that adding more CPU-bound processes slows the work down. That at least sounds like the CPUs are oversubscribed by accident. (And affinity doesn't solve that problem.) I think what we need to do first is measure what those tasks are actually doing. BenT, can you suggest the best way for Sander to observe how many cores are actually used by each task?
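For a quick external measurement, independent of WQ's own tooling, something like the following sketch works; it assumes `psutil` is installed and that you know the PID of the task's top-level process (the 12345 below is a placeholder):

```python
# Sample the CPU usage of a task's whole process tree; the result is roughly
# the number of cores kept fully busy over the sampling interval.
import time
import psutil

def cores_in_use(pid, interval=1.0):
    root = psutil.Process(pid)
    procs = [root] + root.children(recursive=True)
    for p in procs:
        p.cpu_percent(None)                 # prime the per-process counters
    time.sleep(interval)
    busy = sum(p.cpu_percent(None) for p in procs if p.is_running())
    return busy / 100.0

print("cores in use:", cores_in_use(12345))  # 12345 = placeholder task PID
```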
-
We can give something like this a try, assuming that the `resource_monitor` executable is available at the worker:

resource_monitor -Omon.summary python client.py --device=0

At the end of execution, the monitor writes a summary with the resources the command actually used.
-
Also, this just occurred to me. Suppose that you define a task that requests some number of cores; the worker then sets `OMP_NUM_THREADS` to that full count for the task as a whole. However, if your task consists of a shell script that runs multiple processes like you indicated, then each of those three processes will read the same `OMP_NUM_THREADS` and create that many threads, so together they oversubscribe the cores assigned to the task.
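The same arithmetic at the scale of the 64-core node mentioned earlier (the numbers are only an example) shows how quickly this blows up:

```python
# Example only: if the task-level setting says 64 threads and the script
# starts 8 clients, every client builds its own 64-thread pool.
cores_assigned = 64
n_clients = 8
total_threads = n_clients * cores_assigned
print(total_threads, "threads competing for", cores_assigned, "cores")  # 512 vs 64
```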
-
This should be pretty easy to measure. I suggested to @svandenhaute elsewhere to try a manual setup of what a node would look like with CPU pinning (not using WQ); in that setup, looking at who is trying to use how much CPU should also be visible.
-
Thanks for all the suggestions; I am now recreating a manual example outside of WQ.

Very valid point. I normally print the number of threads PyTorch uses (via `torch.get_num_threads()`).
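Both numbers are worth printing together at client start-up, since a mismatch between them is exactly the kind of thing that causes interference; a small sketch:

```python
# Print how many CPU threads PyTorch created versus how many cores this
# process is actually allowed to run them on.
import os
import torch

print("torch intra-op threads:", torch.get_num_threads())
print("allowed cores         :", len(os.sched_getaffinity(0)))
```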
-
summary_8client.txt

This is what I get from `resource_monitor`.
-
I have not tried to properly set e.g. OMP_NUM_THREADS to N and override SLURM's default of 8N when asking for a job with 1 task and 8N CPUs per task (mimicking the actual scenario that will happen within the WQ/Parsl framework).
-
The wall time seems close: 102 s for 1 client and 129 s for 8 clients. However, the CPU time for 8 clients is much larger (~30 s vs ~157 s). If they are performing the same kind of work, is this the overhead you were talking about?
-
How is the CPU time computed? Is it the cumulative time that any of the threads in the process was actually being executed on any of the cores? In that case, yes, that overhead is nonsensical. When printing CPU affinities, it seems that SLURM's native setup ensures that the 1-client process gets bound to 7x2 = 14 hyperthreads, whereas in the case of 8 clients the affinity is simply not set (i.e. it looks like each client is bound to all available hyperthreads on the node).
-
This was indeed the issue -_- Thanks for the help!
-
Glad you were able to work it out. I'm going to change this issue over to a "Discussion" in case others run into a similar issue.
-
WQ has specific functionality which assigns GPUs to tasks, and it keeps track of which GPUs are assigned to which tasks. Does something similar exist for CPUs?
For example, consider a worker with 64 cores which suddenly receives two tasks of 32 cores each. It knows that it has just enough resources to start both tasks, so it will start both. How do the 64 cores get distributed between the tasks? Specifically, within a task, what would be the output of e.g.
python -c 'import os; print(os.sched_getaffinity(0))'
Suppose that these tasks benefit greatly from process / thread binding to cores (e.g. PyTorch). How should the binding be performed in this case? It is unknown which cores actually belong to which task.
For GPUs, this is conveniently handled by WQ itself, since it just renumbers the available GPUs per task starting at 0, so tasks can assume GPU 0 is available (if only one is requested), or GPUs 0 and 1 are available (if two are requested), etc. (see the short sketch at the end of this post).
(If I read the code correctly, this is mostly implemented in work_queue_gpus.c.)
tagging @benclifford because this came up in a discussion with him
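The sketch referred to above: because the GPUs visible to a task are renumbered starting at 0, a task that requested one GPU can simply use device 0 without knowing which physical GPU it was given. The PyTorch usage here is just an illustration, assuming a CUDA-enabled build:

```python
# Inside a task that requested a single GPU: device 0 is whatever physical
# GPU the worker assigned to this task.
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
x = torch.randn(8, 16, device=device)
print("running on", device)
```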