SLURM plugin #40

Open
andreww opened this issue May 26, 2023 · 3 comments
Labels
catsV2 Slurm enabled CATS

Comments

@andreww
Collaborator

andreww commented May 26, 2023

At some point it would be nice to use carbon intensity to help schedule tasks on HPC clusters. In principle the 'backend' of cats could help with this, and the obvious approach is to plug into SLURM somehow. For example, on an under-used cluster it may be best to run user jobs only during low-carbon-intensity periods and let the queue build up when carbon intensity is high. We would presumably need to build a SLURM plugin (https://slurm.schedmd.com/plugins.html) and work with a team managing a cluster. This issue is to keep track of ideas around this.
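To make the idea concrete, here is a minimal sketch of the kind of hold/release decision such a plugin might make. All names and the threshold value are assumptions for illustration, not part of CATS or SLURM; in practice the intensity figure would come from a forecast source such as the carbon intensity API that CATS already queries.

```python
# Hypothetical sketch: decide whether queued jobs should be held back
# until grid carbon intensity drops below some threshold.
# The function names and the 200 gCO2/kWh threshold are illustrative
# assumptions, not part of CATS or the SLURM plugin API.

def should_hold_job(intensity_gco2_per_kwh: float,
                    threshold_gco2_per_kwh: float = 200.0) -> bool:
    """Return True if a job should stay queued until intensity drops."""
    return intensity_gco2_per_kwh > threshold_gco2_per_kwh


def jobs_to_release(held_jobs, intensity_gco2_per_kwh: float,
                    threshold_gco2_per_kwh: float = 200.0):
    """Return the held jobs that can be released at the current intensity."""
    if should_hold_job(intensity_gco2_per_kwh, threshold_gco2_per_kwh):
        return []
    return list(held_jobs)
```

A real plugin would wire this decision into SLURM's scheduling hooks rather than a standalone loop, and would likely use a forecast window rather than the instantaneous value, but the core policy is just a comparison like the one above.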

@sadielbartholomew
Member

On this topic, a colleague of mine @dlrhodson has made a nice suggestion:

I wonder if this [CATS] could be used to nudge HPC scheduling peaks to low CO2 intensity periods? I only just discovered Archer logs the energy consumed per job: https://docs.archer2.ac.uk/user-guide/energy/ !
...
I guess there's a bit of tension with machine utilization, but I bet there is a peak in jobs somewhere between 9am and 5pm, just because that is when folk submit tasks - hence a weak coupling with mean daily working patterns. If a scheduler like CATS could nudge this peak to a lower-emissions time, it could have a big effect?

For essential context: ARCHER2 uses SLURM as its scheduler, so the plugin proposed here would be a means towards this. Of note (quoted from the link above):

Energy usage for a particular job may be obtained using the sacct command

and also:

On compute nodes, the raw energy counters and instantaneous power draw data are available at: /sys/cray/pm_counters

so the information for --jobinfo is readily available, if we can interface between the storage of that data and CATS.
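As a sketch of that interface, the per-job energy figures could be pulled from `sacct` output and turned into a structure CATS could consume. The command shape and sample output below are illustrative assumptions based on the quoted ARCHER2 docs (Slurm's `ConsumedEnergyRaw` field reports joules); the parsing helper is hypothetical, not an existing CATS function.

```python
# Hypothetical sketch: parse output resembling
#   sacct -j <jobid> --noheader --format=JobID,ConsumedEnergyRaw
# into a mapping of job/step IDs to energy in joules.
# The sample output is made up for illustration.

def parse_sacct_energy(sacct_output: str) -> dict:
    """Map job/step IDs to consumed energy in joules."""
    energy = {}
    for line in sacct_output.strip().splitlines():
        fields = line.split()
        # Expect exactly "JobID ConsumedEnergyRaw"; skip anything else.
        if len(fields) == 2 and fields[1].isdigit():
            energy[fields[0]] = int(fields[1])
    return energy


sample = """
123456        3600000
123456.batch  3599000
"""
print(parse_sacct_energy(sample))
```

Pairing figures like these with the carbon intensity at the time each job ran is the kind of interfacing the comment above has in mind.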

@colinsauze
Member

We've been talking with the SSI about trying to find some target HPC systems to do exactly this.
I'm not so sure about a 9-5 peak, though: most HPC systems run near 100% load most of the time, and many jobs last long enough to keep them busy all night. We did some analysis on this in Supercomputing Wales and found the system was quietest from Sunday evening to the middle of Monday, as most of the jobs submitted on Friday had finished by Sunday evening and people took a few hours on Monday morning to start submitting new jobs.

Our ideal target system might be something that's a bit less popular, more likely to be things like departmental clusters or high throughput systems.

@colinsauze
Member

Setting this up as an issue to cover WP2 in the CATSv2 project.
