SLURM plugin #40

andreww · 2023-05-26T15:50:30Z

At some point it would be nice to use carbon intensity to help schedule tasks on HPC clusters. In principle, the 'backend' of cats could help with this, and the obvious approach is to plug into SLURM somehow. For example, on an underused cluster, it may be best to run user jobs only when carbon intensity is low and let the queue build up when it is high. We would presumably need to build a SLURM plugin (https://slurm.schedmd.com/plugins.html) and work with a team managing a cluster. This issue is to keep track of ideas around this.
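To make the "run only during low carbon intensity times" idea concrete, here is a minimal sketch of picking the lowest-carbon start time for a job of known length from a forecast. It is not taken from the cats code base and is not a SLURM plugin. The function name `lowest_carbon_window`, the list-of-floats forecast, and the fixed-length slots are all assumptions made for illustration.

```python
from typing import Sequence, Tuple


def lowest_carbon_window(forecast: Sequence[float], job_slots: int) -> Tuple[int, float]:
    """Find the contiguous window with the lowest mean carbon intensity.

    forecast  -- hypothetical carbon intensity forecast, one value per
                 fixed-length slot (e.g. gCO2/kWh per half hour)
    job_slots -- how many consecutive slots the job is expected to run for

    Returns (index of the best starting slot, mean intensity over the window).
    """
    if job_slots <= 0 or job_slots > len(forecast):
        raise ValueError("job_slots must be between 1 and len(forecast)")

    # Sliding-window sum: each step adds the slot entering the window
    # and subtracts the slot leaving it.
    window_sum = sum(forecast[:job_slots])
    best_start, best_sum = 0, window_sum
    for start in range(1, len(forecast) - job_slots + 1):
        window_sum += forecast[start + job_slots - 1] - forecast[start - 1]
        if window_sum < best_sum:
            best_start, best_sum = start, window_sum

    return best_start, best_sum / job_slots


if __name__ == "__main__":
    # Made-up forecast values, purely for illustration.
    example_forecast = [300, 250, 100, 120, 280, 310]
    print(lowest_carbon_window(example_forecast, job_slots=2))
```

A scheduler-side integration would additionally have to weigh this against queue length and machine utilisation, which this sketch ignores.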

sadielbartholomew · 2023-09-12T18:01:06Z

On this topic, a colleague of mine @dlrhodson has made a nice suggestion:

I wonder if this [CATS] could be used to nudge HPC scheduling peaks to low CO2 intensity periods? I only just discovered Archer logs the energy consumed per job: https://docs.archer2.ac.uk/user-guide/energy/ !
...
I guess there's a bit of tension with machine utilization, but I bet that there is a peak in jobs that is somewhere in 9-5pm, just because that is when folk submit tasks - hence a weak coupling with mean daily working patterns - if a scheduler like CATS could nudge this peak to a lower emissions time, it could have a big effect?

For essential context, ARCHER2 uses SLURM as its scheduler, so the plugin discussed here would be a means towards this. Of note (quoted from the link above):

Energy usage for a particular job may be obtained using the sacct command

and also:

On compute nodes, the raw energy counters and instantaneous power draw data are available at: /sys/cray/pm_counters

so the information needed for the --jobinfo option is readily available, provided we can interface CATS with wherever that data is stored.
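As a rough sketch of what such an interface could look like, the snippet below reads per-job energy from sacct and turns it into an emissions estimate. It assumes the cluster has energy accounting enabled and that `sacct --parsable2 --noheader --format=JobID,ConsumedEnergyRaw` reports joules. The helper names (`parse_consumed_energy`, `query_job_energy`, `estimate_emissions_g`) are hypothetical and not part of CATS, and which row (the job allocation or its steps) carries the total may differ between sites.

```python
import subprocess
from typing import Dict

JOULES_PER_KWH = 3.6e6


def parse_consumed_energy(sacct_output: str) -> Dict[str, float]:
    """Parse 'JobID|ConsumedEnergyRaw' lines into {job_or_step_id: joules}.

    Rows with an empty or non-numeric energy field are skipped.
    """
    energy: Dict[str, float] = {}
    for line in sacct_output.splitlines():
        if not line.strip():
            continue
        job_id, _, raw = line.partition("|")
        try:
            energy[job_id.strip()] = float(raw)
        except ValueError:
            continue  # no energy recorded for this row
    return energy


def query_job_energy(job_id: str) -> Dict[str, float]:
    """Run sacct for one job; only works where SLURM accounting is available."""
    result = subprocess.run(
        ["sacct", "-j", job_id, "--parsable2", "--noheader",
         "--format=JobID,ConsumedEnergyRaw"],
        capture_output=True, text=True, check=True,
    )
    return parse_consumed_energy(result.stdout)


def estimate_emissions_g(joules: float, intensity_g_per_kwh: float) -> float:
    """Convert energy in joules to grams of CO2 at a given carbon intensity."""
    return joules / JOULES_PER_KWH * intensity_g_per_kwh


if __name__ == "__main__":
    # Made-up sacct output, purely for illustration.
    sample = "123456|7200000\n123456.batch|\n123456.0|3600000\n"
    per_row = parse_consumed_energy(sample)
    print(per_row)
    print(estimate_emissions_g(per_row["123456"], intensity_g_per_kwh=200.0))
```

Keeping the parsing separate from the sacct call means the same code could be fed from a stored accounting dump instead of a live query.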

colinsauze · 2023-09-12T18:22:58Z

We've been talking with the SSI about trying to find some target HPC systems to do exactly this.
I'm not so sure about a 9-5 peak though: most HPC systems run near 100% load most of the time, and many jobs last long enough to keep them busy all night. We did some analysis on this in Supercomputing Wales and found the system was quietest from Sunday evening to the middle of Monday, as most of the jobs submitted on Friday had finished by Sunday evening and people took a few hours on Monday morning to start submitting new jobs.

Our ideal target system might be something that's a bit less popular, such as a departmental cluster or a high-throughput system.

colinsauze · 2024-02-26T19:35:47Z

Setting this up as an issue to cover WP2 in the CATSv2 project.

colinsauze added this to the Initial sift of ideas for how to interact with Slurm milestone Feb 26, 2024

colinsauze assigned abhidg, colinsauze, tonymugen, tlestang, ljcolling, Llannelongue and sadielbartholomew Feb 26, 2024

colinsauze added the catsV2 Slurm enabled CATS label Feb 26, 2024

colinsauze mentioned this issue Feb 26, 2024

Implement Slurm Plugin #72

Open
