Scheduler & Resource Management
Design and tuning of SLURM and other scheduler configurations for fairness, utilisation and user experience.
Schedulers are where cluster policy becomes visible to users. This service covers partition design, QoS classes, fairshare policies, reservations and preemption strategies that strike a balance between high utilisation and predictable behaviour.
We review current job size distributions, queue times and slowdown metrics (turnaround time relative to run time) to detect pathologies such as “elephant” jobs blocking everything else or short jobs stuck behind long-running tasks. Based on this analysis, we propose changes to partitions, time limits, priority formulas and preemption rules.
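As an illustration of the slowdown metric mentioned above, the following is a minimal sketch rather than our actual tooling. It computes bounded slowdown per job and averages it separately for short and long jobs. The `Job` record, the 10-second floor `tau_s` and the one-hour `short_limit_s` threshold are assumptions chosen for the example; a real analysis would read these values from the scheduler's accounting data.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Job:
    wait_s: float  # seconds spent queued before the job started
    run_s: float   # seconds the job actually ran


def bounded_slowdown(job: Job, tau_s: float = 10.0) -> float:
    """Turnaround time divided by run time, with run time floored at tau_s.

    The floor keeps very short jobs from producing huge ratios.
    """
    return max(1.0, (job.wait_s + job.run_s) / max(job.run_s, tau_s))


def mean_slowdown_by_class(jobs, short_limit_s: float = 3600.0):
    """Average bounded slowdown for short and long jobs separately."""
    short = [bounded_slowdown(j) for j in jobs if j.run_s <= short_limit_s]
    long_ = [bounded_slowdown(j) for j in jobs if j.run_s > short_limit_s]
    return {
        "short": mean(short) if short else None,
        "long": mean(long_) if long_ else None,
    }


# Hypothetical sample: one short job that waited an hour, two long jobs.
sample_jobs = [Job(3600, 600), Job(0, 7200), Job(7200, 72000)]
report = mean_slowdown_by_class(sample_jobs)
```

A short-job average far above the long-job average is the kind of signal that suggests short work is stuck behind long-running tasks.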
We also introduce or refine accounting and reporting so that management and users see a transparent picture of how resources are being used.
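A usage report of this kind can be sketched as follows. This is a simplified example under stated assumptions, not a description of any particular accounting system: the record layout `(account, allocated_cores, elapsed_seconds)` and the account names are hypothetical, and real records would come from the scheduler's accounting database.

```python
from collections import defaultdict


def core_hours_by_account(records):
    """Sum core-hours per account.

    records: iterable of (account, allocated_cores, elapsed_seconds).
    """
    totals = defaultdict(float)
    for account, cores, elapsed_s in records:
        totals[account] += cores * elapsed_s / 3600.0
    return dict(totals)


def usage_shares(totals):
    """Each account's fraction of the total core-hours consumed."""
    grand_total = sum(totals.values())
    return {
        account: (hours / grand_total if grand_total else 0.0)
        for account, hours in totals.items()
    }


# Hypothetical job records for two accounts.
sample_records = [
    ("physics", 64, 7200),
    ("bio", 16, 3600),
    ("physics", 32, 3600),
]
totals = core_hours_by_account(sample_records)
shares = usage_shares(totals)
```

Reporting shares alongside absolute core-hours lets management compare actual consumption against each group's intended allocation.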
Case study – Cleaning up a congested queue
A cluster had become dominated by a few very large jobs that routinely blocked smaller, urgent work. Users complained that development and testing were almost impossible during peak periods.
By splitting partitions by purpose, introducing a dedicated short queue with higher priority and implementing controlled preemption, we restored a healthy mix of jobs. Short experiments could complete quickly, while long production runs were still supported but no longer monopolised the system.
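To show how a higher-priority short queue changes job ordering, here is a toy sketch of a weighted-sum priority, the general approach used by multifactor priority schemes. The weights, factor names and factor values below are hypothetical and do not reproduce the cluster's actual configuration.

```python
def job_priority(factors, weights):
    """Weighted sum of priority factors, each expected in [0.0, 1.0]."""
    return sum(weights[name] * factors.get(name, 0.0) for name in weights)


# Hypothetical weights: the partition factor dominates age and fairshare.
WEIGHTS = {"age": 1000, "fairshare": 2000, "partition": 5000}

# A long job that has waited a while in a low-priority partition.
long_job = {"age": 0.9, "fairshare": 0.2, "partition": 0.2}
# A freshly submitted job in the dedicated short partition.
short_job = {"age": 0.1, "fairshare": 0.5, "partition": 1.0}

long_priority = job_priority(long_job, WEIGHTS)
short_priority = job_priority(short_job, WEIGHTS)
```

With these illustrative numbers the short job outranks the long one despite having waited far less, which is the intended effect of giving the short queue a higher partition priority. Tight time limits on that queue keep it from being used to jump ahead with long work.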