What is QOS (quality of service)?
At CCR we use the quality of service (QOS) feature of the SLURM scheduler in two ways. The first is to limit the number of jobs a user can submit. To keep things fair for everyone, a user can have at most 1000 jobs in the general-compute and scavenger partitions at one time. We limit jobs in the largemem, gpu, and viz partitions to 32, and the most you can submit to the debug partition is 4. The second is to offer a priority boost to those who financially support our center.
In practice this is no different from how things worked before, except that all of these limits used to be set on associations. By moving them to QOS, we have eliminated 75% of our associations. It also gives us the flexibility to use QOS in other ways in the future. The list below shows all of the QOS values available on CCR clusters. Your account has been given access to the ones we believe you should have. If you find an error, please submit a help ticket.
List of CCR QOS settings:
debug = ub-hpc cluster, debug partition, maxSubmitJobsPerUser=4 - available to academic users only
gpu = ub-hpc cluster, gpu partition, maxSubmitJobsPerUser=4, maxNumberNodes=1 - available to academic users only
gpu2 = ub-hpc cluster, gpu partition, maxSubmitJobsPerUser=4, maxNumberNodes=4 - available to academic users only with special permission
general-compute = ub-hpc cluster, general-compute partition, maxSubmitJobsPerUser=1000 - available to academic users only
largemem = ub-hpc cluster, largemem partition, maxSubmitJobsPerUser=32 - available to academic users only
skylake = ub-hpc cluster, skylake partition, maxSubmitJobsPerUser=1000 - available to academic users only
viz = available only through the OnDemand Portal - NOTE: any jobs submitted directly to this partition will be rejected. You must use the portal.
industry = industry cluster, industry partition, maxSubmitJobsPerUser=1000 - available to industry users only
supporters = priority boost available to financial contributors to CCR (How do I become a CCR supporter?)
scavenger = includes all scavenger partitions in the industry and faculty clusters - available to academic users only
Various QOS values in the faculty cluster - These are private faculty cluster QOS settings and only available to those people in the faculty group. maxSubmitJobsPerUser=1000 on all partitions
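As an illustration of how one of the QOS values above is requested, here is a minimal batch script sketch. It assumes the QOS has to be named explicitly alongside the cluster and partition; the resource requests, time limit, and job name are placeholders chosen for the example, not CCR recommendations.

```shell
#!/bin/bash
# Hypothetical example job: the cluster, partition, and QOS names come from the list above,
# every other value is a placeholder.
#SBATCH --clusters=ub-hpc
#SBATCH --partition=general-compute
#SBATCH --qos=general-compute
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:10:00
#SBATCH --job-name=qos-example

# SLURM_JOB_ID and SLURM_JOB_QOS are set by Slurm inside a running job.
# The defaults let the script also run outside of Slurm for a quick syntax check.
summary="Job ${SLURM_JOB_ID:-none} is running under QOS ${SLURM_JOB_QOS:-none}"
echo "$summary"
```

Saved as, say, qos-example.sh, the script would be submitted with sbatch qos-example.sh.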
You can see which QOS settings you have access to by running the slimit command.
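As a rough sketch, the check can be wrapped in a small shell function. slimit is the CCR command named above, called here with no arguments (an assumption about how it is used). The sacctmgr fallback is a standard Slurm command that lists the QOS values attached to your associations. show_my_qos is a hypothetical helper name.

```shell
# Hypothetical helper: print the QOS values available to the current user.
show_my_qos() {
    me="${USER:-$(id -un)}"
    if command -v slimit >/dev/null 2>&1; then
        # CCR-provided command (assumed to need no arguments).
        slimit
    elif command -v sacctmgr >/dev/null 2>&1; then
        # Standard Slurm: one row per association, with its QOS list.
        sacctmgr show associations user="$me" format=Cluster,Account,User,Partition,QOS
    else
        echo "Slurm client tools not found; run this on a cluster login node" >&2
        return 1
    fi
}
```

Running show_my_qos on a login node prints whatever the first available command reports.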
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
You will get this error if you have reached the limits described above. For example, if you have 1000 jobs in the general-compute partition and try to submit another one, you will get this error. Wait for some of your jobs to finish, then submit more.
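To check how close you are to a submit limit before sending more work, something like the following sketch may help. It assumes the limit counts every pending and running job, with each job array element counted separately, and that squeue is querying the cluster you intend to submit to. count_my_jobs and submit_headroom are hypothetical helper names.

```shell
# Count the current user's jobs in one partition, one line per job or array element.
count_my_jobs() {
    partition="$1"
    me="${USER:-$(id -un)}"
    squeue --noheader --array --user="$me" --partition="$partition" --format=%i \
        | wc -l | tr -d ' '
}

# Pure arithmetic: how many more jobs fit under a per-user submit limit.
submit_headroom() {
    limit="$1"
    current="$2"
    left=$((limit - current))
    if [ "$left" -lt 0 ]; then
        left=0
    fi
    echo "$left"
}
```

For the general-compute limit of 1000, submit_headroom 1000 "$(count_my_jobs general-compute)" prints the number of jobs you can still submit.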
Questions or problems?
If you believe this is an error, first check what access you have been given in ColdFront (https://coldfront.ccr.buffalo.edu). The person in charge of the project you're on can add you to additional allocations that will provide access to those resources.