What is QOS (quality of service)?

At CCR we use the quality of service (QOS) feature of the SLURM scheduler in two ways. The first is to limit the number of jobs each user can submit. To keep things fair for everyone, a user can have at most 1000 jobs at one time in the general-compute partition, and likewise at most 1000 in the scavenger partitions. Viz partition jobs are limited to 1 per user, and a user can submit at most 4 jobs to the debug partition. The second is to give those who financially support our center a priority boost. The list below shows all the QOS values available on CCR clusters. Your account has been given access to the ones we believe you should have. If you find an error, please submit a help ticket.

List of CCR QOS settings:

debug = ub-hpc cluster, debug partition, maxSubmitJobsPerUser=4 - available to academic users only

general-compute = ub-hpc cluster, general-compute partition, maxSubmitJobsPerUser=1000 - available to academic users only

viz = available only through the OnDemand Portal - NOTE: any jobs submitted directly to this partition will be rejected. You must use the portal.

industry = ub-hpc cluster, industry partition, maxSubmitJobsPerUser=1000 - available to industry users only

supporters = priority boost available to financial contributors to CCR (How do I become a CCR supporter?)

scavenger = includes the scavenger partitions on the industry partition and the faculty clusters, maxSubmitJobsPerUser=1000 - available to academic users only

Various QOS values on the faculty clusters - these are private faculty cluster QOS settings, available only to members of the corresponding faculty group. maxSubmitJobsPerUser=1000 on all partitions.
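As a rough illustration of how a per-user submit limit behaves, the sketch below encodes the limits listed above and checks whether a batch of new jobs would fit. It is a hypothetical helper, not a CCR tool: the names SUBMIT_LIMITS, submit_headroom, and would_be_rejected are made up for this example, and SLURM itself enforces the real limits. Both pending and running jobs count toward the limit. The supporters QOS (a priority boost, not a submit limit) and the private faculty QOS values are left out.

```python
# Hypothetical helper: per-user submit limits as listed on this page.
# SLURM, not this code, enforces the real limits.
SUBMIT_LIMITS = {
    "debug": 4,
    "general-compute": 1000,
    "industry": 1000,
    "scavenger": 1000,
    "viz": 1,  # viz jobs must go through the OnDemand portal
}


def submit_headroom(qos: str, jobs_in_queue: int) -> int:
    """Return how many more jobs can be submitted under `qos`.

    `jobs_in_queue` is the user's current count of pending plus running
    jobs in that QOS; both states count toward the submit limit.
    """
    if qos not in SUBMIT_LIMITS:
        raise KeyError(f"unknown QOS: {qos}")
    return max(0, SUBMIT_LIMITS[qos] - jobs_in_queue)


def would_be_rejected(qos: str, jobs_in_queue: int, new_jobs: int = 1) -> bool:
    """True if submitting `new_jobs` more jobs would exceed the limit."""
    return new_jobs > submit_headroom(qos, jobs_in_queue)


if __name__ == "__main__":
    print(submit_headroom("debug", 3))                   # 1
    print(would_be_rejected("general-compute", 1000))    # True
    print(would_be_rejected("general-compute", 998, 2))  # False
```

With 998 jobs already queued in general-compute, two more fit exactly; a third would trigger the submission error described below.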

You can see which QOS settings you have access to by running the slimit command.

If you try to submit more jobs than a QOS allows, sbatch fails with:

sbatch: error: QOSMaxSubmitJobPerUserLimit
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limit)

You will get this error if you have reached one of the limits described above. For example, if you already have 1000 jobs in the general-compute partition and try to submit another one, the submission is rejected. Wait for some of your jobs to finish, then submit more.
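If you have more jobs to run than a limit allows, one option is to submit them from a loop that waits whenever you are at the limit. The Python sketch below is a hypothetical helper, not a CCR tool. It assumes standard SLURM commands: squeue -u USER -h -o %q to list the QOS of each of your pending and running jobs, and sbatch SCRIPT to submit a script that sets its own partition and QOS with #SBATCH lines. The function names, the 300-second polling interval, and the script names are illustrative. The SLURM calls are passed in as functions so the loop can be tried without a cluster.

```python
import subprocess
import time
from typing import Callable, Iterable, List


def count_in_qos(squeue_output: str, qos: str) -> int:
    """Count lines of `squeue -h -o %q` output that match `qos`."""
    return sum(1 for line in squeue_output.splitlines() if line.strip() == qos)


def count_queued_jobs(user: str, qos: str) -> int:
    """Count the user's pending and running jobs in `qos` by calling squeue."""
    out = subprocess.run(
        ["squeue", "-u", user, "-h", "-o", "%q"],
        capture_output=True, text=True, check=True,
    ).stdout
    return count_in_qos(out, qos)


def sbatch(script: str) -> None:
    """Submit one batch script; raises CalledProcessError if sbatch rejects it."""
    subprocess.run(["sbatch", script], check=True)


def submit_throttled(
    scripts: Iterable[str],
    limit: int,
    count_jobs: Callable[[], int],
    submit: Callable[[str], None],
    sleep: Callable[[float], None] = time.sleep,
    poll_seconds: float = 300.0,
) -> List[str]:
    """Submit scripts in order, pausing whenever the queue is at `limit`.

    The job count is tracked locally after each submission and only
    re-read from SLURM while waiting, so squeue is not called per job.
    """
    submitted: List[str] = []
    in_queue = count_jobs()
    for script in scripts:
        # At the limit: wait for some jobs to finish, then re-check.
        while in_queue >= limit:
            sleep(poll_seconds)
            in_queue = count_jobs()
        submit(script)
        submitted.append(script)
        in_queue += 1
    return submitted


# Example usage (hypothetical user name and script names; not run here):
#   submit_throttled(
#       [f"job_{i}.sh" for i in range(1500)],
#       limit=1000,
#       count_jobs=lambda: count_queued_jobs("myuser", "general-compute"),
#       submit=sbatch,
#   )
```

Because the count is only a snapshot, a submission can still be rejected if other jobs are added in between; in that case sbatch exits with an error and the loop stops.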

Questions or problems?

If you believe this is in error, first check what access you have been given in ColdFront (https://coldfront.ccr.buffalo.edu). The person in charge of the project you're on can add you to additional allocations that provide access to those resources.