snodes is a useful command for determining what resources are installed in the clusters and what is currently available to run jobs on.  For every node, snodes shows its state (idle, allocated, drained/offline), how many CPUs it has and how many of those are allocated, the current load on the machine, how much RAM it has, GPU info (where applicable), which partition(s) it belongs to, and the Slurm features associated with it.



[ccruser@vortex1:~]$ snodes --help
==============================================================
Display information about one or more nodes, possibly filtered
by partition and/or state.

If no node arg or 'all' is provided, all nodes will be
summarized. Similar behavior exists for the partition and
state(s) args

Usage:   snodes [node1,node2,etc.] [cluster/partition] [state(s)]

==============================================================
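
Per the usage line above, the node list, cluster/partition, and state(s) are all optional positional arguments, so a state can be passed directly instead of filtering the output afterwards.  A hypothetical invocation following that usage string:

[ccruser@vortex1:~]$ snodes all ub-hpc/general-compute idle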



See everything in the default cluster:
[ccruser@vortex1:~]$ snodes
HOSTNAMES     STATE    CPUS S:C:T    CPUS(A/I/O/T)   CPU_LOAD MEMORY   GRES         PARTITION          AVAIL_FEATURES
cpn-d07-04-01 idle     8    2:4:1    0/8/0/8         0.01     23000    (null)       general-compute*   UBHPC,CPU-L5630,INTEL
cpn-d07-04-02 idle     8    2:4:1    0/8/0/8         0.05     23000    (null)       general-compute*   UBHPC,CPU-L5630,INTEL
cpn-d07-05-01 idle     8    2:4:1    0/8/0/8         0.03     23000    (null)       general-compute*   UBHPC,CPU-L5630,INTEL
cpn-d07-05-02 idle     8    2:4:1    0/8/0/8         0.01     23000    (null)       general-compute*   UBHPC,CPU-L5630,INTEL
cpn-d07-06-01 drain    8    2:4:1    0/0/8/8         0.01     23000    (null)       general-compute*   UBHPC,CPU-L5630,INTEL
cpn-d07-06-02 idle     8    2:4:1    0/8/0/8         0.01     23000    (null)       general-compute*   UBHPC,CPU-L5630,INTEL
...
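
For a quick tally of how many nodes are in each state, the output can be piped through standard text tools.  This is just a convenience sketch; it assumes STATE is the second column, as in the output above:

[ccruser@vortex1:~]$ snodes | awk 'NR>1 {print $2}' | sort | uniq -c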



See everything in a specific partition:

[ccruser@vortex1:~]$ snodes all ub-hpc/debug
HOSTNAMES     STATE    CPUS S:C:T    CPUS(A/I/O/T)   CPU_LOAD MEMORY   GRES         PARTITION          AVAIL_FEATURES
cpn-k05-22    idle     16   2:8:1    0/16/0/16       0.02     128000   (null)       debug              CPU-E5-2660,INTEL
cpn-k05-26    idle     16   2:8:1    0/16/0/16       0.01     128000   (null)       debug              UBHPC,CPU-E5-2660,INTEL
cpn-k08-34-01 idle     12   2:6:1    0/12/0/12       0.15     48000    (null)       debug              UBHPC,IB,CPU-E5645,INTEL
cpn-k08-34-02 idle     12   2:6:1    0/12/0/12       0.02     48000    (null)       debug              UBHPC,IB,CPU-E5645,INTEL
cpn-k08-40-01 idle     12   2:6:1    0/12/0/12       0.08     48000    (null)       debug              UBHPC,IB,CPU-E5645,INTEL
cpn-k08-41-01 idle     12   2:6:1    0/12/0/12       0.04     48000    (null)       debug              UBHPC,IB,CPU-E5645,INTEL
cpn-k08-41-02 idle     12   2:6:1    0/12/0/12       0.02     48000    (null)       debug              UBHPC,IB,CPU-E5645,INTEL
cpn-u28-38    mix      32   2:16:1   28/4/0/32       14.07    187000   gpu:tesla_v1 debug              UBHPC,CPU-Gold-6130,INTE



Search for a specific Slurm feature (in this example, Infiniband (IB)):

[ccruser@vortex1:~]$ snodes all ub-hpc/general-compute | grep IB
cpn-f16-03    alloc    16   2:8:1    16/0/0/16       0.02     128000   (null)       general-compute*   UBHPC,IB,CPU-E5-2660,INTEL
cpn-f16-04    alloc    16   2:8:1    16/0/0/16       1.15     128000   (null)       general-compute*   UBHPC,IB,CPU-E5-2660,INTEL
cpn-f16-05    alloc    16   2:8:1    16/0/0/16       0.37     128000   (null)       general-compute*   UBHPC,IB,CPU-E5-2660,INTEL
cpn-f16-06    alloc    16   2:8:1    16/0/0/16       0.97     128000   (null)       general-compute*   UBHPC,IB,CPU-E5-2660,INTEL
cpn-f16-07    alloc    16   2:8:1    16/0/0/16       1.09     128000   (null)       general-compute*   UBHPC,IB,CPU-E5-2660,INTEL
cpn-f16-08    alloc    16   2:8:1    16/0/0/16       15.98    128000   (null)       general-compute*   UBHPC,IB,CPU-E5-2660,INTEL
cpn-f16-09    alloc    16   2:8:1    16/0/0/16       1.11     128000   (null)       general-compute*   UBHPC,IB,CPU-E5-2660,INTEL
...
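
The greps can also be chained to narrow things further, for example to find IB-equipped nodes that are currently idle (a sketch built on the command above):

[ccruser@vortex1:~]$ snodes all ub-hpc/general-compute | grep IB | grep idle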



See everything in a partition of a cluster other than the default:

[ccruser@vortex1:~]$ snodes all faculty/scavenger
HOSTNAMES     STATE    CPUS S:C:T    CPUS(A/I/O/T)   CPU_LOAD MEMORY   GRES         PARTITION          AVAIL_FEATURES
cpn-d11-01    alloc    8    2:4:1    8/0/0/8         0.39     23000    (null)       scavenger          FACULTY,CPU-L5520,INTEL
cpn-d11-02    alloc    8    2:4:1    8/0/0/8         0.41     23000    (null)       scavenger          FACULTY,CPU-L5520,INTEL
cpn-d11-03    alloc    8    2:4:1    8/0/0/8         0.01     23000    (null)       scavenger          FACULTY,CPU-L5520,INTEL
cpn-d11-04    alloc    8    2:4:1    8/0/0/8         0.46     23000    (null)       scavenger          FACULTY,CPU-L5520,INTEL
cpn-d11-05    alloc    8    2:4:1    8/0/0/8         0.14     23000    (null)       scavenger          FACULTY,CPU-L5520,INTEL
cpn-d11-07    alloc    8    2:4:1    8/0/0/8         0.97     23000    (null)       scavenger          FACULTY,CPU-L5520,INTEL
cpn-d11-08    alloc    8    2:4:1    8/0/0/8         0.69     23000    (null)       scavenger          FACULTY,CPU-L5520,INTEL
cpn-d11-09    alloc    8    2:4:1    8/0/0/8         0.04     23000    (null)       scavenger          FACULTY,CPU-L5520,INTEL
cpn-d11-10    mix      8    2:4:1    1/7/0/8         1.19     23000    (null)       scavenger          FACULTY,CPU-L5520,INTEL
cpn-d11-11    mix      8    2:4:1    1/7/0/8         1.24     23000    (null)       scavenger          FACULTY,CPU-L5520,INTEL
cpn-d11-12    alloc    8    2:4:1    8/0/0/8         0.04     23000    (null)       scavenger          FACULTY,CPU-L5520,INTEL
cpn-d11-13    alloc    8    2:4:1    8/0/0/8         3.99     48000    (null)       scavenger          FACULTY,CPU-E5520,INTEL
cpn-d11-18    alloc    8    2:4:1    8/0/0/8         8.01     23000    (null)       scavenger          FACULTY,CPU-L5520,INTEL
cpn-d12-02    mix      8    2:4:1    1/7/0/8         1.03     23000    (null)       scavenger          FACULTY,CPU-L5520,INTEL
cpn-d12-03    alloc    8    2:4:1    8/0/0/8         0.27     23000    (null)       scavenger          FACULTY,CPU-L5520,INTEL
...



Show all idle nodes in a specified cluster and partition:

[ccruser@vortex1:~]$ snodes all faculty/scavenger | grep idle
cpn-d12-09    idle     8    2:4:1    0/8/0/8         0.19     23000    (null)       scavenger          FACULTY,CPU-L5520,INTEL
cpn-d12-11    idle     8    2:4:1    0/8/0/8         0.02     23000    (null)       scavenger          FACULTY,CPU-L5520,INTEL
cpn-d12-12    idle     8    2:4:1    0/8/0/8         0.04     23000    (null)       scavenger          FACULTY,CPU-L5520,INTEL
cpn-f11-05    idle     24   2:12:1   0/24/0/24       0.04     256000   (null)       scavenger          FACULTY,CPU-E5-2650v4,INTEL
cpn-f11-06    idle     24   2:12:1   0/24/0/24       0.03     256000   (null)       scavenger          FACULTY,CPU-E5-2650v4,INTEL
cpn-f11-07    idle     24   2:12:1   0/24/0/24       0.03     256000   (null)       scavenger          FACULTY,CPU-E5-2650v4,INTEL
cpn-f11-08    idle     24   2:12:1   0/24/0/24       0.01     256000   (null)       scavenger          FACULTY,CPU-E5-2650v4,INTEL
cpn-f11-09    idle     24   2:12:1   0/24/0/24       0.01     256000   (null)       scavenger          FACULTY,CPU-E5-2650v4,INTEL
...
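
If you only want the number of matching nodes rather than the full listing, grep's -c option will count them (a convenience sketch on top of the same command):

[ccruser@vortex1:~]$ snodes all faculty/scavenger | grep -c idle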



The format of the Slurm features in the snodes command output is:

CLUSTER, CPU_MODEL, CPU_MANUFACTURER, RACK, [FUNDING SOURCE, INTERCONNECT, GPU_MODEL]

Anything in [ ] is optional and depends on what hardware is in the node.



Using this node as an example, here are more specifics about the different snodes columns:


HOSTNAMES     STATE    CPUS S:C:T    CPUS(A/I/O/T)   CPU_LOAD MEMORY   GRES                                PARTITION          AVAIL_FEATURES
cpn-q06-20    alloc    40   2:20:1   40/0/0/40       14.90    187000   gpu:tesla_v100-pcie-32gb:2(S:0-1)   general-compute*   UBHPC,CPU-Gold-6230,INTEL,q06,NIH,IB,V100

Node hostname - cpn-q06-20.  CCR's naming convention is based on physical rack location: cpn=compute node, viz=visualization node, srv=server.  q06 is the rack's location on the machine room floor, laid out on an alpha-numeric grid to make equipment easier to locate, and 20 is the slot ('u') in the rack where the server sits (most racks have 40 slots).


State - alloc means the node is fully in use.  Other possible states include idle, mix (the running job(s) haven't requested all of the resources on the node), and drain (the node is offline due to problems, or we've taken it offline for maintenance).


CPUS - total number of CPUs in the node - this node has 40


S:C:T - sockets:cores:threads - this is how those total CPUS are physically made up.  This node has 2 sockets, 20 cores per socket, and 1 thread per core


CPUS(A/I/O/T) - how the node's CPUs are currently being used - A=allocated, I=idle, O=offline, T=total.  This node shows all 40 in use.  This is a good measure to check for your own jobs: if you request 40 CPUs and 39 of them sit idle, your program isn't using everything you asked for and resources are being wasted.
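
As a rough gauge of spare capacity in a partition, the idle counts in this column can be totalled with awk.  This is only a sketch; it assumes CPUS(A/I/O/T) is the fifth column, as in the output above, with the idle count as the second slash-separated value:

[ccruser@vortex1:~]$ snodes all ub-hpc/general-compute | awk 'NR>1 {split($5,c,"/"); idle+=c[2]} END {print idle, "idle CPUs"}'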


CPU_LOAD - the average load on the node's CPUs at the time the snodes command is run.  This node shows a current CPU load of 14.90.  (It is not unusual for GPU nodes to have low CPU usage; however, if you don't need all the CPUs in a node, don't request them.)


MEMORY - the total memory available to request on that node, listed in MB - this node has 187000 MB (187GB)
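
To find nodes with at least a certain amount of memory, the MEMORY column can be filtered numerically.  A sketch assuming MEMORY is the seventh column and is reported in MB, matching the output above:

[ccruser@vortex1:~]$ snodes all ub-hpc/general-compute | awk 'NR>1 && $7 >= 128000'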


GRES - GPU info, if the node has a GPU - this node has 2 Tesla V100 32GB GPUs
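
To list only nodes that have GPUs, filter on the gpu: prefix that appears in the GRES column (a simple sketch):

[ccruser@vortex1:~]$ snodes all ub-hpc/general-compute | grep gpu: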


PARTITION - which partition(s) the node is part of - this node is in the general-compute and scavenger partitions


AVAIL_FEATURES - These are "tags" we set on the nodes to specify hardware details, such as CPU type (INTEL, CPU-Gold-6230), advanced networking (IB), which rack in the machine room the node is located in (q06), which cluster it is part of (UB-HPC), and, sometimes, the grant that purchased the equipment (NIH)
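
These feature tags are also what Slurm's --constraint option matches when you submit a job, so they can be used to target specific hardware.  A minimal batch-script sketch (the feature names are taken from the examples above; substitute whatever features your job actually needs):

#!/bin/bash
#SBATCH --clusters=ub-hpc
#SBATCH --partition=general-compute
#SBATCH --constraint=IB             # only run on nodes tagged with the IB feature
##SBATCH --constraint="IB&V100"     # multiple features can be combined with &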