Oftentimes, nodes in the various clusters are marked offline.  Sometimes SLURM automatically sets a node offline if there is a problem communicating with it.  Other times, CCR system administrators mark nodes offline to troubleshoot problems or to test upgrades.  You can see why nodes are offline using the following commands:

To see all offline nodes in the academic (UB-HPC) cluster:

sinfo -R

[ccruser@rush:~]$ sinfo -R
REASON             USER   TIMESTAMP            NODELIST
bad hard drive     djm29  2016-11-14T16:11:33  cpn-d09-38-01
NHC: check_ps_cpu: root   2016-11-15T12:03:49  cpn-k10-05-01
bad disk           root   2016-11-03T10:07:32  cpn-d07-33-02
Not responding     slurm  2016-11-10T09:16:25  cpn-k10-23-01
gres/gpu count too slurm  2016-10-25T16:01:08  cpn-k06-11-02
gres/gpu count too slurm  2016-10-25T16:01:06  cpn-k06-12-01
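Note that sinfo truncates the REASON column to 20 characters by default, which is why the gres/gpu entries above appear cut off.  If you need the full reason text, you can widen the column with sinfo's -o format option.  A minimal sketch (the field widths here are arbitrary; %E is the reason, %u the user who set it, %H the timestamp, and %N the node list):

sinfo -R -o "%50E %12u %19H %N"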

To see offline nodes in other clusters:

MAE cluster:  sinfo -R -M mae

Physics cluster:  sinfo -R -M physics

Industry cluster:  sinfo -R -M industry
*NOTE: When logged into presto, the industry cluster front end, you can simply use:  sinfo -R

Chemistry cluster:  sinfo -R -M chemistry
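If you would rather check every cluster with one command, sinfo also accepts all as the -M argument (this should work anywhere the clusters are set up for SLURM multi-cluster operation, as the examples above suggest they are here):

sinfo -R -M all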

To see a specific partition in a cluster, add the -p option.  For example, to show the offline nodes in the ezurek partition of the Chemistry cluster:

sinfo -R -M chemistry -p ezurek

[ccruser@rush:~]$ sinfo -R -M chemistry -p ezurek
CLUSTER: chemistry
REASON          USER   TIMESTAMP            NODELIST
bad hard drive  djm29  2016-11-15T13:52:17  cpn-d12-10
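If you already know which node you are interested in, you can also narrow the output to specific nodes with the -n option.  For example, to check only the node from the listing above:

sinfo -R -M chemistry -n cpn-d12-10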

List of possible reasons nodes are offline:

If a system administrator sets a node offline, they will enter a description of what is wrong.  Examples of this include, but are not limited to, the following (see the scontrol sketch after this list):

  • hardware failure
  • bad hard drive
  • IB problem (indicates a problem with the InfiniBand network)
  • testing
  • DEAD (indicates node is not repairable and offline permanently)
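For reference, these descriptions are attached when an administrator drains the node with scontrol.  A sketch of what that looks like, using the first entry from the example output above (running this requires SLURM administrator privileges):

scontrol update NodeName=cpn-d09-38-01 State=DRAIN Reason="bad hard drive"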

If SLURM takes the node offline, you may see reasons such as the following (a filtering example follows this list):

  • gres/gpu count too low (indicates a problem with a GPU node not reporting the correct number of GPUs)
  • not responding (SLURM can't communicate with the node)
  • low RealMemory (node isn't reporting the correct amount of RAM)
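Because SLURM records itself as the user when it takes a node offline, one quick way to see only these automatic entries is to filter the output on the USER column:

sinfo -R | grep -w slurm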

We run a script called Node Health Check (NHC) that verifies a node is ready to run jobs.  If a node fails any of its tests, the NHC script sets it offline.  You may see reasons such as the following (a configuration sketch follows this list):

  • NHC: check_ps_cpu (indicates a runaway process on a node)
  • NHC: check_fs (indicates one or more of the network filesystems is not mounted correctly or a directory is near capacity)
  • NHC: nv_health (indicates a problem with an NVIDIA GPU)
  • NHC: check_hw_ib (indicates an InfiniBand problem)
  • NHC: check_ps_daemon (indicates a problem with the authentication service)
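These names correspond to checks listed in the NHC configuration file.  As a rough illustration only, a configuration using the LBNL NHC syntax might contain lines like the following (the actual checks, arguments, and thresholds CCR uses will differ):

# every node (*) must have /home mounted read-write
 * || check_fs_mount_rw /home
# the authentication (MUNGE) daemon must be running as user munge
 * || check_ps_daemon munged munge
# the InfiniBand link must be up at the expected data rate
 * || check_hw_ib 56

Each line says that every node matching the pattern on the left must pass the named check; otherwise NHC marks the node offline with the check name as the reason.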

CCR staff monitor all clusters daily and work as quickly as possible to repair problems and return nodes to production.