Check out these how-to guides for monitoring your jobs


View this article about using Grafana graphs in OnDemand to get detailed hardware metrics



Users can log in to the compute nodes running their jobs. Once logged in to a compute node, the 'top' command is useful for monitoring your processes. It shows CPU and memory utilization on the node. The RES field is the resident (physical) memory in use by a process. Pressing Shift+H in top toggles the display of individual threads.

  1. Check the status of the job with squeue and note the nodes it is running on
  • squeue -u UBITusername
  • squeue -j jobid
  2. Log in to one of the job's compute nodes
  • ssh compute-node-name
  3. Use the 'top' or 'htop' command to show your processes
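
The steps above can be sketched as a small shell helper. This is a minimal sketch, not a CCR-provided tool: the function name job_nodes is hypothetical, and it assumes the standard Slurm commands squeue and scontrol are on your path. It prints one hostname per line for a given job ID.

```shell
#!/usr/bin/env bash
# Sketch: list the compute nodes assigned to a job, one hostname per line.
# job_nodes is a hypothetical helper name, not a CCR-provided command.
job_nodes() {
    local jobid="$1" nodelist
    # -h drops the header row; -o %N prints only the compressed node list,
    # for example k08n41s[01-02].
    nodelist="$(squeue -h -j "$jobid" -o %N)"
    # Expand the compressed list into individual hostnames.
    scontrol show hostnames "$nodelist"
}

# Example (uncomment and substitute your own job ID):
# for node in $(job_nodes 2587024); do
#     echo "== $node =="
#     ssh "$node" "top -b -n 1 -u $USER | head -n 20"
# done
```

Running top in batch mode (-b -n 1) prints a single snapshot and exits, which is convenient when checking several nodes in a loop instead of opening an interactive session on each one.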



Example:

[cdc@vortex:~]$ squeue -j 2587024
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2587024 debug hybrid_h cdc R 0:19 2 k08n41s[01-02]


[cdc@vortex:~]$ ssh k08n41s01
Last login: Wed Aug 6 10:34:14 2014 from k07n14.ccr.buffalo.edu


[cdc@k08n41s01 ~]$ top
top - 10:36:21 up 1 day, 2:36, 1 user, load average: 11.55, 5.52, 2.13
Tasks: 333 total, 2 running, 329 sleeping, 0 stopped, 2 zombie
Cpu(s):100.0%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 49416084k total, 7314148k used, 42101936k free, 178252k buffers
Swap: 49550328k total, 0k used, 49550328k free, 6346152k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
10188 cdc 20 0 928m 31m 9708 R 1200.4 0.1 36:13.03 hello.bin
...

[cdc@k08n41s01 ~]$ top
top - 10:37:51 up 1 day, 2:37, 1 user, load average: 11.92, 7.22, 3.05
Tasks: 379 total, 13 running, 364 sleeping, 0 stopped, 2 zombie
Cpu(s):100.0%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 49416084k total, 7315032k used, 42101052k free, 178292k buffers
Swap: 49550328k total, 0k used, 49550328k free, 6346288k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
10221 cdc 20 0 928m 31m 9708 R 100.3 0.1 4:31.25 hello.bin
10220 cdc 20 0 928m 31m 9708 R 100.0 0.1 4:30.78 hello.bin
10222 cdc 20 0 928m 31m 9708 R 100.0 0.1 4:31.28 hello.bin
10223 cdc 20 0 928m 31m 9708 R 100.0 0.1 4:31.32 hello.bin
10224 cdc 20 0 928m 31m 9708 R 100.0 0.1 4:30.76 hello.bin
10225 cdc 20 0 928m 31m 9708 R 100.0 0.1 4:31.34 hello.bin
10226 cdc 20 0 928m 31m 9708 R 100.0 0.1 4:30.93 hello.bin
10227 cdc 20 0 928m 31m 9708 R 100.0 0.1 4:31.35 hello.bin
10228 cdc 20 0 928m 31m 9708 R 100.0 0.1 4:30.88 hello.bin
10229 cdc 20 0 928m 31m 9708 R 100.0 0.1 4:31.34 hello.bin
10230 cdc 20 0 928m 31m 9708 R 100.0 0.1 4:31.06 hello.bin
10188 cdc 20 0 928m 31m 9708 R 99.6 0.1 4:30.87 hello.bin
10382 cdc 20 0 26164 1572 1044 R 0.3 0.0 0:00.08 top
1 root 20 0 21436 1572 1264 S 0.0 0.0 0:01.35 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.01 kthreadd
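
A long listing like the one in the example above can be condensed into one line per command. The following is a rough sketch under stated assumptions: summarize_top is a hypothetical helper, and it assumes top's default column order as shown in the example (%CPU in column 9, COMMAND in column 12). It reads top's batch output on stdin and prints the command name, the number of entries, and their combined %CPU. It deliberately does not sum RES, because threads of one process share the same memory.

```shell
#!/usr/bin/env bash
# Sketch: summarize `top -b` output by command name.
# summarize_top is a hypothetical helper; it assumes the default columns
# PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND.
summarize_top() {
    awk '
        $1 == "PID" { in_table = 1; next }   # process rows start after the header
        in_table && NF >= 12 {
            count[$12]++                     # entries per command (threads with -H)
            cpu[$12] += $9                   # combined %CPU per command
        }
        END {
            for (c in count)
                printf "%s %d %.1f\n", c, count[c], cpu[c]
        }
    '
}

# Example on a compute node (-H lists threads individually):
# top -b -H -n 1 -u "$USER" | summarize_top
```

For a 12-thread job that is using its cores fully, the combined %CPU should be close to 1200, matching the single-process view in the first top output of the example.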



How to retrieve job history and accounting data