Users can log in to the compute nodes running their jobs. Once logged in to a compute node, the 'top' command is useful for monitoring user processes: it shows the CPU and memory utilization on that node. The RES field is the resident (physical) memory being used by the process. Typing Shift+H in top toggles the display of individual threads.

  1. Check the status of the job with squeue
  • squeue -u UBITusername
  • squeue -j jobid
  2. Log in to one of the compute nodes shown in the NODELIST column
  • ssh compute-node-name
  3. Use the 'top' command to show user processes
  • top
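
The steps above can also be run in one pass. The following is a minimal sketch, not a site-provided tool: the function names job_nodes and job_top are hypothetical, and it assumes squeue and scontrol are on the PATH and that ssh to the job's nodes works without a password prompt. It prints a single non-interactive 'top' snapshot from every node of the job.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of Slurm): one 'top' snapshot per job node.

job_nodes() {
    # squeue -h suppresses the header; -o %N prints only the node list,
    # in compressed form such as k08n41s[01-02].
    # 'scontrol show hostnames' expands that to one hostname per line.
    local jobid=$1
    scontrol show hostnames "$(squeue -h -j "$jobid" -o %N)"
}

job_top() {
    local jobid=$1 node user
    user=${USER:-$(id -un)}
    for node in $(job_nodes "$jobid"); do
        echo "== $node =="
        # top -b: batch (non-interactive) mode, -n 1: one iteration,
        # -u: show only this user's processes.
        ssh "$node" "top -b -n 1 -u $user"
    done
}
```

Usage: source the file, then run 'job_top jobid' from the front-end node. Add -H to the top options to list threads instead of processes.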



Example:

[cdc@rush:~]$ squeue -j 2587024
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           2587024     debug hybrid_h      cdc  R       0:19      2 k08n41s[01-02]


[cdc@rush:~]$ ssh k08n41s01
Last login: Wed Aug  6 10:34:14 2014 from k07n14.ccr.buffalo.edu


[cdc@k08n41s01 ~]$ top
top - 10:36:21 up 1 day,  2:36,  1 user,  load average: 11.55, 5.52, 2.13
Tasks: 333 total,   2 running, 329 sleeping,   0 stopped,   2 zombie
Cpu(s):100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  49416084k total,  7314148k used, 42101936k free,   178252k buffers
Swap: 49550328k total,        0k used, 49550328k free,  6346152k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND          
10188 cdc       20   0  928m  31m 9708 R 1200.4  0.1  36:13.03 hello.bin      
...

After typing Shift+H in top, each thread is listed on its own line:

[cdc@k08n41s01 ~]$ top
top - 10:37:51 up 1 day,  2:37,  1 user,  load average: 11.92, 7.22, 3.05
Tasks: 379 total,  13 running, 364 sleeping,   0 stopped,   2 zombie
Cpu(s):100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  49416084k total,  7315032k used, 42101052k free,   178292k buffers
Swap: 49550328k total,        0k used, 49550328k free,  6346288k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND          
10221 cdc       20   0  928m  31m 9708 R 100.3  0.1   4:31.25 hello.bin        
10220 cdc       20   0  928m  31m 9708 R 100.0  0.1   4:30.78 hello.bin        
10222 cdc       20   0  928m  31m 9708 R 100.0  0.1   4:31.28 hello.bin        
10223 cdc       20   0  928m  31m 9708 R 100.0  0.1   4:31.32 hello.bin        
10224 cdc       20   0  928m  31m 9708 R 100.0  0.1   4:30.76 hello.bin        
10225 cdc       20   0  928m  31m 9708 R 100.0  0.1   4:31.34 hello.bin        
10226 cdc       20   0  928m  31m 9708 R 100.0  0.1   4:30.93 hello.bin        
10227 cdc       20   0  928m  31m 9708 R 100.0  0.1   4:31.35 hello.bin        
10228 cdc       20   0  928m  31m 9708 R 100.0  0.1   4:30.88 hello.bin        
10229 cdc       20   0  928m  31m 9708 R 100.0  0.1   4:31.34 hello.bin        
10230 cdc       20   0  928m  31m 9708 R 100.0  0.1   4:31.06 hello.bin        
10188 cdc       20   0  928m  31m 9708 R 99.6  0.1   4:30.87 hello.bin        
10382 cdc       20   0 26164 1572 1044 R  0.3  0.0   0:00.08 top              
    1 root      20   0 21436 1572 1264 S  0.0  0.0   0:01.35 init              
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.01 kthreadd  
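
The two listings above are related: the twelve hello.bin rows in the second listing appear to be the threads of process 10188, and their %CPU values add up to roughly the 1200% shown for that process in the first listing. As a rough sketch, the sum can be checked with a small awk filter. The function name sum_cpu is hypothetical, and the filter assumes the default top column layout (%CPU in the ninth field, command name in the last field).

```shell
#!/usr/bin/env bash
# Hypothetical helper: count the rows and sum the %CPU column for one
# command name in 'top' output read from standard input.
sum_cpu() {
    # Assumes default top columns: field 9 is %CPU, last field is COMMAND.
    awk -v cmd="$1" '$NF == cmd { n++; cpu += $9 }
        END { printf "%d threads, %.1f%% CPU total\n", n, cpu }'
}
```

Usage on the compute node, with the PID taken from the example above: 'top -H -b -n 1 -p 10188 | sum_cpu hello.bin'. Here -H shows threads, -b and -n 1 produce a single batch snapshot, and -p restricts the output to one process.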



Monitoring a job using slurmjobvis graphical tool


How to retrieve job history and accounting data