Check your job status on the cluster
Intro: The HP XC-4000 cluster uses LSF and SLURM together to manage and monitor your jobs. The command for monitoring your jobs is "bjobs".
Example output:
[testy@n137 ~]$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
667988 testy PEND private n63 gethosts Aug 31 09:47
[testy@n137 ~]$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
667988 testy RUN private n63 lsfhost.l gethosts Aug 31 09:47
The STAT field shows the status of the job and can be one of the following values:
* PEND: the job is pending in the queue
* RUN: the job has been launched and is running on a system
* SSUSP: the job has been automatically suspended by the system (and will be automatically resumed when favorable conditions return)
* USUSP: the job has been suspended by the owner or administrator via the bstop command (bresume resumes that job)
* PSUSP: the job has been suspended by the owner or administrator while pending.
For detailed information about a job, pass the -l option followed by the job ID:
[testy@n63 ~]$ bjobs -l 667988
If you want to know which node has been allocated to your job, use the following command to grep for the SLURM allocation:
[testy@n137 ~]$ bjobs -l 667988 | grep slurm
Mon Sep 4 13:52:43: slurm_id=284200;ncpus=4;slurm_alloc=n[18];
The example output shows slurm_alloc=n[18], which means the system allocated node n18 to your program.
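If only the node list is needed, the grep above can be narrowed with sed. The following is a minimal sketch, not a site-provided tool: the function name slurm_alloc is hypothetical, and it assumes the slurm_alloc=...; value sits on a single line, as in the sample output above.

```shell
# Hypothetical helper (not part of LSF or SLURM): print the slurm_alloc
# value found in "bjobs -l" output read from stdin.
# Assumption: the value is terminated by ";" and is not wrapped across lines.
slurm_alloc() {
    sed -n 's/.*slurm_alloc=\([^;]*\);.*/\1/p'
}

# Demo on the sample line shown above
sample='Mon Sep 4 13:52:43: slurm_id=284200;ncpus=4;slurm_alloc=n[18];'
printf '%s\n' "$sample" | slurm_alloc
```

On the cluster this would presumably be used as "bjobs -l 667988 | slurm_alloc".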
From time to time you will want to know how busy the system is, because that determines when resources become available and when your job will start.
To check all jobs on the cluster:
[testy@n137 ~]$ bjobs -u all
To list all jobs in the PEND state:
[testy@n137 ~]$ bjobs -u all | grep PEND
If you only want the number of pending jobs:
[testy@n137 ~]$ bjobs -u all | grep PEND | wc -l
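To see the load across all states at once rather than only PEND, the listing can be tallied with awk. This is a sketch under stated assumptions: count_by_stat is a hypothetical helper name, and STAT is assumed to be the third column, as in the example listings on this page.

```shell
# Hypothetical helper (not part of LSF): tally jobs per STAT value.
# Reads bjobs output on stdin; the header line (starting with JOBID) is skipped.
count_by_stat() {
    awk '$1 != "JOBID" { n[$3]++ } END { for (s in n) print s, n[s] }' | sort
}

# Demo on rows copied from the example listings on this page
count_by_stat <<'EOF'
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
667988 testy PEND private n63 gethosts Aug 31 09:47
667990 testy RUN private n63 lsfhost.loc *thosts[1] Aug 31 10:24
667990 testy RUN private n63 lsfhost.loc *thosts[2] Aug 31 10:24
EOF
```

On the cluster the same function would read from "bjobs -u all | count_by_stat". Unlike a plain grep, matching on the third column avoids counting a job whose name happens to contain the text PEND.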
If you have submitted a job array (please refer to "how to submit massive serial job"), bjobs prints one line per array element:
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
667990 testy RUN private n63 lsfhost.loc *thosts[1] Aug 31 10:24
667990 testy RUN private n63 lsfhost.loc *thosts[2] Aug 31 10:24
667990 testy RUN private n63 lsfhost.loc *thosts[3] Aug 31 10:24
667990 testy RUN private n63 lsfhost.loc *thosts[4] Aug 31 10:24
667990 testy RUN private n63 lsfhost.loc *thosts[5] Aug 31 10:24
667990 testy PEND private n63 lsfhost.loc *thosts[6] Aug 31 10:24
667990 testy PEND private n63 lsfhost.loc *thosts[7] Aug 31 10:24
667990 testy PEND private n63 lsfhost.loc *thosts[8] Aug 31 10:24
As you can see, some array elements are running and some are still pending.
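For a large array it can help to list just the indices that are still waiting. The sketch below is illustrative only: pending_indices is a hypothetical helper name, and it assumes the array index appears in square brackets at the end of the JOB_NAME field, as in the listing above.

```shell
# Hypothetical helper (not part of LSF): print the array index of every
# element whose STAT column is PEND. Reads bjobs output on stdin.
pending_indices() {
    awk '$3 == "PEND" {
        for (i = 4; i <= NF; i++) {
            if ($i ~ /\[[0-9]+\]$/) {   # field ends in "[number]"
                s = $i
                sub(/^.*\[/, "", s)     # drop everything up to "["
                sub(/\]$/, "", s)       # drop the trailing "]"
                print s
            }
        }
    }'
}

# Demo on rows copied from the listing above
pending_indices <<'EOF'
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
667990 testy RUN private n63 lsfhost.loc *thosts[5] Aug 31 10:24
667990 testy PEND private n63 lsfhost.loc *thosts[6] Aug 31 10:24
667990 testy PEND private n63 lsfhost.loc *thosts[7] Aug 31 10:24
EOF
```

On the cluster the input would come from bjobs itself, for example "bjobs 667990 | pending_indices".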