I'm wondering about the different statuses shown for running jobs in qmon.
I have looked in the documentation and the help page, but I can't find the job status "t".
I'm guessing it means "terminate", since my jobs using InfiniBand currently don't work ^^, but if it is something else it would be good to know.
Thanks for the help.
A job has the status "t" (transferring) when qmaster has sent the job to an execution daemon. Once the daemon reports the job, the state switches to, e.g., "r" for running.
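To spot jobs that stay in "t" for a long time, one option is to filter the qstat listing by the state column. Below is a minimal Python sketch, assuming the usual column layout where the state is the fifth whitespace-separated field and job names contain no spaces. The function name and the sample listing are made up for illustration and are not output from a real cluster.

```python
def jobs_in_state(qstat_text, state_letter="t"):
    """Return (job_id, state) pairs whose state field contains state_letter."""
    matches = []
    for line in qstat_text.splitlines():
        fields = line.split()
        # Skip the header, the dashed separator, and blank lines:
        # data rows are assumed to start with a numeric job ID.
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        job_id, state = fields[0], fields[4]
        # Use "in" rather than "==" so combined state strings also match.
        if state_letter in state:
            matches.append((job_id, state))
    return matches


# Hypothetical sample listing (assumed layout, invented job data).
SAMPLE = """\
job-ID  prior   name       user         state submit/start at     queue          slots
--------------------------------------------------------------------------------------
    101 0.55500 sim_a      alice        r     04/17/2012 13:09:20 all.q@node01       1
    102 0.55500 sim_b      alice        t     04/17/2012 13:10:02 all.q@myhost       1
    103 0.55500 sim_c      bob          qw    04/17/2012 13:11:45                    1
"""

print(jobs_in_state(SAMPLE))  # [('102', 't')]
```

In practice the text would come from the real qstat output instead of the sample string. If the same job IDs keep showing up over several runs, that suggests the daemon never reported them back.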
I have a problem with jobs that have run to completion: the program ended, but the jobs remain in the qstat output with state "t".
The only solution I've come up with is to run qdel -f on the jobs (force option), because a plain qdel only registers the jobs for deletion but doesn't actually delete them.
This problem happens only with one host. Do you have any idea why the jobs never switched to "r"? Is this a problem with the execution daemon? How is the daemon connected to hosts? Is it supposed to run on the hosts or on the master?
Thanks for your help,
I've started to figure out my problem by looking at the log under $SGE_ROOT/$SGE_CELL/spool/<hostname> on my host:
04/17/2012 13:09:20| main|myhost|W|can't register at qmaster "mymaster": abort qmaster registration due to communication errors
04/17/2012 13:09:20| main|myhost|E|commlib error: endpoint is not unique error (endpoint "myhost/execd/1" is already connected)
04/18/2012 13:33:44| main|myhost|W|can't register at qmaster "mymaster": abort qmaster registration due to communication errors
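The log lines above appear to use a pipe-separated layout: timestamp, component, host, severity letter, message. Here is a minimal Python sketch for pulling out the warnings and errors, assuming that layout holds for every line. The function names are hypothetical, and the sample data is just the three lines quoted above.

```python
def parse_log_line(line):
    """Split one log line into its five pipe-separated fields, or return None."""
    parts = line.split("|", 4)  # at most 5 fields; the message itself may contain "|"
    if len(parts) != 5:
        return None
    timestamp, component, host, level, message = (p.strip() for p in parts)
    return {
        "timestamp": timestamp,
        "component": component,
        "host": host,
        "level": level,
        "message": message,
    }


def warnings_and_errors(lines):
    """Keep only the entries whose severity letter is W or E."""
    entries = (parse_log_line(line) for line in lines)
    return [e for e in entries if e and e["level"] in ("W", "E")]


# The three lines quoted in this thread.
LOG = """\
04/17/2012 13:09:20| main|myhost|W|can't register at qmaster "mymaster": abort qmaster registration due to communication errors
04/17/2012 13:09:20| main|myhost|E|commlib error: endpoint is not unique error (endpoint "myhost/execd/1" is already connected)
04/18/2012 13:33:44| main|myhost|W|can't register at qmaster "mymaster": abort qmaster registration due to communication errors
"""

for entry in warnings_and_errors(LOG.splitlines()):
    print(entry["timestamp"], entry["level"], entry["message"])
```

Filtering this way makes the "endpoint is not unique" error stand out from the repeated registration warnings around it.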