A job has the status "t" (transferring) when qmaster has sent the job to an execution daemon. Once the daemon reports the job, the state switches to, e.g., "r" for running.
I have a problem with jobs that have run to completion: the program has ended, but the jobs remain in the qstat output with state "t".
The only solution I've come up with is to run qdel -f on the jobs (force option), because a plain qdel only registers the jobs for deletion but doesn't actually delete them.
This problem happens with only one host. Do you have any idea why the jobs never switched to "r"? Is this a problem with the execution daemon? How is the daemon connected to the hosts? Is it supposed to run on the hosts or on the master?
Thanks for your help,
I've started to figure out my problem by looking at the log under $SGE_ROOT/$SGE_CELL/spool/<hostname> on my host:
04/17/2012 13:09:20| main|myhost|W|can't register at qmaster "mymaster": abort qmaster registration due to communication errors
04/17/2012 13:09:20| main|myhost|E|commlib error: endpoint is not unique error (endpoint "myhost/execd/1" is already connected)
04/18/2012 13:33:44| main|myhost|W|can't register at qmaster "mymaster": abort qmaster registration due to communication errors
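
In case it helps anyone reading a similar log, here is a minimal Python sketch that groups the warnings and errors so that repeated messages stand out. It assumes each line has five pipe-separated fields (timestamp, thread, host, level, message), which is only inferred from the three lines quoted above. The function names `parse_execd_line` and `summarize` are made up for this illustration, and the sample data is just those same three lines.

```python
from collections import Counter

# The three log lines quoted above, used as sample input.
SAMPLE = """\
04/17/2012 13:09:20| main|myhost|W|can't register at qmaster "mymaster": abort qmaster registration due to communication errors
04/17/2012 13:09:20| main|myhost|E|commlib error: endpoint is not unique error (endpoint "myhost/execd/1" is already connected)
04/18/2012 13:33:44| main|myhost|W|can't register at qmaster "mymaster": abort qmaster registration due to communication errors
""".splitlines()


def parse_execd_line(line):
    """Split one log line into its five fields (assumed layout).

    Returns a dict, or None if the line does not have five fields.
    """
    # maxsplit=4 keeps any '|' inside the message text intact.
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    timestamp, thread, host, level, message = (p.strip() for p in parts)
    return {
        "timestamp": timestamp,
        "thread": thread,
        "host": host,
        "level": level,
        "message": message,
    }


def summarize(lines, levels=("W", "E")):
    """Count how often each (level, message) pair occurs."""
    counts = Counter()
    for line in lines:
        record = parse_execd_line(line)
        if record is not None and record["level"] in levels:
            counts[(record["level"], record["message"])] += 1
    return counts


if __name__ == "__main__":
    for (level, message), n in summarize(SAMPLE).most_common():
        print(f"{n}x [{level}] {message}")
```

To run it against a real log, replace `SAMPLE` with the lines read from the file on the affected host. The summary only shows which messages repeat; it does not by itself explain why the endpoint is reported as already connected.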