3 Replies Latest reply on Apr 24, 2012 9:51 AM by 932789

    Status running job

    919768
      Hi

      I'm wondering about the different statuses shown under running jobs in qmon.
      I have looked in the documentation and the help page, but I can't find the job status "t".
      I'm guessing it means "terminate", since my jobs over InfiniBand currently don't work ^^ but if it is something else it would be good to know.

      Thanks for the help.
      Best regards
      Edvin
        • 1. Re: Status running job
          829846
          Hi Edvin,

          A job has the status "t" (transferring) when qmaster has sent the job to an execution daemon. Once the daemon reports the job back, the state switches to, e.g., "r" for running.

          Regards,
          Christian
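
          To spot jobs that stay in the transfer state, the state column of qstat can be checked with a small script. This is only a sketch: it assumes the default plain-text qstat layout, where the state is the fifth whitespace-separated column, and the sample rows (job IDs, names, users, queues) are made up for illustration.

```python
def jobs_in_transfer(qstat_text):
    """Return the IDs of jobs whose state column contains a lowercase "t"."""
    job_ids = []
    for line in qstat_text.splitlines():
        fields = line.split()
        # Skip the header, the dashed separator, and blank lines.
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        # A combined state such as "dt" (deletion requested while still
        # in transfer) also counts, so test for the letter, not equality.
        if "t" in fields[4]:
            job_ids.append(fields[0])
    return job_ids


# Hypothetical qstat output, assumed to follow the default column order.
SAMPLE = """\
job-ID  prior   name   user   state submit/start at     queue            slots ja-task-ID
------------------------------------------------------------------------------------------
    101 0.55500 jobA   user1  t     04/17/2012 13:09:20 all.q@myhost         1
    102 0.55500 jobB   user1  r     04/17/2012 13:10:02 all.q@otherhost      1
    103 0.55500 jobC   user1  dt    04/17/2012 13:11:45 all.q@myhost         1
"""

stuck = jobs_in_transfer(SAMPLE)
print(stuck)
```

          In practice the text would come from running qstat rather than from a string constant.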
          • 2. Re: Status running job
            932789
            Hi Christian,

            I have a problem with jobs that have run to completion: the program has ended, but the jobs remain in the qstat output with state "t".
            The only solution I've come up with is to run qdel -f on the jobs (force option), because a plain qdel only registers the jobs for deletion and doesn't actually delete them.

            This problem happens on only one host. Do you have any idea why the jobs never switched to "r"? Is this a problem with the execution daemon? How is the daemon connected to the hosts? Is it supposed to run on the hosts or on the master?

            Thanks for your help,
            Isabelle
            • 3. Re: Status running job
              932789
              I've started to figure out my problem by looking at the log under $SGE_ROOT/$SGE_CELL/spool/<hostname> on my host:

              04/17/2012 13:09:20| main|myhost|W|can't register at qmaster "mymaster": abort qmaster registration due to communication errors
              04/17/2012 13:09:20| main|myhost|E|commlib error: endpoint is not unique error (endpoint "myhost/execd/1" is already connected)
              04/18/2012 13:33:44| main|myhost|W|can't register at qmaster "mymaster": abort qmaster registration due to communication errors
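
              Lines like these can be split on the "|" separator to pull out the severity and message. The sketch below assumes every entry has the five fields seen above (timestamp, thread, host, level, message); the names LogEntry and parse_line are just illustrative helpers, not part of Grid Engine.

```python
from typing import NamedTuple, Optional


class LogEntry(NamedTuple):
    timestamp: str
    thread: str
    host: str
    level: str   # "W" and "E" appear in the excerpt above
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Split one log line into its five "|"-separated fields."""
    # Limit the split to 4 so a "|" inside the message is preserved.
    parts = line.strip().split("|", 4)
    if len(parts) != 5:
        return None
    return LogEntry(*(p.strip() for p in parts))


# The three lines quoted in the post above.
LOG = """\
04/17/2012 13:09:20| main|myhost|W|can't register at qmaster "mymaster": abort qmaster registration due to communication errors
04/17/2012 13:09:20| main|myhost|E|commlib error: endpoint is not unique error (endpoint "myhost/execd/1" is already connected)
04/18/2012 13:33:44| main|myhost|W|can't register at qmaster "mymaster": abort qmaster registration due to communication errors
"""

entries = [e for e in map(parse_line, LOG.splitlines()) if e is not None]
errors = [e for e in entries if e.level == "E"]

for e in errors:
    print(e.timestamp, e.host, e.message)
```

              Filtering on level "E" isolates the commlib message about the "myhost/execd/1" endpoint already being connected.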