For the past couple of days I have not been able to submit jobs that combine a hard reservation for virtual_free with a parallel environment of more than 2 slots. I am talking about SMP jobs (not MPI) in the default PE "make".
So this does not work:
$ qsub -pe make 10 -l vf=20G <jobfile>
But these still work:
$ qsub -pe make 10 <jobfile>
$ qsub -l vf=20G <jobfile>
When it does not work, I get this failure:
Job 944087 cannot run in PE "make" because it only offers 0 slots
verification: no suitable queues.
Here is some of the relevant output:
[bbnof@gquest] ~ $ qconf -sq main
qtype BATCH INTERACTIVE
pe_list make openmpi
[bbnof@gquest] ~ $ qconf -sp make
Complex values are configured on each node; every node has 24 "hthreads", of which 22 can be used as slots (see the queue conf).
[bbnof@gquest] ~ $ qconf -se biogridn16
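(Output snipped; the relevant field is complex_values. On our nodes it looks roughly like this, with a made-up value, not a verbatim dump:)
complex_values        virtual_free=64G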
Also, when I check the grid for 40G (qhost -l vf=40G), I get lots of hosts back with these resources; when I check qstat for free slots, it is the same story: plenty of resources available.
Any input is welcome, as my users can no longer queue big-memory SMP jobs because of this...
Running on: SGE 6.2u5 (lx24-amd64) with RHEL 5.8
It seems that if I remove virtual_free from the hosts' complex values, the jobs get queued again.
However, virtual_free then no longer appears to be checked at all, because I can heavily oversubscribe a host this way.
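For example (hypothetical numbers): on a node with only 64G of RAM, all three of these start at once without complaint:
$ qsub -l vf=40G <jobfile>
$ qsub -l vf=40G <jobfile>
$ qsub -l vf=40G <jobfile>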
Update: I have almost figured this out...
When I run a job with a virtual_free reservation of 10G, it seems to consume 40G of the configured consumable virtual_free (so a factor of 4!).
I noticed this using "qhost -F -h <exec node>".
How is this possible? Is this some setting?