5 Replies. Latest reply: Jun 11, 2012 11:02 AM by Todd Little-Oracle

    deadlock when queues are empty and no messages are coming, 100% CPU consumption

    zuchy
      Hello Tuxedo users,
      I am facing a problem with deadlocks on queues.
      The queues are non-persistent and we have to get the messages off the queue as soon as possible,
      so we run tmqforward with -i -420 (i.e. a 420 ms delay between queue scans when the queue is empty).
      The queue definition in the UBBCONFIG is:

      Q_TP1          LMID=TuxA     GRPNO=119     TMSCOUNT=5     TMSNAME=TMS_QM          OPENINFO="TUXEDO/QM:/home/tux-st2/queues/QUEMSG_TP:QSPACE_TP_A"

      TMQUEUE               SRVID=1230     MIN=1 MAX=3     SRVGRP=Q_TP1     CLOPT="-o logs/stdout -e logs/stderr -s QSPACE_TP_A,QSPACE_TP:TMQUEUE --"
      TMQFORWARD     SRVID=1245     MIN=3 MAX=4 SRVGRP=Q_TP1      GRACE=0 RESTART=Y CONV=N MAXGEN=10 CLOPT="-o logs/stdout -e logs/stderr -- -i -420 -n -q TP_REQ"

      The problem is that when no messages are coming in, a deadlock is sometimes detected (roughly once every 14 days), and from that moment on TMQUEUE, TMQFORWARD and TMS_QM consume 100% CPU forever while the client receives TPEBLOCK. The /WS client also periodically sends a signal to interrupt its tpdequeue. The client enqueues a message to the request queue; TMQFORWARD tries to forward it to the framework as quickly as possible, and the answer is inserted into a reply queue, on which the client is waiting with tpdequeue and TPQWAIT. (A rough sketch of the client side follows.)
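
      For clarity, the client side looks roughly like this. It is only a minimal sketch, not our production code: the STRING buffer, the fixed reply queue TP_RES_1 and the missing correlation-id handling are simplifications, and the queue space and queue names are the ones from the configuration in this post.

      #include <stdio.h>
      #include <string.h>
      #include <atmi.h>   /* ATMI and /Q declarations: tpenqueue, tpdequeue, TPQCTL */

      int main(void)
      {
          TPQCTL qc;
          char *buf;
          long len;

          if (tpinit(NULL) == -1) {                     /* join the application */
              fprintf(stderr, "tpinit failed: %s\n", tpstrerror(tperrno));
              return 1;
          }

          if ((buf = tpalloc("STRING", NULL, 64)) == NULL) {
              fprintf(stderr, "tpalloc failed: %s\n", tpstrerror(tperrno));
              tpterm();
              return 1;
          }
          strcpy(buf, "request payload");

          /* Enqueue the request; TMQFORWARD picks it up from TP_REQ. */
          memset(&qc, 0, sizeof(qc));
          qc.flags = 0;                                 /* no special enqueue options */
          if (tpenqueue("QSPACE_TP_A", "TP_REQ", &qc, buf, 0, 0) == -1)
              fprintf(stderr, "tpenqueue failed: %s\n", tpstrerror(tperrno));

          /* Wait on the reply queue; TPQWAIT makes tpdequeue block until a
           * message arrives.  In our setup the /WS client periodically sends a
           * signal to interrupt this wait, and during the stuck periods the
           * call comes back with TPEBLOCK instead of a message. */
          memset(&qc, 0, sizeof(qc));
          qc.flags = TPQWAIT;
          len = 0;
          if (tpdequeue("QSPACE_TP_A", "TP_RES_1", &qc, &buf, &len, 0) == -1)
              fprintf(stderr, "tpdequeue failed: %s (diagnostic %ld)\n",
                      tpstrerror(tperrno), qc.diagnostic);
          else
              printf("got reply: %s\n", buf);

          tpfree(buf);
          tpterm();
          return 0;
      }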

      Here is the log from the problematic situation:

      000646.tuxa!TMQUEUE.28248.1.0: gtrid x0 x4fc3912d x72b519: Q_CAT:1447: WARN: [Semaphore appears stuck - currently held by 28111]
      000646.tuxa!TMQUEUE.28248.1.0: gtrid x0 x4fc3912d x72b519: : additional deadlock diagnostic (28248/0/670/4/4/14/-1/28111)
      000646.tuxa!TMS_QM.28111.1.0: gtrid x0 x4fc3912d x72b519: Q_CAT:1447: WARN: [Semaphore appears stuck - currently held by 28248]
      000646.tuxa!TMS_QM.28111.1.0: gtrid x0 x4fc3912d x72b519: : additional deadlock diagnostic (28111/0/670/5/14/4/-1/28248)
      000646.tuxa!TMQFORWARD.1111.1.0: gtrid x0 x4fc3912d x72b51c: Q_CAT:1447: WARN: [Semaphore appears stuck - currently held by 28111]
      000646.tuxa!TMQFORWARD.1111.1.0: gtrid x0 x4fc3912d x72b51c: : additional deadlock diagnostic (28248/0/670/4/4/14/-1/28111)
      000646.tuxa!TMQFORWARD.835.1.0: gtrid x0 x4fc3912d x72b533: Q_CAT:1447: WARN: [Semaphore appears stuck - currently held by 28111]
      000646.tuxa!TMQFORWARD.835.1.0: gtrid x0 x4fc3912d x72b533: : additional deadlock diagnostic (28248/0/670/4/4/14/-1/28111)
      000646.tuxa!TMQFORWARD.1112.1.0: gtrid x0 x4fc3912d x72b531: Q_CAT:1447: WARN: [Semaphore appears stuck - currently held by 28111]
      000646.tuxa!TMQFORWARD.1112.1.0: gtrid x0 x4fc3912d x72b531: : additional deadlock diagnostic (28248/0/670/4/4/14/-1/28111)
      000648.tuxa!TMS_QM.28117.1.0: gtrid x0 x4fc3912d x72b32c: Q_CAT:1447: WARN: [Semaphore appears stuck - currently held by 28111]
      000648.tuxa!TMS_QM.28117.1.0: gtrid x0 x4fc3912d x72b32c: : additional deadlock diagnostic (28248/0/670/4/4/14/-1/28111)
      000658.tuxa!TMS_QM.27711.1.0: gtrid x0 x4fc3912d x72b1f9: Q_CAT:1447: WARN: [Semaphore appears stuck - currently held by 28111]
      000658.tuxa!TMS_QM.27711.1.0: gtrid x0 x4fc3912d x72b1f9: : additional deadlock diagnostic (28248/0/670/4/4/14/-1/28111)
      000706.tuxa!TMSYSEVT.15205.1.0: LIBTUX_CAT:1485: ERROR: .SysNetworkDropped: pfcs->TuxB connection dropped
      000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72baca: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
      000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bacb: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
      000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bacc: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
      000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bacd: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
      000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bace: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
      000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bacf: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
      000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bad0: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
      000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bad1: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
      000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bad2: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
      000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bad3: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
      000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bad4: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
      000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bad5: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
      000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bad6: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
      000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bad7: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
      000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bad8: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
      000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bad9: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
      000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bada: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
      000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72badb: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
      000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72badc: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
      000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72badd: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
      000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bade: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
      000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72badf: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
      000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bae0: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
      000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bae1: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
      000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bae2: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
      000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bae3: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
      000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bae4: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
      000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bae5: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
      000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bae6: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
      000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bae7: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
      000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bae8: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
      000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bae9: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
      000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72baea: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
      000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72baeb: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
      000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72baec: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
      000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72baed: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3

      ... and so on, indefinitely.


      So, what can I do to avoid this situation, please?

      Tuxedo 9.1 RP 092, 32-bit PA-RISC binaries on HP-UX B.11.23 (ia64)

      The queue creation script is shown below (the bare values after qspacecreate and qcreate are answers to qmadmin's interactive prompts). It is invoked as:

      ./qcreateTP.sh 11 A


      #!/bin/sh
      IPCKEY=$1`id -u`
      echo IPCKEY = $IPCKEY

      export QMCONFIG=$APPDIR/queues/QUEMSG_TP
      qmadmin $QMCONFIG << END1
      echo
      qspds -f -y QSPACE_TP_$2
      q
      END1

      rm $QMCONFIG
      qmadmin $QMCONFIG << END
      echo
      crdl $QMCONFIG 0 20483
      qspacecreate -n 10240 -A 100
      QSPACE_TP_$2
      $IPCKEY
      10240
      7
      600
      50
      20000
      ERRORQ
      y
      16
      qopen QSPACE_TP_$2
      qcreate -d nonpersist
      TP_REQ
      fifo
      none
      0
      0
      80%
      5%
      sh $APPDIR/scripts/qfull.sh 80 QSPACE_TP_$2 $QMCONFIG TP_REQ
      qcreate -d nonpersist -e +300
      TP_RES_1
      fifo
      none
      0
      0
      80%
      5%
      sh $APPDIR/scripts/qfull.sh 80 QSPACE_TP_$2 $QMCONFIG TP_RES_1
      qcreate -d nonpersist -e +300
      TP_RES_2
      fifo
      none
      0
      0
      80%
      5%
      sh $APPDIR/scripts/qfull.sh 80 QSPACE_TP_$2 $QMCONFIG TP_RES_2
      qcreate -d nonpersist -e +300
      TP_RES_3
      fifo
      none
      0
      0
      80%
      5%
      sh $APPDIR/scripts/qfull.sh 80 QSPACE_TP_$2 $QMCONFIG TP_RES_3
      qcreate -d nonpersist -e +300
      TP_RES_4
      fifo
      none
      0
      0
      80%
      5%
      sh $APPDIR/scripts/qfull.sh 80 QSPACE_TP_$2 $QMCONFIG TP_RES_4
      qcreate -d nonpersist -e +300
      TP_RES_5
      fifo
      none
      0
      0
      80%
      5%
      sh $APPDIR/scripts/qfull.sh 80 QSPACE_TP_$2 $QMCONFIG TP_RES_5
      qcreate -d nonpersist -e +86400
      ERRORQ
      fifo
      none
      0
      0
      80%
      0%
      sh $APPDIR/scripts/qfull.sh 80 QSPACE_TP_$2 $QMCONFIG ERROR
      q
      END

        • 1. Re: deadlock when queues are empty and no messages are coming, 100% CPU consumption
          Todd Little-Oracle
          Hi,

          Unfortunately the portions of the ULOG you posted don't give an indication of the process that is holding the semaphore. What is the type of system this is running on (architecture, cores, etc.)? Also, have you set a value for SPINCOUNT and if so, what is the value?

          Regards,
          Todd Little
          Oracle Tuxedo Chief Architect
          • 2. Re: deadlock when queues are empty and no messages are coming, 100% CPU consumption
            zuchy
            I can't see anything more about the deadlock in the ULOG; on the second machine there was the same behaviour:

            013939.tuxb!TMQUEUE.1649.1.0: gtrid x10001 x4fbf6aa7 x8e666a:     Q_CAT:1447: WARN: [Semaphore appears stuck - currently held by 1646]
            013939.tuxb!TMS_QM.1646.1.0: gtrid x10001 x4fbf6aa7 x8e666a:     Q_CAT:1447: WARN: [Semaphore appears stuck - currently held by 1649]
            013939.tuxb!TMQUEUE.1649.1.0: gtrid x10001 x4fbf6aa7 x8e666a:     : additional deadlock diagnostic (1649/0/670/4/4/18/-1/1646)
            013939.tuxb!TMS_QM.1646.1.0: gtrid x10001 x4fbf6aa7 x8e666a:     : additional deadlock diagnostic (1646/0/670/5/18/4/-1/1649)
            013939.tuxb!TMQFORWARD.1650.1.0: gtrid x10001 x4fbf6aa7 x8e66b4:     Q_CAT:1447: WARN: [Semaphore appears stuck - currently held by 1646]
            013939.tuxb!TMQFORWARD.1650.1.0: gtrid x10001 x4fbf6aa7 x8e66b4:     : additional deadlock diagnostic (1649/0/670/4/4/18/-1/1646)
            013939.tuxb!TMQFORWARD.1652.1.0: gtrid x10001 x4fbf6aa7 x8e66b2:     Q_CAT:1447: WARN: [Semaphore appears stuck - currently held by 1646]
            013939.tuxb!TMQFORWARD.1652.1.0: gtrid x10001 x4fbf6aa7 x8e66b2:     : additional deadlock diagnostic (1649/0/670/4/4/18/-1/1646)
            013939.tuxb!TMQFORWARD.1651.1.0: gtrid x10001 x4fbf6aa7 x8e66b3:     Q_CAT:1447: WARN: [Semaphore appears stuck - currently held by 1646]
            013939.tuxb!TMQFORWARD.1651.1.0: gtrid x10001 x4fbf6aa7 x8e66b3:     : additional deadlock diagnostic (1649/0/670/4/4/18/-1/1646)
            013945.tuxb!TMS_QM.1627.1.0: gtrid x10001 x4fbf6aa7 x8e5c38:     Q_CAT:1447: WARN: [Semaphore appears stuck - currently held by 1646]
            013945.tuxb!TMS_QM.1627.1.0: gtrid x10001 x4fbf6aa7 x8e5c38:     : additional deadlock diagnostic (1649/0/670/4/4/18/-1/1646)

            The machine info is here:

            machinfo
            CPU info:
            Number of CPUs = 4
            Number of enabled CPUs = 4
            Number of enabled sockets = 2
            Cores per socket = 2
            Clock speed = 1415 MHz
            Bus speed = 532 MT/s
            CPUID registers
            vendor information = "GenuineIntel"
            processor serial number = 0x0000000000000000
            processor version info = 0x0000000020000704
            architecture revision: 0
            processor family: 32 Intel(R) Itanium 2 9000 series
            processor model: 0 Intel(R) Itanium 2 9000 series
            Bus features
            implemented = 0xbdf0000020000000
            selected = 0x0000000000000000
            processor revision: 7 Stepping C2
            largest CPUID reg: 4
            processor capabilities = 0x0000000000000005
            implements long branch: 1
            implements 16-byte atomic operations: 1

            Cache info (per core):
            L1 Instruction: size = 16 KB, associativity = 4
            L1 Data: size = 16 KB, associativity = 4
            L2 Instruction: size = 1024 KB, associativity = 8
            L2 Data: size = 256 KB, associativity = 8
            L3 Unified: size = 6144 KB, associativity = 6

            Memory = 20456 MB (19.976562 GB)

            Firmware info:
            Firmware revision = 04.11
            FP SWA driver revision: 1.18
            IPMI is supported on this system.
            ERROR: Unable to obtain manageability firmware revision info.

            Platform info:
            model string = "ia64 hp server rx2660"

            OS info:
            sysname = HP-UX
            release = B.11.23
            version = U (unlimited-user license)
            machine = ia64
            vmunix releaseversion:
            @(#) $Revision: vmunix: B11.23_LR FLAVOR=perf Fri Aug 29 22:35:38 PDT 2003 $


            SPINCOUNT is set to 5000 on each machine:

            tmunloadcf:
            SPINCOUNT=5000

            I can send the whole ULOG by email, or upload it somewhere if you tell me where.

            The queue is also used by the /WS client (tpenqueue, tpdequeue).


            thank you,
            jan
            • 3. Re: deadlock when queues are empty and no messages are coming, 100% CPU consumption
              695627
              Hi, Zuchy:
              Please try to set the environment variable TM_QM_NAPTIME=5000 for TMQUEUE and TMQFORWARD (one way to do this is sketched below).
              Regards.
              /Pierce
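
              For later readers, one way to set this variable only for the /Q servers is an ENVFILE on their *SERVERS entries. This is only a sketch: the file path /home/tux-st2/qm.env is an example, and the value 5000 is taken from the suggestion above. The env file would contain just the one line:

              TM_QM_NAPTIME=5000

              and the UBBCONFIG server entries would reference it:

              TMQUEUE               SRVID=1230     MIN=1 MAX=3     SRVGRP=Q_TP1     ENVFILE="/home/tux-st2/qm.env"     CLOPT="-o logs/stdout -e logs/stderr -s QSPACE_TP_A,QSPACE_TP:TMQUEUE --"
              TMQFORWARD     SRVID=1245     MIN=3 MAX=4 SRVGRP=Q_TP1      GRACE=0 RESTART=Y CONV=N MAXGEN=10 ENVFILE="/home/tux-st2/qm.env"     CLOPT="-o logs/stdout -e logs/stderr -- -i -420 -n -q TP_REQ"

              (tmloadcf has to be rerun and the servers restarted for the change to take effect.)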
              • 4. Re: deadlock when queues are empty and no messages are coming, 100% CPU consumption
                zuchy
                Hi Pierce,
                 thank you, that seems to be an interesting switch; I'll try it.
                 Does anybody know which values of SPINCOUNT and TM_QM_NAPTIME work best, or what the correct values are for the machine whose info I posted?
                 I guess 5000? Is there any correlation with processor type, speed, or number of cores?
                thank you
                Jan
                • 5. Re: deadlock when queues are empty and no messages are coming, 100% CPU consumption
                  Todd Little-Oracle
                  Hi Jan,

                  Unfortunately there is no magic value for SPINCOUNT as it largely depends upon the workload. SPINCOUNT determines how many times Tuxedo will loop trying to obtain a user mode semaphore before resorting to a kernel mode semaphore. Since the amount of time a lock is held and contended for depends largely on what the application is doing, it's virtually impossible to pick a specific value. In general the more cores or CPUs in a system, the higher the value of SPINCOUNT should be, but that is just a very general statement. Values from 5,000 to 100,000 are all reasonable and the only way to determine the optimal value is by systematic testing of the application under load with various values for SPINCOUNT. (A rough example of such a sweep is sketched at the end of this post.)

                  Regards,
                  Todd Little
                  Oracle Tuxedo Chief Architect
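
                  To make that systematic testing concrete, a crude sweep could look like the script below. This is only a sketch: ubbconfig.tmpl and run_load.sh are hypothetical placeholders for the real UBB file and whatever load driver the application uses, and TUXCONFIG is assumed to be set in the environment.

                  #!/bin/sh
                  # Crude SPINCOUNT sweep: rebuild TUXCONFIG and rerun a load test for each value.
                  for sc in 5000 20000 50000 100000
                  do
                      sed "s/SPINCOUNT=[0-9]*/SPINCOUNT=$sc/" ubbconfig.tmpl > ubbconfig.run
                      tmloadcf -y ubbconfig.run            # regenerate TUXCONFIG
                      tmboot -y                            # boot the application
                      ./run_load.sh > load_result_$sc.log  # run the load test (placeholder)
                      tmshutdown -y                        # shut down before trying the next value
                  done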