This discussion is archived
5 Replies Latest reply: Jun 11, 2012 9:02 AM by Todd Little

Deadlock when queues are empty and no messages are coming, 100% CPU consumption

zuchy Newbie
Hello Tuxedo users,
I am facing a problem with deadlocks on queues.
The queues are non-persistent and we have to get the messages off the queue as soon as possible, so we are running TMQFORWARD with -i -420 (a 420 ms pause between queue scans when the queue is empty).
The definition of the queue group and servers in the UBBCONFIG is:

Q_TP1          LMID=TuxA     GRPNO=119     TMSCOUNT=5     TMSNAME=TMS_QM          OPENINFO="TUXEDO/QM:/home/tux-st2/queues/QUEMSG_TP:QSPACE_TP_A"

TMQUEUE               SRVID=1230     MIN=1 MAX=3     SRVGRP=Q_TP1     CLOPT="-o logs/stdout -e logs/stderr -s QSPACE_TP_A,QSPACE_TP:TMQUEUE --"
TMQFORWARD     SRVID=1245     MIN=3 MAX=4 SRVGRP=Q_TP1      GRACE=0 RESTART=Y CONV=N MAXGEN=10 CLOPT="-o logs/stdout -e logs/stderr -- -i -420 -n -q TP_REQ"

The problem is that when no messages are coming in, a deadlock is sometimes detected (roughly once every 14 days), and from that moment on TMQUEUE, TMQFORWARD and TMS_QM consume 100% CPU forever and clients receive TPEBLOCK. The /WS client also periodically sends a signal to interrupt tpdequeue. The client enqueues a message to the request queue, where TMQFORWARD tries to forward it to the framework as soon as possible; the answer is put on a reply queue, on which the client is listening with tpdequeue and TPQWAIT.
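
For reference, the client-side dequeue looks roughly like this (a minimal sketch, not the real code; the STRING buffer type, the reply queue name TP_RES_1 and the missing signal handling are simplifications):

#include <stdio.h>
#include <string.h>
#include <atmi.h>

int main(void)
{
    TPQCTL ctl;
    char *buf;
    long len;

    /* join the application as a /WS client (WSNADDR is set in the environment) */
    if (tpinit(NULL) == -1) {
        fprintf(stderr, "tpinit failed: %s\n", tpstrerror(tperrno));
        return 1;
    }

    if ((buf = tpalloc("STRING", NULL, 1024)) == NULL) {
        fprintf(stderr, "tpalloc failed: %s\n", tpstrerror(tperrno));
        tpterm();
        return 1;
    }

    memset(&ctl, 0, sizeof(ctl));
    ctl.flags = TPQWAIT;   /* block until a reply message arrives */

    if (tpdequeue("QSPACE_TP_A", "TP_RES_1", &ctl, &buf, &len, 0) == -1)
        fprintf(stderr, "tpdequeue failed: %s\n", tpstrerror(tperrno));
    else
        printf("got reply: %s\n", buf);

    tpfree(buf);
    tpterm();
    return 0;
}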

This is the problematic log situation:

000646.tuxa!TMQUEUE.28248.1.0: gtrid x0 x4fc3912d x72b519: Q_CAT:1447: WARN: [Semaphore appears stuck - currently held by 28111]
000646.tuxa!TMQUEUE.28248.1.0: gtrid x0 x4fc3912d x72b519: : additional deadlock diagnostic (28248/0/670/4/4/14/-1/28111)
000646.tuxa!TMS_QM.28111.1.0: gtrid x0 x4fc3912d x72b519: Q_CAT:1447: WARN: [Semaphore appears stuck - currently held by 28248]
000646.tuxa!TMS_QM.28111.1.0: gtrid x0 x4fc3912d x72b519: : additional deadlock diagnostic (28111/0/670/5/14/4/-1/28248)
000646.tuxa!TMQFORWARD.1111.1.0: gtrid x0 x4fc3912d x72b51c: Q_CAT:1447: WARN: [Semaphore appears stuck - currently held by 28111]
000646.tuxa!TMQFORWARD.1111.1.0: gtrid x0 x4fc3912d x72b51c: : additional deadlock diagnostic (28248/0/670/4/4/14/-1/28111)
000646.tuxa!TMQFORWARD.835.1.0: gtrid x0 x4fc3912d x72b533: Q_CAT:1447: WARN: [Semaphore appears stuck - currently held by 28111]
000646.tuxa!TMQFORWARD.835.1.0: gtrid x0 x4fc3912d x72b533: : additional deadlock diagnostic (28248/0/670/4/4/14/-1/28111)
000646.tuxa!TMQFORWARD.1112.1.0: gtrid x0 x4fc3912d x72b531: Q_CAT:1447: WARN: [Semaphore appears stuck - currently held by 28111]
000646.tuxa!TMQFORWARD.1112.1.0: gtrid x0 x4fc3912d x72b531: : additional deadlock diagnostic (28248/0/670/4/4/14/-1/28111)
000648.tuxa!TMS_QM.28117.1.0: gtrid x0 x4fc3912d x72b32c: Q_CAT:1447: WARN: [Semaphore appears stuck - currently held by 28111]
000648.tuxa!TMS_QM.28117.1.0: gtrid x0 x4fc3912d x72b32c: : additional deadlock diagnostic (28248/0/670/4/4/14/-1/28111)
000658.tuxa!TMS_QM.27711.1.0: gtrid x0 x4fc3912d x72b1f9: Q_CAT:1447: WARN: [Semaphore appears stuck - currently held by 28111]
000658.tuxa!TMS_QM.27711.1.0: gtrid x0 x4fc3912d x72b1f9: : additional deadlock diagnostic (28248/0/670/4/4/14/-1/28111)
000706.tuxa!TMSYSEVT.15205.1.0: LIBTUX_CAT:1485: ERROR: .SysNetworkDropped: pfcs->TuxB connection dropped
000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72baca: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bacb: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bacc: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bacd: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bace: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bacf: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bad0: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bad1: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bad2: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bad3: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bad4: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bad5: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bad6: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bad7: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bad8: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bad9: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bada: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72badb: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72badc: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72badd: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bade: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72badf: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bae0: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bae1: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bae2: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bae3: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bae4: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bae5: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bae6: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bae7: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bae8: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72bae9: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72baea: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72baeb: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72baec: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3
000708.tuxa!WSH.8075.1.0: gtrid x0 x4fc3912d x72baed: WSNAT_CAT:1300: ERROR: tpcall() call for service QSPACE_TP_A failed, tperrno = 3

... (repeats indefinitely)


So, what can I do to avoid this situation, please?

Tuxedo 9.1 RP 092, 32-bit PA-RISC, running on ia64 HP-UX B.11.23

The queue creation script is:

./qcreateTP.sh 11 A


#!/bin/sh
# Usage: qcreateTP.sh <ipckey-prefix> <qspace-suffix>   (e.g. ./qcreateTP.sh 11 A)
# The queue space IPC key is the first argument with the numeric user id appended.
IPCKEY=$1`id -u`
echo IPCKEY = $IPCKEY

export QMCONFIG=$APPDIR/queues/QUEMSG_TP

# Destroy any existing queue space before recreating it.
qmadmin $QMCONFIG << END1
echo
qspds -f -y QSPACE_TP_$2
q
END1

rm $QMCONFIG
# Recreate the device list, the queue space and the individual queues.
qmadmin $QMCONFIG << END
echo
crdl $QMCONFIG 0 20483
qspacecreate -n 10240 -A 100
QSPACE_TP_$2
$IPCKEY
10240
7
600
50
20000
ERRORQ
y
16
qopen QSPACE_TP_$2
qcreate -d nonpersist
TP_REQ
fifo
none
0
0
80%
5%
sh $APPDIR/scripts/qfull.sh 80 QSPACE_TP_$2 $QMCONFIG TP_REQ
qcreate -d nonpersist -e +300
TP_RES_1
fifo
none
0
0
80%
5%
sh $APPDIR/scripts/qfull.sh 80 QSPACE_TP_$2 $QMCONFIG TP_RES_1
qcreate -d nonpersist -e +300
TP_RES_2
fifo
none
0
0
80%
5%
sh $APPDIR/scripts/qfull.sh 80 QSPACE_TP_$2 $QMCONFIG TP_RES_2
qcreate -d nonpersist -e +300
TP_RES_3
fifo
none
0
0
80%
5%
sh $APPDIR/scripts/qfull.sh 80 QSPACE_TP_$2 $QMCONFIG TP_RES_3
qcreate -d nonpersist -e +300
TP_RES_4
fifo
none
0
0
80%
5%
sh $APPDIR/scripts/qfull.sh 80 QSPACE_TP_$2 $QMCONFIG TP_RES_4
qcreate -d nonpersist -e +300
TP_RES_5
fifo
none
0
0
80%
5%
sh $APPDIR/scripts/qfull.sh 80 QSPACE_TP_$2 $QMCONFIG TP_RES_5
qcreate -d nonpersist -e +86400
ERRORQ
fifo
none
0
0
80%
0%
sh $APPDIR/scripts/qfull.sh 80 QSPACE_TP_$2 $QMCONFIG ERRORQ
q
END

Edited by: zuchy on 6.6.2012 2:03
  • 1. Re: Deadlock when queues are empty and no messages are coming, 100% CPU consumption
    Todd Little Expert
    Hi,

    Unfortunately the portions of the ULOG you posted don't give an indication of the process that is holding the semaphore. What is the type of system this is running on (architecture, cores, etc.)? Also, have you set a value for SPINCOUNT and if so, what is the value?

    Regards,
    Todd Little
    Oracle Tuxedo Chief Architect
  • 2. Re: Deadlock when queues are empty and no messages are coming, 100% CPU consumption
    zuchy Newbie
    I can't see anything more about the deadlock in the ULOG; the second machine showed the same behaviour:

    013939.tuxb!TMQUEUE.1649.1.0: gtrid x10001 x4fbf6aa7 x8e666a:     Q_CAT:1447: WARN: [Semaphore appears stuck - currently held by 1646]
    013939.tuxb!TMS_QM.1646.1.0: gtrid x10001 x4fbf6aa7 x8e666a:     Q_CAT:1447: WARN: [Semaphore appears stuck - currently held by 1649]
    013939.tuxb!TMQUEUE.1649.1.0: gtrid x10001 x4fbf6aa7 x8e666a:     : additional deadlock diagnostic (1649/0/670/4/4/18/-1/1646)
    013939.tuxb!TMS_QM.1646.1.0: gtrid x10001 x4fbf6aa7 x8e666a:     : additional deadlock diagnostic (1646/0/670/5/18/4/-1/1649)
    013939.tuxb!TMQFORWARD.1650.1.0: gtrid x10001 x4fbf6aa7 x8e66b4:     Q_CAT:1447: WARN: [Semaphore appears stuck - currently held by 1646]
    013939.tuxb!TMQFORWARD.1650.1.0: gtrid x10001 x4fbf6aa7 x8e66b4:     : additional deadlock diagnostic (1649/0/670/4/4/18/-1/1646)
    013939.tuxb!TMQFORWARD.1652.1.0: gtrid x10001 x4fbf6aa7 x8e66b2:     Q_CAT:1447: WARN: [Semaphore appears stuck - currently held by 1646]
    013939.tuxb!TMQFORWARD.1652.1.0: gtrid x10001 x4fbf6aa7 x8e66b2:     : additional deadlock diagnostic (1649/0/670/4/4/18/-1/1646)
    013939.tuxb!TMQFORWARD.1651.1.0: gtrid x10001 x4fbf6aa7 x8e66b3:     Q_CAT:1447: WARN: [Semaphore appears stuck - currently held by 1646]
    013939.tuxb!TMQFORWARD.1651.1.0: gtrid x10001 x4fbf6aa7 x8e66b3:     : additional deadlock diagnostic (1649/0/670/4/4/18/-1/1646)
    013945.tuxb!TMS_QM.1627.1.0: gtrid x10001 x4fbf6aa7 x8e5c38:     Q_CAT:1447: WARN: [Semaphore appears stuck - currently held by 1646]
    013945.tuxb!TMS_QM.1627.1.0: gtrid x10001 x4fbf6aa7 x8e5c38:     : additional deadlock diagnostic (1649/0/670/4/4/18/-1/1646)

    The machine info is here:

    machinfo
    CPU info:
    Number of CPUs = 4
    Number of enabled CPUs = 4
    Number of enabled sockets = 2
    Cores per socket = 2
    Clock speed = 1415 MHz
    Bus speed = 532 MT/s
    CPUID registers
    vendor information = "GenuineIntel"
    processor serial number = 0x0000000000000000
    processor version info = 0x0000000020000704
    architecture revision: 0
    processor family: 32 Intel(R) Itanium 2 9000 series
    processor model: 0 Intel(R) Itanium 2 9000 series
    Bus features
    implemented = 0xbdf0000020000000
    selected = 0x0000000000000000
    processor revision: 7 Stepping C2
    largest CPUID reg: 4
    processor capabilities = 0x0000000000000005
    implements long branch: 1
    implements 16-byte atomic operations: 1

    Cache info (per core):
    L1 Instruction: size = 16 KB, associativity = 4
    L1 Data: size = 16 KB, associativity = 4
    L2 Instruction: size = 1024 KB, associativity = 8
    L2 Data: size = 256 KB, associativity = 8
    L3 Unified: size = 6144 KB, associativity = 6

    Memory = 20456 MB (19.976562 GB)

    Firmware info:
    Firmware revision = 04.11
    FP SWA driver revision: 1.18
    IPMI is supported on this system.
    ERROR: Unable to obtain manageability firmware revision info.

    Platform info:
    model string = "ia64 hp server rx2660"

    OS info:
    sysname = HP-UX
    release = B.11.23
    version = U (unlimited-user license)
    machine = ia64
    vmunix releaseversion:
    @(#) $Revision: vmunix: B11.23_LR FLAVOR=perf Fri Aug 29 22:35:38 PDT 2003 $


    SPINCOUNT is set to 5000 on each machine:

    tmunloadcf:
    SPINCOUNT=5000

    I can send the whole ULOG by email or upload it somewhere if you let me know where.

    The queue is also used from the /WS client (tpenqueue, tpdequeue).


    thank you,
    jan
  • 3. Re: Deadlock when queues are empty and no messages are coming, 100% CPU consumption
    695627 Newbie
    Hi, Zuchy:
    Please try setting the environment variable TM_QM_NAPTIME=5000 for TMQUEUE and TMQFORWARD.
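
    One way to scope it to just the queue servers (a sketch only; the env-file path is an example, the rest is copied from your existing entries) is an ENVFILE on their *SERVERS entries:

        # /home/tux-st2/env/qm.env  (example path) -- referenced by both servers below
        TM_QM_NAPTIME=5000

        TMQUEUE    SRVID=1230 MIN=1 MAX=3 SRVGRP=Q_TP1 ENVFILE="/home/tux-st2/env/qm.env" CLOPT="-o logs/stdout -e logs/stderr -s QSPACE_TP_A,QSPACE_TP:TMQUEUE --"
        TMQFORWARD SRVID=1245 MIN=3 MAX=4 SRVGRP=Q_TP1 GRACE=0 RESTART=Y CONV=N MAXGEN=10 ENVFILE="/home/tux-st2/env/qm.env" CLOPT="-o logs/stdout -e logs/stderr -- -i -420 -n -q TP_REQ"
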
    Regards.
    /Pierce
  • 4. Re: Deadlock when queues are empty and no messages are coming, 100% CPU consumption
    zuchy Newbie
    Hi Pierce,
    Thank you, that seems to be an interesting switch; I'll try it.
    Does anybody know which values of SPINCOUNT and TM_QM_NAPTIME work best, or what the right values would be for the machine whose info I posted?
    I guess 5000? Is there any correlation with processor type, speed, or number of cores?
    Thank you
    Jan
  • 5. Re: Deadlock when queues are empty and no messages are coming, 100% CPU consumption
    Todd Little Expert
    Hi Jan,

    Unfortunately there is no magic value for SPINCOUNT as it largely depends upon the workload. SPINCOUNT determines how many times Tuxedo will loop trying to obtain a user mode semaphore before resorting to a kernel mode semaphore. Since the amount of time a lock is held and contended for depends largely on what the application is doing, it's virtually impossible to pick a specific value. In general the more cores or CPUs in a system, the higher the value of SPINCOUNT should be, but that is just a very general statement. Values from 5,000 to 100,000 are all reasonable and the only way to determine the optimal value is by systematic testing of the application under load with various values for SPINCOUNT.
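
    For reference, SPINCOUNT lives in the *MACHINES section of the UBBCONFIG, so you can vary it there between test runs, e.g. (fragment only; the machine name is a placeholder and the other required *MACHINES parameters are omitted):

        *MACHINES
        # trial value -- re-measure the application under load after each change
        "tuxa"    LMID=TuxA
                  SPINCOUNT=50000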

    Regards,
    Todd Little
    Oracle Tuxedo Chief Architect
