1 2 Previous Next 16 Replies Latest reply on Sep 30, 2011 8:25 AM by 695627

    WSNAT_CAT:1042 and/or WSNAT_CAT:1107 causing unresponsive WSL?

    Monty77
      Afternoon all!

      Our applciation is showing the following error on client side:

      [::ReadFields] Tuxedo Service call unsuccessfull [CMHouseService::TuxCallService] tpcall message send failed.
      Service= ViewDevices tperror=13 TPETIME= timeout occured,
      WSNADDR=//172.31.199.8:12001 Tuxedo Service failure

      ..everything is fine for several hours, I check the number of client connections to WSL using ud32 script supplied on this forum before and it is well below max set in tuxconfig.

      MAXWSCLIENTS is set to 1000

      WSL entry looks like:

      *GROUPS
      "WSL2" LMID="vnp6apps" GRPNO=402
      TMSCOUNT=3

      *SERVERS
      "WSL" SRVGRP="WSL2" SRVID=102
      CLOPT="-A -t -- -n //172.31.199.8:12002 -M 9 -T 60 -I 600 -m 1 -x 50 -w WSH"
      RQPERM=0666 REPLYQ=N RPPERM=0666 MIN=1 MAX=1 CONV=N
      SYSTEM_ACCESS=FASTPATH
      MAXGEN=1 GRACE=86400 RESTART=N
      MINDISPATCHTHREADS=0 MAXDISPATCHTHREADS=1 THREADSTACKSIZE=0
      SICACHEENTRIESMAX="500"

      At the time of the crash there are ~200 client connections, sometimes less, well below the threshold to the best of my understanding. Sometimes tmadmin becomes unresponsive too.

      Server side ULOG doesn't show anything useful at the time of the crash but shows a LOT of these:

      231907.vnp6apps!WSH.434530.1.0: WSNAT_CAT:1042: ERROR: tpcall() call failed, tperrno = 6
      232504.vnp6apps!WSH.496036.1.0: WSNAT_CAT:1042: ERROR: tpcall() call failed, tperrno = 6
      233225.vnp6apps!WSH.434530.1.0: WSNAT_CAT:1042: ERROR: tpcall() call failed, tperrno = 6
      233516.vnp6apps!WSH.496036.1.0: WSNAT_CAT:1042: ERROR: tpcall() call failed, tperrno = 6
      234501.vnp6apps!WSH.3461178.1.0: WSNAT_CAT:1042: ERROR: tpcall() call failed, tperrno = 6
      234614.vnp6apps!WSH.2286046.1.0: WSNAT_CAT:1042: ERROR: tpcall() call failed, tperrno = 6
      234654.vnp6apps!WSH.3461178.1.0: WSNAT_CAT:1042: ERROR: tpcall() call failed, tperrno = 6
      234736.vnp6apps!WSH.1937458.1.0: WSNAT_CAT:1042: ERROR: tpcall() call failed, tperrno = 6

      ..and I found these too around the time of the last crash:

      225434.vnp6apps!WSH.3768660.1.0: WSNAT_CAT:1107: ERROR: Error receiving messages, tperrno = TPEOS, Uunixerr = UMSGRCV

      A complete shutdown / restart of tux seems to cure this issue, but it reoccurs down the line.

      Tuxedo version is 9.1 running AIX 5.3

      Any suggestions massively appreciated!
      Thanks,
      Adam
        • 1. Re: WSNAT_CAT:1042 and/or WSNAT_CAT:1107 causing unresponsive WSL?
          Todd Little-Oracle
          Hi Monty,

          What exactly "crashed"? The client received a timeout error (TPETIME) because a request it made timed out (most likely due to not completing before BLOCKTIME was exceeded.

          The tperrno=6 entries in the ULOG indicate that a client made a request to a service that wasn't available. Usually this occurs when there are no servers available offering the service. Can you provide a bit more of your ULOG, especially the entries around the time you got this line:

          225434.vnp6apps!WSH.3768660.1.0: WSNAT_CAT:1107: ERROR: Error receiving messages, tperrno = TPEOS, Uunixerr = UMSGRCV

          That would help in trying to identify what might have gone wrong.

          Offhand I might guess that your System V IPC settings are too low, but that is just guess.

          Regards,
          Todd Little
          Oracle Tuxedo Chief Architect
          • 2. Re: WSNAT_CAT:1042 and/or WSNAT_CAT:1107 causing unresponsive WSL?
            Monty77
            Hi Todd,

            The log files were so huge I couldn't vi them so I trimmed some down and found a lot of these:

            013657.vnp6apps!?proc.4276532.1.0: gtrid x0 x4e3e5705 xa1be9: LIBTUX_CAT:397: ERROR: Cannot allocate entry on GTT
            013657.vnp6apps!?proc.4276532.1.0: gtrid x0 x4e3e5705 xa1bea: LIBTUX_CAT:328: ERROR: No space in Bulletin Board for Transaction T
            able
            013657.vnp6apps!?proc.4276532.1.0: gtrid x0 x4e3e5705 xa1bea: LIBTUX_CAT:397: ERROR: Cannot allocate entry on GTT
            013657.vnp6apps!?proc.4276532.1.0: gtrid x0 x4e3e5705 xa1beb: LIBTUX_CAT:328: ERROR: No space in Bulletin Board for Transaction T
            able

            and these:

            011107.vnp6apps!TMQUEUE.2105430.1.0: gtrid x0 x4e3e5705 x99c6f: LIBTUX_CAT:481: ERROR: Service xa_start returned XAER_RMERR
            011107.vnp6apps!TMQUEUE.2105430.1.0: gtrid x0 x4e3e5705 x99c70: Q_CAT:1472: ERROR: xa_start() - no more tran table entries (max = 1
            0)
            011107.vnp6apps!TMQUEUE.2105430.1.0: gtrid x0 x4e3e5705 x99c70: LIBTUX_CAT:481: ERROR: Service xa_start returned XAER_RMERR
            011107.vnp6apps!TMQUEUE.2105430.1.0: gtrid x0 x4e3e5705 x99c71: Q_CAT:1472: ERROR: xa_start() - no more tran table entries (max = 1
            0)
            011107.vnp6apps!TMQUEUE.2105430.1.0: gtrid x0 x4e3e5705 x99c71: LIBTUX_CAT:481: ERROR: Service xa_start returned XAER_RMERR
            011109.vnp6apps!TMQUEUE.2105430.1.0: gtrid x0 x4e3e5705 x99c74: Q_CAT:1472: ERROR: xa_start() - no more tran table entries (max = 1
            0)

            ..so I increase MAXGTT ? It's currently set to 500 - the same number as it was previously without issue but this new version of our application has more servers/services so the load would be greater I suppose.

            What is MAXGTT for? And how do I monitor it's capacity? i.e. how do I tell when it's approaching the limit that might trigger the CAT:328 errors?

            thanks!
            Adam
            • 3. Re: WSNAT_CAT:1042 and/or WSNAT_CAT:1107 causing unresponsive WSL?
              Todd Little-Oracle
              Hi Adam,

              First, sorry for the name mix up!

              The GTT is the Global Transaction Table. It is a section of the Bulletin Board (BB) that tracks the participants in a distributed (global) transaction. Each active transaction on the machine takes up an entry in the GTT. So the GTT must be at least as large as the maximum number of concurrent global transactions you expect to handle on the machine.

              My guess is that is why most of these other errors are occurring, although it is possible that there are other problems as well. What is your setting for BLOCKTIME and SCANUNIT in your UBBCONFIG file? These determine the point at which requests time out.

              As to monitoring the GTT, you can either use the tmadmin subcommand printtrans specifying the machine you wish to monitor, or you can also use the Tuxedo .TMIB service to make a MIB request getting that information from the T_TRANSACTION class.

              Regards,
              Todd Little
              Oracle Tuxedo Chief Architect
              • 4. Re: WSNAT_CAT:1042 and/or WSNAT_CAT:1107 causing unresponsive WSL?
                Monty77
                Hi Todd,

                No problem, my fault entirely!

                BLOCKTIME = 6
                SCANUNIT = 10

                ..a sample from printtrans right now:
                index=0 gtrid=x0 x4e4f5d7d x1a7f8c
                : Machine id: vnp6apps, Transaction status: TMGACTIVE
                Group count: 1, timeout: 40, time left: 49
                Known participants:
                group: HSDS10, status: TMGACTIVE, remote, coord
                index=1 gtrid=x0 x4e4f5d7d x1a7f8b
                : Machine id: vnp6apps, Transaction status: TMGACTIVE
                Group count: 1, timeout: 40, time left: 49
                Known participants:
                group: HSDS10, status: TMGACTIVE, remote, coord
                index=2 gtrid=x0 x4e4f5d7d x1a7f8a
                : Machine id: vnp6apps, Transaction status: TMGACTIVE
                Group count: 1, timeout: 999, time left: 1008
                Known participants:
                group: RC10, status: TMGACTIVE, local, coord
                index=3 gtrid=x0 x4e4f5d7d x1a7f7e
                : Machine id: vnp6apps, Transaction status: TMGACTIVE
                Group count: 1, timeout: 99, time left: 108
                Known participants:
                group: RTCP10, status: TMGACTIVE, local, coord
                index=4 gtrid=x0 x4e4f5d7d x1a7f7d
                : Machine id: vnp6apps, Transaction status: TMGABORTED
                Group count: 1, timeout: 999, time left: 60
                Known participants:
                group: RC10, status: TMGACTIVE, local, coord
                index=5 gtrid=x0 x4e4f5d7d x1a7f48
                : Machine id: vnp6apps, Transaction status: TMGACTIVE
                Group count: 1, timeout: 2147483647, time left: 2147483637
                Known participants:
                group: c46010, status: TMGACTIVE, local, coord

                Some of these have different timeouts, and they look to be very long, so where are they defined? Or are they defaults and the ones with different timeouts are set elsewhere?

                After increasing MAXGTT from 500 to 1000 the errors have largely stopped .. I'm guessing that there is some spike in activity combined with a busy system that meant more than 500 items were in the GTT at one time, the mean value over the last 24 hours seems to be <50.

                I'm still seeing a ton of these in the ULOG though:

                170156.vnp6apps!WSH.1651052.1.0: WSNAT_CAT:1042: ERROR: tpcall() call failed, tperrno = 6
                170205.vnp6apps!WSH.1147054.1.0: WSNAT_CAT:1042: ERROR: tpcall() call failed, tperrno = 6
                170211.vnp6apps!WSH.1651052.1.0: WSNAT_CAT:1042: ERROR: tpcall() call failed, tperrno = 6
                170217.vnp6apps!WSH.1888392.1.0: WSNAT_CAT:1042: ERROR: tpcall() call failed, tperrno = 6
                170219.vnp6apps!WSH.1667102.1.0: WSNAT_CAT:1042: ERROR: tpcall() call failed, tperrno = 6
                170219.vnp6apps!WSH.1179684.1.0: WSNAT_CAT:1042: ERROR: tpcall() call failed, tperrno = 6
                170220.vnp6apps!WSH.1888392.1.0: WSNAT_CAT:1042: ERROR: tpcall() call failed, tperrno = 6

                Google tells me that it might be missing TMSNAME from the *GROUPS section for my WSL:

                *GROUPS
                "broker" LMID="vnp6apps" GRPNO=400
                TMSCOUNT=2
                "WSL" LMID="vnp6apps" GRPNO=401
                TMSCOUNT=3
                "WSL2" LMID="vnp6apps" GRPNO=402
                TMSCOUNT=3

                ...I don't have one setup, but then I never did before and didnt see the 1042 errors.

                Could it be anything else?

                Thanks!
                Adam
                Todd Little wrote:
                Hi Adam,

                First, sorry for the name mix up!

                The GTT is the Global Transaction Table. It is a section of the Bulletin Board (BB) that tracks the participants in a distributed (global) transaction. Each active transaction on the machine takes up an entry in the GTT. So the GTT must be at least as large as the maximum number of concurrent global transactions you expect to handle on the machine.

                My guess is that is why most of these other errors are occurring, although it is possible that there are other problems as well. What is your setting for BLOCKTIME and SCANUNIT in your UBBCONFIG file? These determine the point at which requests time out.

                As to monitoring the GTT, you can either use the tmadmin subcommand printtrans specifying the machine you wish to monitor, or you can also use the Tuxedo .TMIB service to make a MIB request getting that information from the T_TRANSACTION class.

                Regards,
                Todd Little
                Oracle Tuxedo Chief Architect
                • 5. Re: WSNAT_CAT:1042 and/or WSNAT_CAT:1107 causing unresponsive WSL?
                  Todd Little-Oracle
                  Hi Adam,

                  The timeout time for a transaction can either be specified when the transaction is started, or in the case of AUTOTRAN services the value specified by the TRANTIME parameter in the RESOURCES or SERVICES sections of the UBBCONFIG file. Some of those transactions have timeouts that on the order of 15 minutes or more (with one virtually infinity!), which seems very high for a high volume TP system. Although having 500 or more outstanding transactions isn't a problem, it does seem a little high. It largely depends upon how much concurrent work you expect your system to be able to handle.

                  errno=6 is a TPENOENT error, which basically means some workstation client (remote client) tried to call a service that didn't exist. Now there may be a problem with the client, or it's possible all copies of the server(s) that offer the service are down for some reason. I suppose it is possible that it is related to TMS processes, but that would be odd. It's also curious that you have TMSCOUNT defined for some groups, but no TMS named, i.e., no TMSNAME or OPENINFO parameters on the group. Do the servers in those groups access a resource manager such as a database?

                  Regards,
                  Todd Little
                  Oracle Tuxedo Chief Architect
                  • 6. Re: WSNAT_CAT:1042 and/or WSNAT_CAT:1107 causing unresponsive WSL?
                    Todd Little-Oracle
                    Hi Adam,

                    One thing that might help debug the TPENOENT errors is to set an environment variable named TMULOGUSINGSERVICENAME to any value before booting your application. That will cause an entry in the ULOG that indicates the name of the service that was requested but wasn't found.

                    Regards,
                    Todd Little
                    Oracle Tuxedo Chief Architect

                    PS Thanks to Pierce Shi for digging up that little tidbit!
                    • 7. Re: WSNAT_CAT:1042 and/or WSNAT_CAT:1107 causing unresponsive WSL?
                      Monty77
                      Todd Little wrote:
                      Hi Adam,

                      One thing that might help debug the TPENOENT errors is to set an environment variable named TMULOGUSINGSERVICENAME to any value before booting your application. That will cause an entry in the ULOG that indicates the name of the service that was requested but wasn't found.

                      Regards,
                      Todd Little
                      Oracle Tuxedo Chief Architect

                      PS Thanks to Pierce Shi for digging up that little tidbit!
                      Hi Todd,

                      Thanks for that! I'm assuming I can put that in the ENVFILE defined in the UBB? In the meantime I'll check the services section against our template to make sure everything is there.

                      Thanks!
                      Adam
                      • 8. Re: WSNAT_CAT:1042 and/or WSNAT_CAT:1107 causing unresponsive WSL?
                        Todd Little-Oracle
                        Hi Adam,

                        You should be able to use the ENVFILE on the WSL entries, but if for some reason that doesn't work, simply define the environment variable in your shell before booting the application.

                        Regards,
                        Todd Little
                        Oracle Tuxedo Chief Architect
                        • 9. Re: WSNAT_CAT:1042 and/or WSNAT_CAT:1107 causing unresponsive WSL?
                          Monty77
                          Todd Little wrote:
                          Hi Adam,

                          You should be able to use the ENVFILE on the WSL entries, but if for some reason that doesn't work, simply define the environment variable in your shell before booting the application.

                          Regards,
                          Todd Little
                          Oracle Tuxedo Chief Architect
                          Hi Todd,

                          I finally got the changes made in production as recommended and the 1042s have now turned into this:

                          141601.vnp6apps!WSH.1933788.1.0: WSNAT_CAT:1300: ERROR: tpcall() call for service ..TMSYSMIB failed, tperrno = 6
                          141608.vnp6apps!WSH.1581262.1.0: WSNAT_CAT:1300: ERROR: tpcall() call for service ..TMSYSMIB failed, tperrno = 6

                          ...what is ..TMSYSMIB ? When I do psc -a 2 I dont see it, which is probably why I see the error, but what is it?

                          The MAXGTT error has gone since the parameter change but I'm still seeing a LOT of these just before the WSL becomes unresponsive:

                          024808.vnp6apps!TMQUEUE.3403902.1.0: gtrid x0 x4e57d635 x27c3f8:     Q_CAT:1472: ERROR: xa_start() - no more tran table entries (max = 10)
                          024808.vnp6apps!TMQUEUE.3403902.1.0: gtrid x0 x4e57d635 x27c3f8:     LIBTUX_CAT:481: ERROR: Service xa_start returned XAER_RMERR

                          tran table entries (max = 10)

                          ... what is that referring to? And where is the limit set?

                          Thanks!
                          Adam
                          • 10. Re: WSNAT_CAT:1042 and/or WSNAT_CAT:1107 causing unresponsive WSL?
                            Monty77
                            OK, so I've done a little more digging, boy tuxedo error information is hard to find!!

                            Looks like the limit is set in the queue itself on creation, and for whatever reason we're setting it to 10. We always have done so I'm curious to know why it's suddenly an issue. Slow DB performance perhaps meaning more is hanging about that would normally?

                            For now I think I'll increase the queue limits to 50 and see how we go, this is one of our largest customers so perhaps they're hitting limits nobody else has seen yet.

                            Thanks!
                            Adam
                            • 11. Re: WSNAT_CAT:1042 and/or WSNAT_CAT:1107 causing unresponsive WSL?
                              Todd Little-Oracle
                              Hi Adam,

                              My guess is that you are just hitting the next limit in your configuration.

                              As for finding error information, you should be able to easily find the description and probable solution in the Tuxedo documentation. If you go to the Tuxedo documentation home page, there is an entry called "Messages". When you select that, you'll get to a page that divides the error messages into their categories. In your case, Q_CAT:1472 indicates that the category is "Q". If you select that category, you'll be presented with a list of error number ranges. If you select 1400-1499, you'll get to a page that has a description of the error and recommended action such as:

                              >
                              1472

                              ERROR: xa_start() - no more tran table entries (max = entries)

                              Description

                              xa_start was called to start a new transaction but no transaction entries are free in the queue space.

                              Action

                              Execute the qprinttrans command to see what transactions are currently outstanding. Use qcommittrans or qaborttrans to complete transactions that are in READY state but not completed.

                              See Also

                              qmadmin(1)
                              >

                              Unfortunately in this case, the Action doesn't tell you specifically what parameter needs to change. If you have time, please submit documentation feedback from the link provided at the bottom of the page.

                              Glad to hear you've gotten this mostly resolved! :-)

                              Regards,
                              Todd Little
                              Oracle Tuxedo Chief Architect
                              • 12. Re: WSNAT_CAT:1042 and/or WSNAT_CAT:1107 causing unresponsive WSL?
                                Monty77
                                Hi Todd,

                                I'm still working my way through this one, increasing the transaction limits on the queue spaces definitely seems to help but I still have 2 problems:

                                1 - The WSNAT_CAT:1300s

                                141601.vnp6apps!WSH.1933788.1.0: WSNAT_CAT:1300: ERROR: tpcall() call for service ..TMSYSMIB failed, tperrno = 6
                                141608.vnp6apps!WSH.1581262.1.0: WSNAT_CAT:1300: ERROR: tpcall() call for service ..TMSYSMIB failed, tperrno = 6

                                ...what is ..TMSYSMIB ? When I do psc -a 2 I dont see it, which is probably why I see the error, but what is it?

                                2 - Unresponsive tuxedo in general

                                I saw it for myself tonight for the first time (rather than customer reporting in). tmadmin / tmboot / tmshutdown etc.. all unresponsive, nothing writing to the ULOG. Where do I begin with a hang like that if it's not writing anything anywhere, suggestions welcome :)

                                Thanks!
                                Adam

                                Todd Little wrote:
                                Hi Adam,

                                My guess is that you are just hitting the next limit in your configuration.

                                As for finding error information, you should be able to easily find the description and probable solution in the Tuxedo documentation. If you go to the Tuxedo documentation home page, there is an entry called "Messages". When you select that, you'll get to a page that divides the error messages into their categories. In your case, Q_CAT:1472 indicates that the category is "Q". If you select that category, you'll be presented with a list of error number ranges. If you select 1400-1499, you'll get to a page that has a description of the error and recommended action such as:

                                >
                                1472

                                ERROR: xa_start() - no more tran table entries (max = entries)

                                Description

                                xa_start was called to start a new transaction but no transaction entries are free in the queue space.

                                Action

                                Execute the qprinttrans command to see what transactions are currently outstanding. Use qcommittrans or qaborttrans to complete transactions that are in READY state but not completed.

                                See Also

                                qmadmin(1)
                                >

                                Unfortunately in this case, the Action doesn't tell you specifically what parameter needs to change. If you have time, please submit documentation feedback from the link provided at the bottom of the page.

                                Glad to hear you've gotten this mostly resolved! :-)

                                Regards,
                                Todd Little
                                Oracle Tuxedo Chief Architect
                                • 13. Re: WSNAT_CAT:1042 and/or WSNAT_CAT:1107 causing unresponsive WSL?
                                  695627
                                  Hi, Adam:
                                  This is Pierce Shi, Tuxedo SE manager.
                                  Can you tell me the patch level?
                                  Regards.
                                  /Pierce
                                  • 14. Re: WSNAT_CAT:1042 and/or WSNAT_CAT:1107 causing unresponsive WSL?
                                    695627
                                    Hi, Adam:
                                    This is Pierce Shi, Tuxedo SE manager.
                                    Can you tell me the patch level?
                                    Regards.
                                    /Pierce
                                    1 2 Previous Next