1 Reply Latest reply: Dec 3, 2012 11:07 PM by Deb_1

    Domain Communication Problems (Frequent Disconnection And Service Failures)

    946987
      Hi,

      We are using Oracle Tuxedo, Version 10.3.0.0, 64-bit, Patch Level 095 on an AIX 6.1 POWER7 machine. We have four domains: two with an MP (master-slave) configuration and two standalone domains. Local and remote services are published across the domains. During test runs we found that the domains keep disconnecting from each other and do not seem to reconnect, although the ULOG reports that the connection has been re-established.

      Let me present one scenario. I got the following from the ULOG:

      071515.uaix3034!GWTDOMAIN.18022630.772.0: LIBGWT_CAT:1130: INFO: Disconnected from domain (domainid=<PATDom2>)
      071515.uaix3034!GWTDOMAIN.18022630.772.0: LIBGWT_CAT:1354: INFO: Retrying domain (domainid=<PATDom2>) every 60 seconds
      071515.uaix3034!GWTDOMAIN.18022630.772.0: LIBGWT_CAT:1502: WARN: Message with TPNOREPLY to service ..TMS dropped - network down
      071515.uaix3034!GWTDOMAIN.18022630.772.0: LIBGWT_CAT:1502: WARN: Message with TPNOREPLY to service ..TMS dropped - network down
      071515.uaix3034!GWTDOMAIN.18022630.772.0: LIBGWT_CAT:1502: WARN: Message with TPNOREPLY to service ..TMS dropped - network down
      071552.uaix3034!GWTDOMAIN.18022630.772.0: LIBGWT_CAT:1502: WARN: Message with TPNOREPLY to service ..TMS dropped - network down
      071552.uaix3034!GWTDOMAIN.18022630.772.0: LIBGWT_CAT:1502: WARN: Message with TPNOREPLY to service ..TMS dropped - network down
      071552.uaix3034!GWTDOMAIN.18022630.772.0: LIBGWT_CAT:1502: WARN: Message with TPNOREPLY to service ..TMS dropped - network down
      071553.uaix3034!GWTDOMAIN.18022630.772.0: LIBGWT_CAT:1129: INFO: Connection established with domain (domainid=<PATDom2>)
      071602.uaix3034!GWTDOMAIN.18022630.772.0: LIBGWT_CAT:1130: INFO: Disconnected from domain (domainid=<PATDom1>)
      071602.uaix3034!GWTDOMAIN.18022630.772.0: LIBGWT_CAT:1354: INFO: Retrying domain (domainid=<PATDom1>) every 60 seconds
      071602.uaix3034!GWTDOMAIN.18022630.772.0: LIBGWT_CAT:1502: WARN: Message with TPNOREPLY to service ..TMS dropped - network down
      071602.uaix3034!GWTDOMAIN.18022630.772.0: LIBGWT_CAT:1502: WARN: Message with TPNOREPLY to service ..TMS dropped - network down
      071602.uaix3034!GWTDOMAIN.18022630.772.0: LIBGWT_CAT:1502: WARN: Message with TPNOREPLY to service ..TMS dropped - network down
      071602.uaix3034!GWTDOMAIN.18022630.772.0: LIBGWT_CAT:1502: WARN: Message with TPNOREPLY to service ..TMS dropped - network down
      071602.uaix3034!GWTDOMAIN.18022630.772.0: LIBGWT_CAT:1502: WARN: Message with TPNOREPLY to service ..TMS dropped - network down
      071602.uaix3034!GWTDOMAIN.18022630.772.0: LIBGWT_CAT:1502: WARN: Message with TPNOREPLY to service ..TMS dropped - network down
      071652.uaix3034!GWTDOMAIN.18022630.772.0: LIBGWT_CAT:1502: WARN: Message with TPNOREPLY to service ..TMS dropped - network down
      071652.uaix3034!GWTDOMAIN.18022630.772.0: LIBGWT_CAT:1502: WARN: Message with TPNOREPLY to service ..TMS dropped - network down
      071652.uaix3034!GWTDOMAIN.18022630.772.0: LIBGWT_CAT:1502: WARN: Message with TPNOREPLY to service ..TMS dropped - network down
      071652.uaix3034!GWTDOMAIN.18022630.772.0: LIBGWT_CAT:1502: WARN: Message with TPNOREPLY to service ..TMS dropped - network down
      071652.uaix3034!GWTDOMAIN.18022630.772.0: LIBGWT_CAT:1502: WARN: Message with TPNOREPLY to service ..TMS dropped - network down
      071652.uaix3034!GWTDOMAIN.18022630.772.0: LIBGWT_CAT:1502: WARN: Message with TPNOREPLY to service ..TMS dropped - network down
      071653.uaix3034!GWTDOMAIN.18022630.772.0: LIBGWT_CAT:1129: INFO: Connection established with domain (domainid=<PATDom1>)
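
The disconnect and reconnect times in an excerpt like the one above can be tallied with a short script. This is only a sketch: the line layout is inferred from the excerpt itself rather than from Tuxedo documentation, the function names (`summarize_outages`, `to_seconds`) are hypothetical, and it assumes all lines fall on the same day (no midnight rollover).

```python
import re

# Assumed ULOG layout, inferred only from the excerpt in this post:
#   HHMMSS.host!process.pid.thread.context: CATALOG:number: LEVEL: text
LINE_RE = re.compile(
    r"^\s*(?P<time>\d{6})\.(?P<host>[^!]+)!(?P<proc>\S+?):\s+"
    r"(?P<cat>\w+):(?P<num>\d+):\s+(?P<level>\w+):\s+(?P<text>.*)$"
)
DOMAIN_RE = re.compile(r"domainid=<([^>]+)>")


def to_seconds(hhmmss):
    """Convert an HHMMSS stamp to seconds since midnight."""
    return int(hhmmss[:2]) * 3600 + int(hhmmss[2:4]) * 60 + int(hhmmss[4:])


def summarize_outages(lines):
    """Pair LIBGWT_CAT:1130 (disconnected) with the next LIBGWT_CAT:1129
    (connection established) for the same domain, and count
    LIBGWT_CAT:1502 (dropped message) warnings."""
    open_since = {}  # domain id -> HHMMSS of the pending disconnect
    outages = []     # (domain, disconnected_at, reconnected_at, seconds)
    dropped = 0
    for line in lines:
        m = LINE_RE.match(line)
        if not m or m.group("cat") != "LIBGWT_CAT":
            continue
        num = m.group("num")
        dom = DOMAIN_RE.search(m.group("text"))
        if num == "1130" and dom:
            open_since[dom.group(1)] = m.group("time")
        elif num == "1129" and dom:
            start = open_since.pop(dom.group(1), None)
            if start is not None:
                end = m.group("time")
                outages.append(
                    (dom.group(1), start, end, to_seconds(end) - to_seconds(start))
                )
        elif num == "1502":
            dropped += 1
    return outages, dropped


# A shortened copy of the excerpt, used only to exercise the function.
SAMPLE = [
    "071515.uaix3034!GWTDOMAIN.18022630.772.0: LIBGWT_CAT:1130: INFO: Disconnected from domain (domainid=<PATDom2>)",
    "071515.uaix3034!GWTDOMAIN.18022630.772.0: LIBGWT_CAT:1354: INFO: Retrying domain (domainid=<PATDom2>) every 60 seconds",
    "071515.uaix3034!GWTDOMAIN.18022630.772.0: LIBGWT_CAT:1502: WARN: Message with TPNOREPLY to service ..TMS dropped - network down",
    "071552.uaix3034!GWTDOMAIN.18022630.772.0: LIBGWT_CAT:1502: WARN: Message with TPNOREPLY to service ..TMS dropped - network down",
    "071553.uaix3034!GWTDOMAIN.18022630.772.0: LIBGWT_CAT:1129: INFO: Connection established with domain (domainid=<PATDom2>)",
]

if __name__ == "__main__":
    outages, dropped = summarize_outages(SAMPLE)
    for domain, start, end, seconds in outages:
        print(f"{domain}: down {start} -> {end} ({seconds} s)")
    print(f"dropped TPNOREPLY messages: {dropped}")
```

Run against the full excerpt, this pairing gives an outage of 38 seconds for PATDom2 (071515 to 071553) and 51 seconds for PATDom1 (071602 to 071653), with 18 dropped-message warnings in between.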

      I get a message that the connection is re-established (last line of the log above), but one of the remote services called from remote domain PATDom1 failed with a TPESYSTEM error. The call went through only after many retries and after bbclean and pclean were run through tmadmin.

      This is a true OLTP application, and because of these service failures outgoing messages are delayed instead of being sent in real time.

      I have the following questions:

      1/ Does LIBGWT_CAT:1502 mean that the network between the domains is down? We checked at the network level and found no issue, so does the message point to some other error?
      2/ How can we trace domain communication (service calls across domains) more effectively, so that any service failure can be detected early and handled?

      Regards,
      Ajeet Tewari