This discussion is archived
4 Replies. Latest reply: Apr 25, 2013 4:32 AM by 800381

Cannot stop resource process in solaris cluster

1005344 Newbie
This is a Solaris cluster of two T2000 nodes and a disk array. The OS is Solaris 11.1 and the cluster software is Solaris Cluster 4.1, running in failover mode.
We deployed the JBoss application server as a resource group in the cluster. It is installed on the global file system (/global/jboss/).

When the service was responding slowly, I tried to disable the JBoss resource with the Solaris Cluster command, but the stop failed. The JBoss resource is now in state R_STOP_FAILED, and the parent PID of the JBoss process is 1.

Questions:
Why can't the JBoss server process be stopped, and why is its parent PID 1?
How can I avoid this situation?


The following messages are from the Solaris system log (/var/adm/messages):


Apr 12 14:51:42 NPCLTE-NODE1 Cluster.RGM.global.rgmd: [ID 603096 daemon.notice] resource jbossha-rs disabled.
Apr 12 14:51:42 NPCLTE-NODE1 Cluster.RGM.global.rgmd: [ID 529407 daemon.notice] resource group jbossha-rg state on node NPCLTE-NODE1 change to RG_ON_PENDING_DISABLED
Apr 12 14:51:42 NPCLTE-NODE1 Cluster.RGM.global.rgmd: [ID 224900 daemon.notice] launching method <jboss_mon_stop.ksh> for resource <jbossha-rs>, resource group <jbossha-rg>, node <NPCLTE-NODE1>, timeout <300> seconds
Apr 12 14:51:42 NPCLTE-NODE1 Cluster.RGM.global.rgmd: [ID 515159 daemon.notice] method <jboss_mon_stop.ksh> completed successfully for resource <jbossha-rs>, resource group <jbossha-rg>, node <NPCLTE-NODE1>, time used: 0% of timeout <300 seconds>
Apr 12 14:51:42 NPCLTE-NODE1 Cluster.RGM.global.rgmd: [ID 443746 daemon.notice] resource jbossha-rs state on node NPCLTE-NODE1 change to R_ONLINE_UNMON
Apr 12 14:51:42 NPCLTE-NODE1 Cluster.RGM.global.rgmd: [ID 443746 daemon.notice] resource jbossha-rs state on node NPCLTE-NODE1 change to R_STOPPING
Apr 12 14:51:42 NPCLTE-NODE1 Cluster.RGM.global.rgmd: [ID 224900 daemon.notice] launching method <jboss_svc_stop.ksh> for resource <jbossha-rs>, resource group <jbossha-rg>, node <NPCLTE-NODE1>, timeout <300> seconds
Apr 12 14:51:42 NPCLTE-NODE1 Cluster.RGM.global.rgmd: [ID 784560 daemon.notice] resource jbossha-rs status on node NPCLTE-NODE1 change to R_FM_UNKNOWN
Apr 12 14:51:42 NPCLTE-NODE1 Cluster.RGM.global.rgmd: [ID 922363 daemon.notice] resource jbossha-rs status msg on node NPCLTE-NODE1 change to <Stopping>
Apr 12 14:52:00 NPCLTE-NODE1 automountd[987]: [ID 490373 daemon.error] do_mapent_fedfs: Cannot find entry for .ftpaccess
Apr 12 14:55:42 NPCLTE-NODE1 npc.jboss:1.0,jbossha-rg,jbossha-rs: [ID 980619 daemon.error] Failed to stop jboss with SIGTERM;                     retry with SIGKILL
Apr 12 14:56:27 NPCLTE-NODE1 npc.jboss:1.0,jbossha-rg,jbossha-rs: [ID 620516 daemon.error] Failed to stop jboss; exiting UNSUCCESFUL
Apr 12 14:56:27 NPCLTE-NODE1 Cluster.RGM.global.rgmd: [ID 938318 daemon.error] Method <jboss_svc_stop.ksh> failed on resource <jbossha-rs> in resource group <jbossha-rg> [exit code <1>, time used: 94% of timeout <300 seconds>]
Apr 12 14:56:27 NPCLTE-NODE1 Cluster.RGM.global.rgmd: [ID 443746 daemon.error] resource jbossha-rs state on node NPCLTE-NODE1 change to R_STOP_FAILED
Apr 12 14:56:27 NPCLTE-NODE1 Cluster.RGM.global.rgmd: [ID 784560 daemon.notice] resource jbossha-rs status on node NPCLTE-NODE1 change to R_FM_FAULTED
Apr 12 14:56:27 NPCLTE-NODE1 Cluster.RGM.global.rgmd: [ID 922363 daemon.notice] resource jbossha-rs status msg on node NPCLTE-NODE1 change to <>
Apr 12 14:56:27 NPCLTE-NODE1 Cluster.RGM.global.rgmd: [ID 529407 daemon.error] resource group jbossha-rg state on node NPCLTE-NODE1 change to RG_ERROR_STOP_FAILED
Apr 12 14:56:27 NPCLTE-NODE1 Cluster.RGM.global.rgmd: [ID 870123 daemon.warning] Resource group <jbossha-rg> might require operator attention due to STOP failure
Apr 12 15:06:09 NPCLTE-NODE1 automountd[987]: [ID 490373 daemon.error] do_mapent_fedfs: Cannot find entry for .ftpaccess
Apr 12 15:06:17 NPCLTE-NODE1 last message repeated 2 times
Apr 12 15:13:39 NPCLTE-NODE1 Cluster.RGM.global.rgmd: [ID 529407 daemon.notice] resource group jbossha-rg state on node NPCLTE-NODE1 change to RG_PENDING_OFFLINE
Apr 12 15:13:39 NPCLTE-NODE1 Cluster.RGM.global.rgmd: [ID 529407 daemon.error] resource group jbossha-rg state on node NPCLTE-NODE1 change to RG_PENDING_OFF_STOP_FAILED
Apr 12 15:13:39 NPCLTE-NODE1 Cluster.RGM.global.rgmd: [ID 870123 daemon.warning] Resource group <jbossha-rg> might require operator attention due to STOP failure
Apr 12 15:13:39 NPCLTE-NODE1 Cluster.RGM.global.rgmd: [ID 224900 daemon.notice] launching method <hafoip_monitor_stop> for resource <jbossha-lh-rs>, resource group <jbossha-rg>, node <NPCLTE-NODE1>, timeout <300> seconds
Apr 12 15:13:39 NPCLTE-NODE1 Cluster.RGM.global.rgmd: [ID 224900 daemon.notice] launching method <hastorageplus_monitor_stop> for resource <jbossha-ha-rs>, resource group <jbossha-rg>, node <NPCLTE-NODE1>, timeout <90> seconds
Apr 12 15:13:40 NPCLTE-NODE1 Cluster.RGM.global.rgmd: [ID 515159 daemon.notice] method <hafoip_monitor_stop> completed successfully for resource <jbossha-lh-rs>, resource group <jbossha-rg>, node <NPCLTE-NODE1>, time used: 0% of timeout <300 seconds>
Apr 12 15:13:40 NPCLTE-NODE1 Cluster.RGM.global.rgmd: [ID 443746 daemon.notice] resource jbossha-lh-rs state on node NPCLTE-NODE1 change to R_ONLINE_UNMON
Apr 12 15:13:40 NPCLTE-NODE1 Cluster.RGM.global.rgmd: [ID 443746 daemon.notice] resource jbossha-lh-rs state on node NPCLTE-NODE1 change to R_STOPPING
Apr 12 15:13:40 NPCLTE-NODE1 Cluster.RGM.global.rgmd: [ID 224900 daemon.notice] launching method <hafoip_stop> for resource <jbossha-lh-rs>, resource group <jbossha-rg>, node <NPCLTE-NODE1>, timeout <300> seconds
Apr 12 15:13:40 NPCLTE-NODE1 Cluster.RGM.global.rgmd: [ID 784560 daemon.notice] resource jbossha-lh-rs status on node NPCLTE-NODE1 change to R_FM_UNKNOWN
Apr 12 15:13:40 NPCLTE-NODE1 Cluster.RGM.global.rgmd: [ID 922363 daemon.notice] resource jbossha-lh-rs status msg on node NPCLTE-NODE1 change to <Stopping>
Apr 12 15:13:40 NPCLTE-NODE1 Cluster.RGM.global.rgmd: [ID 515159 daemon.notice] method <hastorageplus_monitor_stop> completed successfully for resource <jbossha-ha-rs>, resource group <jbossha-rg>, node <NPCLTE-NODE1>, time used: 0% of timeout <90 seconds>
Apr 12 15:13:40 NPCLTE-NODE1 Cluster.RGM.global.rgmd: [ID 443746 daemon.notice] resource jbossha-ha-rs state on node NPCLTE-NODE1 change to R_ONLINE_UNMON
Apr 12 15:13:40 NPCLTE-NODE1 ip: [ID 678092 kern.notice] TCP_IOC_ABORT_CONN: local = 192.168.071.168:0, remote = 000.000.000.000:0, start = -2, end = 6
Apr 12 15:13:40 NPCLTE-NODE1 ip: [ID 302654 kern.notice] TCP_IOC_ABORT_CONN: aborted 23 connections
Apr 12 15:13:40 NPCLTE-NODE1 Cluster.RGM.global.rgmd: [ID 784560 daemon.notice] resource jbossha-lh-rs status on node NPCLTE-NODE1 change to R_FM_OFFLINE
Apr 12 15:13:40 NPCLTE-NODE1 Cluster.RGM.global.rgmd: [ID 922363 daemon.notice] resource jbossha-lh-rs status msg on node NPCLTE-NODE1 change to <LogicalHostname offline.>
Apr 12 15:13:40 NPCLTE-NODE1 Cluster.RGM.global.rgmd: [ID 515159 daemon.notice] method <hafoip_stop> completed successfully for resource <jbossha-lh-rs>, resource group <jbossha-rg>, node <NPCLTE-NODE1>, time used: 0% of timeout <300 seconds>
Apr 12 15:13:40 NPCLTE-NODE1 Cluster.RGM.global.rgmd: [ID 443746 daemon.notice] resource jbossha-lh-rs state on node NPCLTE-NODE1 change to R_OFFLINE
Apr 12 15:13:40 NPCLTE-NODE1 Cluster.RGM.global.rgmd: [ID 224900 daemon.notice] launching method <hastorageplus_postnet_stop> for resource <jbossha-ha-rs>, resource group <jbossha-rg>, node <NPCLTE-NODE1>, timeout <1800> seconds
Apr 12 15:13:40 NPCLTE-NODE1 Cluster.RGM.global.rgmd: [ID 784560 daemon.notice] resource jbossha-ha-rs status on node NPCLTE-NODE1 change to R_FM_UNKNOWN
Apr 12 15:13:40 NPCLTE-NODE1 Cluster.RGM.global.rgmd: [ID 922363 daemon.notice] resource jbossha-ha-rs status msg on node NPCLTE-NODE1 change to <Stopping>
Apr 12 15:13:40 NPCLTE-NODE1 proftpd[5759]: NPCLTE-NODE1 (::ffff:192.168.74.51[::ffff:192.168.74.51]) - notice: user root: aborting transfer: No such file or directory
Apr 12 15:13:40 NPCLTE-NODE1 proftpd[7024]: NPCLTE-NODE1 (::ffff:192.168.74.36[::ffff:192.168.74.36]) - error setting TCP_CORK: Invalid argument
Apr 12 15:13:40 NPCLTE-NODE1 proftpd[7156]: NPCLTE-NODE1 (::ffff:192.168.74.29[::ffff:192.168.74.29]) - error setting TCP_CORK: Invalid argument
Apr 12 15:13:40 NPCLTE-NODE1 Cluster.RGM.global.rgmd: [ID 515159 daemon.notice] method <hastorageplus_postnet_stop> completed successfully for resource <jbossha-ha-rs>, resource group <jbossha-rg>, node <NPCLTE-NODE1>, time used: 0% of timeout <1800 seconds>
Apr 12 15:13:40 NPCLTE-NODE1 Cluster.RGM.global.rgmd: [ID 443746 daemon.notice] resource jbossha-ha-rs state on node NPCLTE-NODE1 change to R_OFFLINE
Apr 12 15:13:40 NPCLTE-NODE1 Cluster.RGM.global.rgmd: [ID 784560 daemon.notice] resource jbossha-ha-rs status on node NPCLTE-NODE1 change to R_FM_OFFLINE
Apr 12 15:13:40 NPCLTE-NODE1 Cluster.RGM.global.rgmd: [ID 922363 daemon.notice] resource jbossha-ha-rs status msg on node NPCLTE-NODE1 change to <>
Apr 12 15:13:40 NPCLTE-NODE1 Cluster.RGM.global.rgmd: [ID 529407 daemon.error] resource group jbossha-rg state on node NPCLTE-NODE1 change to RG_ERROR_STOP_FAILED
Apr 12 15:13:40 NPCLTE-NODE1 Cluster.RGM.global.rgmd: [ID 870123 daemon.warning] Resource group <jbossha-rg> might require operator attention due to STOP failure
Apr 12 15:13:41 NPCLTE-NODE1 proftpd[16264]: NPCLTE-NODE1 (::ffff:192.168.74.128[::ffff:192.168.74.128]) - error setting TCP_CORK: Invalid argument
  • 1. Re: Cannot stop resource process in solaris cluster
    Nik Expert
    Hi.

    The configured stop command cannot stop JBoss:
    Apr 12 14:55:42 NPCLTE-NODE1 npc.jboss:1.0,jbossha-rg,jbossha-rs: [ID 980619 daemon.error] Failed to stop jboss with SIGTERM; retry with SIGKILL
    Apr 12 14:56:27 NPCLTE-NODE1 npc.jboss:1.0,jbossha-rg,jbossha-rs: [ID 620516 daemon.error] Failed to stop jboss; exiting UNSUCCESFUL
    Apr 12 14:56:27 NPCLTE-NODE1 Cluster.RGM.global.rgmd: [ID 938318 daemon.error] Method <jboss_svc_stop.ksh> failed on resource <jbossha-rs> in resource group <jbossha-rg> [exit code <1>, time used: 94% of timeout <300 seconds>]

    What resource type is used to monitor JBoss?

    Check that the stop script works correctly.

    Regards.
  • 2. Re: Cannot stop resource process in solaris cluster
    1005344 Newbie
    Thanks for your attention.

    The stop script stops the process in most cases, but sometimes the process's parent PID changes to 1. The script stops the process with SIGTERM or SIGKILL. I also tried to stop the process manually with the kill command (kill -9 pid), but it still exists.
  • 3. Re: Cannot stop resource process in solaris cluster
    Nik Expert
    Hi.
    The cluster software cannot resolve an application issue (the application cannot be stopped).

    So you need to find out why the application cannot be stopped. One option is to customize the stop script to detect when the PPID changes to 1 and kill the process again.
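
The escalation described here (SIGTERM, then SIGKILL, then check whether the process is really gone) could be sketched roughly as follows. This is only an illustration, not the poster's jboss_svc_stop.ksh: the function names `is_alive`, `wait_gone` and `stop_pid` and the default wait times are hypothetical. It logs the parent PID before escalating, so a reparenting to PID 1 shows up in the output.

```shell
#!/bin/sh
# Hypothetical helper, not the actual cluster stop method.
# Usage: stop_pid PID [TERM_WAIT_SECONDS] [KILL_WAIT_SECONDS]
# Returns 0 if the process is gone, 1 if it survived SIGKILL.

# Signal 0 delivers nothing; it only tests whether the PID can be signalled.
# A finished child that has not been reaped yet still counts as existing.
is_alive() {
    kill -0 "$1" 2>/dev/null
}

# Poll once per second, for at most $2 seconds, until PID $1 has disappeared.
wait_gone() {
    pid=$1
    remaining=$2
    while [ "$remaining" -gt 0 ]; do
        is_alive "$pid" || return 0
        sleep 1
        remaining=$((remaining - 1))
    done
    ! is_alive "$pid"
}

stop_pid() {
    pid=$1
    term_wait=${2:-30}   # assumed default, tune to the application
    kill_wait=${3:-10}   # assumed default

    is_alive "$pid" || return 0

    kill -TERM "$pid" 2>/dev/null
    wait_gone "$pid" "$term_wait" && return 0

    # Record the parent PID before escalating; empty if ps is unavailable.
    parent=$(ps -o ppid= -p "$pid" 2>/dev/null | tr -d ' ')
    echo "PID $pid survived SIGTERM (ppid=${parent:-unknown}); sending SIGKILL" >&2

    kill -KILL "$pid" 2>/dev/null
    wait_gone "$pid" "$kill_wait" && return 0

    echo "PID $pid survived SIGKILL" >&2
    return 1
}
```

A process that survives SIGKILL makes the function return 1. A stop method that exits non-zero in that situation produces the STOP_FAILED state shown in the log above, so a script like this can report the condition but cannot cure it.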

    Regards.
  • 4. Re: Cannot stop resource process in solaris cluster
    800381 Explorer
    If you're still in the same state, try running a pstack against the process:
    pstack PID &
    Use the "&" to background the pstack process, because otherwise in this case you can hang your terminal session. If you're running into what I think you are, the pstack call is likely to hang in almost the same state as the JBoss process: you could "kill -9" it, but you can't break out of it or background it with CTRL-C or CTRL-Z to get back to the shell command line.
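
One way to script the same precaution is to run the diagnostic in the background with a time limit, so the calling shell always gets control back. The sketch below is an assumption-laden illustration: `run_bounded` is a hypothetical name, and the return code 124 on timeout is chosen only to mirror the convention of GNU timeout. It polls rather than blocking in `wait`, because a blocking wait would hang along with the diagnostic.

```shell
#!/bin/sh
# Hypothetical helper: run_bounded SECONDS OUTFILE COMMAND [ARGS...]
# Runs COMMAND in the background with output captured in OUTFILE.
# Returns COMMAND's exit status, or 124 if it was still running after SECONDS.
# Assumes the shell reaps finished background jobs promptly, since an
# unreaped child still answers "kill -0".
run_bounded() {
    limit=$1
    outfile=$2
    shift 2

    "$@" >"$outfile" 2>&1 &
    cmd_pid=$!

    # Poll once per second instead of blocking, so we never hang with the command.
    while [ "$limit" -gt 0 ] && kill -0 "$cmd_pid" 2>/dev/null; do
        sleep 1
        limit=$((limit - 1))
    done

    if kill -0 "$cmd_pid" 2>/dev/null; then
        # Still running: try SIGKILL and report a timeout either way.
        kill -KILL "$cmd_pid" 2>/dev/null
        return 124
    fi

    # Finished in time: collect and return its exit status.
    wait "$cmd_pid"
}

# Example (the output path is a placeholder):
#   run_bounded 10 /tmp/pstack.out pstack "$jboss_pid"
```

A return value of 124 here means only that the diagnostic did not finish in time; it does not by itself prove the target process is stuck.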

    If so, you're running into a "feature" introduced into Solaris with Solaris 10 - a process gets hung in a way that "kill -9" won't kill it, and running process utilities such as pstack or pmap on the process either produces garbage output or hangs the utility just like the hung process.

    I've seen Linux servers do pretty much the same thing. Solaris never did it until Solaris 10 - not sure why.

    The only fix is to reboot the server.

    I guarantee you that will be less painful than trying to go through My Oracle Support and file a bug...
