2 Replies Latest reply: Jul 20, 2012 9:31 AM by puchtec

    Status UNREACHABLE on RG=1 RN=3 -- How to bring it back into RUNNING ?

    puchtec
      Hi,

      I found that the status of RepNode group 1, node 3 became UNREACHABLE, while the Admin instance on this node works fine.

      Restarting this node also ended up in status UNREACHABLE. Content of the log file (rg1-rn3_0.log):
      07-11-12 15:08:01:275 CEST INFO [rg1-rn3] Starting service process: rg1-rn3, Java command line arguments: [-XX:+UseConcMarkSweepGC, -XX:+UseParNewGC, -XX:+DisableExplicitGC, -XX:NewRatio=18, -XX:SurvivorRatio=4, -XX:MaxTenuringThreshold=10, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseLargePages]
      07-11-12 15:08:01:317 CEST INFO [rg1-rn3] RepNodeStateUpdateThread started
      07-11-12 15:08:01:344 CEST INFO [rg1-rn3] Partition db cache mode: EVICT_LN
      07-11-12 15:08:01:402 CEST INFO [rg1-rn3] JVM Runtime maxMemory (bytes): 131006464
      07-11-12 15:08:01:403 CEST INFO [rg1-rn3] Non-default JE properties for environment: {je.rep.vlsn.logCacheSize=128, je.rep.envUnknownStateTimeout=10 s, je.tree.maxDelta=10, je.log.faultReadSize=4096, je.rep.vlsn.distance=1000000, je.env.runEvictor=true, je.cleaner.minUtilization=40, je.rep.replicaAckTimeout=5 s, je.cleaner.threads=2, je.evictor.lruOnly=false, je.rep.replayMaxOpenDbHandles=100, je.log.writeQueueSize=2097152, je.evictor.criticalPercentage=105, je.log.iteratorReadSize=1048576, je.log.numBuffers=16, je.nodeMaxEntries=128, je.evictor.nodesPerScan=200, je.rep.preHeartbeatTimeoutMs=5000000000, je.checkpointer.bytesInterval=200000000, je.cleaner.lazyMigration=false, je.txn.durability=write_no_sync,write_no_sync,simple_majority, je.cleaner.readSize=1048576, je.rep.txnRollbackLimit=10, je.log.fileCacheSize=2000, je.log.fileMax=1073741824}
      07-11-12 15:08:01:404 CEST INFO [rg1-rn3] Created new environment dir: /oracle/DEVPGN06/NODE03/PERSO/PERSO/sn3/rg1-rn3/env
      07-11-12 15:08:01:433 CEST INFO [rg1-rn3] Starting RepNodeService
      07-11-12 15:08:01:497 CEST INFO [rg1-rn3] Starting MonitorAgent
      07-11-12 15:08:01:515 CEST INFO [rg1-rn3] RepNodeAdmin registered
      07-11-12 15:08:01:753 CEST INFO [rg1-rn3] JE recovery: POPULATE_UTILIZATION_PROFILE
      07-11-12 15:08:01:779 CEST INFO [rg1-rn3] JE recovery: REMOVE_TEMP_DBS
      07-11-12 15:08:01:803 CEST INFO [rg1-rn3] JE recovery: CKPT
      07-11-12 15:08:01:832 CEST INFO [rg1-rn3] JE recovery: RECOVERY_FINISHED
      07-11-12 15:08:01:862 CEST INFO [rg1-rn3] JE: Started ServiceDispatcher. HostPort=pglddb06.pgdev.lppgrz.com:5046
      07-11-12 15:08:01:911 CEST INFO [rg1-rn3] JE: Current group size: 0
      07-11-12 15:08:01:911 CEST INFO [rg1-rn3] JE: New node rg1-rn3(-1) unknown to rep group
      07-11-12 15:08:01:919 CEST INFO [rg1-rn3] JE: Master changed to rg1-rn1
      07-11-12 15:08:01:920 CEST INFO [rg1-rn3] JE: New node rg1-rn3 located master: rg1-rn1(1)
      07-11-12 15:08:01:923 CEST INFO [rg1-rn3] JE recovery: BECOME_CONSISTENT
      07-11-12 15:08:01:924 CEST INFO [rg1-rn3] JE: Node rg1-rn3 started
      07-11-12 15:08:01:924 CEST INFO [rg1-rn3] JE: Replica loop started with master: rg1-rn1(1)
      07-11-12 15:08:01:934 CEST INFO [rg1-rn3] JE: Replica-feeder handshake start
      07-11-12 15:08:01:965 CEST INFO [rg1-rn3] JE: Replica-feeder rg1-rn1 handshake completed.
      07-11-12 15:08:01:968 CEST INFO [rg1-rn3] JE: Replica-feeder rg1-rn1 syncup started. Replica range: first=-1 last=-1 sync=-1 txnEnd=-1
      07-11-12 15:08:01:968 CEST INFO [rg1-rn3] JE syncup: FIND_MATCHPOINT 0/-1
      07-11-12 15:08:01:976 CEST INFO [rg1-rn3] JE syncup: CHECK_FOR_ROLLBACK 0/-1
      07-11-12 15:08:01:977 CEST INFO [rg1-rn3] JE: Rollback to matchpoint -1 at 0x0/0x0 status=No active txns, nothing to rollback
      07-11-12 15:08:01:977 CEST INFO [rg1-rn3] JE: Replica-feeder rg1-rn1 start stream at VLSN: 1
      07-11-12 15:08:01:978 CEST INFO [rg1-rn3] JE: Replica-feeder rg1-rn1 syncup ended. Elapsed time: 10ms
      07-11-12 15:08:01:978 CEST INFO [rg1-rn3] JE syncup: END 0/-1
      07-11-12 15:08:01:983 CEST INFO [rg1-rn3] JE: Replica initialization completed. Replica VLSN: -1  Heartbeat master commit VLSN: 623 VLSN delta: 624
      07-11-12 15:08:02:34 CEST INFO [rg1-rn3] JE: Joined group as a replica.  join consistencyPolicy=com.sleepycat.je.rep.NoConsistencyRequiredPolicy@1f first=1 last=48 sync=47 txnEnd=47
      07-11-12 15:08:02:36 CEST INFO [rg1-rn3] JE: Refreshed 0 monitors.
      07-11-12 15:08:02:36 CEST INFO [rg1-rn3] State change event: Wed Jul 11 15:08:01 CEST 2012, State: REPLICA, Master: rg1-rn1
      07-11-12 15:08:02:37 CEST INFO [rg1-rn3] Replicated environment handle established. Cache size: 78603878, State: REPLICA
      07-11-12 15:08:02:82 CEST INFO [rg1-rn3] Environment does not contain topology.
      07-11-12 15:08:02:82 CEST INFO [rg1-rn3] Topology needs to be pushed to this node.
      07-11-12 15:08:02:107 CEST INFO [rg1-rn3] Service status changed from STARTING to RUNNING
      07-11-12 15:08:02:107 CEST INFO [rg1-rn3] Started RepNodeService
      07-11-12 15:08:03:141 CEST INFO [rg1-rn3] Topology updated from seq#: 0 to 308
      07-11-12 15:08:03:143 CEST INFO [rg1-rn3] Establishing partition database handles:300
      07-11-12 15:08:03:228 CEST WARNING [rg1-rn3] Number of partitions/rep group:300 configured handles:100 Use policy command to update the replayMaxOpenDbHandles property
      07-11-12 15:08:03:228 CEST INFO [rg1-rn3] Established partition database handles: 300
      07-11-12 15:08:03:585 CEST INFO [rg1-rn3] Topology stored seq#: 308
      07-11-12 15:08:03:778 CEST INFO [rg1-rn3] newParameters: refreshing parameters
      07-11-12 15:47:06:877 CEST INFO [rg1-rn3] Service status changed from RUNNING to STOPPING
      07-11-12 15:47:06:879 CEST INFO [rg1-rn3] RepNodeStateUpdateThread shutdown
      07-11-12 15:47:06:880 CEST INFO [rg1-rn3] Dispatcher shutdown
      07-11-12 15:47:06:880 CEST INFO [rg1-rn3] Shutting down RepNode
      07-11-12 15:47:06:887 CEST INFO [rg1-rn3] JE: Shutting down node rg1-rn3(3)
      07-11-12 15:47:06:888 CEST INFO [rg1-rn3] JE: Refreshed 0 monitors.
      07-11-12 15:47:06:888 CEST INFO [rg1-rn3] JE: Elections shutdown initiated
      07-11-12 15:47:06:921 CEST INFO [rg1-rn3] JE: Elections shutdown completed
      07-11-12 15:47:06:922 CEST INFO [rg1-rn3] JE: Replica IO exception: Expected bytes: 6 read bytes: 0
      java.io.IOException: Expected bytes: 6 read bytes: 0
              at com.sleepycat.je.rep.utilint.BinaryProtocol.fillBuffer(BinaryProtocol.java:367)
              at com.sleepycat.je.rep.utilint.BinaryProtocol.read(BinaryProtocol.java:388)
              at com.sleepycat.je.rep.impl.node.Replica.doRunReplicaLoopInternalWork(Replica.java:443)
              at com.sleepycat.je.rep.impl.node.Replica.runReplicaLoopInternal(Replica.java:374)
              at com.sleepycat.je.rep.impl.node.Replica.runReplicaLoop(Replica.java:309)
              at com.sleepycat.je.rep.impl.node.RepNode.run(RepNode.java:1318)
      
      07-11-12 15:47:06:923 CEST INFO [rg1-rn3] JE: Exiting inner Replica loop.
      07-11-12 15:47:06:925 CEST INFO [rg1-rn3] JE: Replica stats - Lag waits: 0 Lag wait time: 0ms.  VLSN waits: 0 Lag wait time: 0ms.
      07-11-12 15:47:06:925 CEST INFO [rg1-rn3] JE: RepNode main thread shutting down.
      07-11-12 15:47:06:927 CEST INFO [rg1-rn3] State change event: Wed Jul 11 15:47:06 CEST 2012, State: DETACHED, Master: none
      07-11-12 15:47:06:928 CEST INFO [rg1-rn3] JE: RepNode main thread: DETACHED rg1-rn3(3) exited.
      07-11-12 15:47:06:928 CEST INFO [rg1-rn3] JE: ServiceDispatcher shutdown starting. HostPort=pglddb06.pgdev.lppgrz.com:5046 Registered services: []
      07-11-12 15:47:06:929 CEST INFO [rg1-rn3] JE: ServiceDispatcher shutdown completed. HostPort=pglddb06.pgdev.lppgrz.com:5046
      07-11-12 15:47:06:929 CEST INFO [rg1-rn3] JE: rg1-rn3(3) shutdown completed.
      07-11-12 15:47:06:942 CEST INFO [rg1-rn3] RepNodeAdmin stopping
      07-11-12 15:47:06:942 CEST INFO [rg1-rn3] Service status changed from STOPPING to STOPPED
      07-11-12 15:47:06:942 CEST INFO [rg1-rn3] MonitorAgent stopping, 21 items remain.
      07-11-12 15:47:06:943 CEST INFO [rg1-rn3] Saving untransmitted monitor info at shutdown.
      07-11-12 15:47:06:944 CEST INFO [rg1-rn3] Service status changed from RUNNING to STOPPING
      07-11-12 15:47:06:944 CEST INFO [rg1-rn3] Service status: STOPPING 07-11-12 15:47:06
      07-11-12 15:47:06:945 CEST INFO [rg1-rn3] RepNodeStateUpdateThread shutdown
      07-11-12 15:47:06:945 CEST INFO [rg1-rn3] Dispatcher shutdown
      07-11-12 15:47:06:945 CEST INFO [rg1-rn3] Shutting down RepNode
      07-11-12 15:47:06:945 CEST INFO [rg1-rn3] JE: Shutting down node rg1-rn3(3)
      07-11-12 15:47:06:945 CEST INFO [rg1-rn3] JE: Refreshed 0 monitors.
      07-11-12 15:47:06:945 CEST INFO [rg1-rn3] JE: Elections shutdown initiated
      07-11-12 15:47:06:945 CEST INFO [rg1-rn3] JE: Elections shutdown completed
      07-11-12 15:47:06:945 CEST INFO [rg1-rn3] JE: Replica IO exception: Expected bytes: 6 read bytes: 0
      java.io.IOException: Expected bytes: 6 read bytes: 0
              at com.sleepycat.je.rep.utilint.BinaryProtocol.fillBuffer(BinaryProtocol.java:367)
              at com.sleepycat.je.rep.utilint.BinaryProtocol.read(BinaryProtocol.java:388)
              at com.sleepycat.je.rep.impl.node.Replica.doRunReplicaLoopInternalWork(Replica.java:443)
              at com.sleepycat.je.rep.impl.node.Replica.runReplicaLoopInternal(Replica.java:374)
              at com.sleepycat.je.rep.impl.node.Replica.runReplicaLoop(Replica.java:309)
              at com.sleepycat.je.rep.impl.node.RepNode.run(RepNode.java:1318)
      
      07-11-12 15:47:06:946 CEST INFO [rg1-rn3] JE: Exiting inner Replica loop.
      07-11-12 15:47:06:946 CEST INFO [rg1-rn3] JE: Replica stats - Lag waits: 0 Lag wait time: 0ms.  VLSN waits: 0 Lag wait time: 0ms.
      07-11-12 15:47:06:946 CEST INFO [rg1-rn3] JE: RepNode main thread shutting down.
      07-11-12 15:47:06:946 CEST INFO [rg1-rn3] State change event: Wed Jul 11 15:47:06 CEST 2012, State: DETACHED, Master: none
      07-11-12 15:47:06:946 CEST INFO [rg1-rn3] JE: RepNode main thread: DETACHED rg1-rn3(3) exited.
      07-11-12 15:47:06:946 CEST INFO [rg1-rn3] JE: ServiceDispatcher shutdown starting. HostPort=pglddb06.pgdev.lppgrz.com:5046 Registered services: []
      07-11-12 15:47:06:946 CEST INFO [rg1-rn3] JE: ServiceDispatcher shutdown completed. HostPort=pglddb06.pgdev.lppgrz.com:5046
      07-11-12 15:47:06:946 CEST INFO [rg1-rn3] JE: rg1-rn3(3) shutdown completed.
      07-11-12 15:47:06:949 CEST INFO [rg1-rn3] RepNodeAdmin stopping
      07-11-12 15:47:06:950 CEST INFO [rg1-rn3] Service status changed from STOPPING to STOPPED
      07-11-12 15:47:06:950 CEST INFO [rg1-rn3] Service status: STOPPED 07-11-12 15:47:06
      07-11-12 15:47:06:950 CEST INFO [rg1-rn3] MonitorAgent stopping, 21 items remain.
      Any idea how I can bring it back?


      Thanks in advance,
      Frank
        • 1. Re: Status UNREACHABLE on RG=1 RN=3 -- How to bring it back into RUNNING ?
          Linda Lee
          Frank,

          In your log, there is a gap between 15:08 and 15:47, when presumably the node was up and available to service operations.

          07-11-12 15:08:03:585 CEST INFO [rg1-rn3] Topology stored seq#: 308
          07-11-12 15:08:03:778 CEST INFO [rg1-rn3] newParameters: refreshing parameters
          07-11-12 15:47:06:877 CEST INFO [rg1-rn3] Service status changed from RUNNING to STOPPING

          The STOPPING/STOPPED service status means that a shutdown was issued to this node. This could come about if an explicit shutdown-repnodes command is issued from the KVAdmin command line or the web GUI. A shutdown can also be executed when a change-params plan is run, though in that case the node should be shut down and then restarted automatically, transparently to the user.

          Can you investigate the NoSQL DB logs and your own application history to see if there are hints as to what might have happened around 15:47 to start a shutdown? In particular, you may want to look in the admin.log, and/or the <yourStoreName>.log in the logs directory.
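
          For example, a quick scan of the log directory for activity around that time might look like this (a rough sketch only -- adjust the <KVROOT> path and store name to your installation; the exact log file names can vary):

          grep "15:47" <KVROOT>/<yourStoreName>/log/admin*.log
          grep "15:47" <KVROOT>/<yourStoreName>/log/<yourStoreName>*.log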

          If the repnode stays down, it seems that its status has been changed so that it is disabled -- which means that the user intentionally wishes this service to stay down. Otherwise, the node should restart on its own. Try the show-topology and verify commands, from the command line or web GUI, to look at the status of the node and see whether it is marked as disabled.

          If it is disabled, you can restart it with the start-repnodes command.
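
          From the admin CLI the sequence would look roughly like this (a sketch only -- KVHOME is a placeholder for your install directory, and the exact plan arguments vary by release, so check the CLI help or the Administrator's Guide):

          java -jar KVHOME/lib/kvstore.jar runadmin -port 5000 -host pglddb06.pgdev.lppgrz.com
          kv-> show topology
          kv-> verify
          kv-> plan start-repnodes ...        (supply the storage node / rep node ids as shown in the help output)
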
          Linda
          • 2. Re: Status UNREACHABLE on RG=1 RN=3 -- How to bring it back into RUNNING ?
            puchtec
            Linda,

            thanks for the quick response.

            As the KV Store is currently not used by our application, no application history is available. The KV Store was created and is just up (without any data).

            Thanks for the hint about the remaining "NoSQL DB logs".

            I took a look into rg1-rn1_0.log and found these lines:
            07-11-12 15:08:03:706 CEST INFO [rg1-rn1] newParameters: refreshing parameters
            07-11-12 15:47:06:923 CEST INFO [rg1-rn1] JE: Shutting down feeder for replica rg1-rn3 Reason: Expected bytes: 6 read bytes: 0 write time:  120ms Avg write time: 41us
            07-11-12 15:47:07:27 CEST INFO [rg1-rn1] JE: Feeder output for replica rg1-rn3 shutdown. feeder VLSN: 626 currentTxnEndVLSN: 625
            07-19-12 17:11:42:899 CEST INFO [rg1-rn1] Service status changed from RUNNING to STOPPING
            07-19-12 17:11:42:916 CEST INFO [rg1-rn1] RepNodeStateUpdateThread shutdown
            On 19/07/2012 the shutdown was initiated by a kvctl stop.

            I also took a look into the admin.log, where I found:
            07-11-12 15:47:06:862 CEST INFO [admin1] Plan 11[Plan-11]:  plan started, 1 tasks
            07-11-12 15:47:06:866 CEST INFO [admin1] Plan 11[Plan-11]: started task 1: StopRepNode
            07-11-12 15:47:06:916 CEST INFO [admin1] New admin web client at 10.51.21.48, session id 13su4sydo5ku51t55zv1v1efb7
            07-11-12 15:47:07:340 CEST INFO [admin1] Removing rg1-rn3 from monitoring
            07-11-12 15:47:07:346 CEST INFO [admin1] Plan 11[Plan-11]: ended task 1, state= SUCCEEDED
            07-11-12 15:47:07:349 CEST INFO [admin1] Plan 11[Plan-11]: plan ended with state SUCCEEDED
            07-11-12 15:47:07:356 CEST INFO [admin1] [sn3] rg1-rn3: Service status: STOPPED 07-11-12 15:47:07
            07-11-12 15:47:14:599 CEST INFO [admin1] Verify: starting for PERSO based upon topology sequence #308, 300 partitions and 3 Storage Nodes, version:11gR2.1.2.123
            07-11-12 15:47:14:600 CEST INFO [admin1] Verify: == checking storage node sn1 ==
            07-11-12 15:47:14:617 CEST INFO [admin1] Verify: Storage Node [sn1] on pglddb06.pgdev.lppgrz.com:5000    Datacenter: PERSO pglddb06.pgdev.lppgrz.com [dc1]    Status: RUNNING   Ver: 11gR2.1.2.123
            07-11-12 15:47:14:621 CEST INFO [admin1] Verify:        Admin [admin1]          Status: RUNNING
            07-11-12 15:47:14:633 CEST INFO [admin1] Verify:        Rep Node [rg1-rn1]      Status: RUNNING,MASTER at sequence number: 625 haPort: 5006
            07-11-12 15:47:14:633 CEST INFO [admin1] Verify: == checking storage node sn2 ==
            07-11-12 15:47:14:645 CEST INFO [admin1] Verify: Storage Node [sn2] on pglddb06.pgdev.lppgrz.com:5020    Datacenter: PERSO pglddb06.pgdev.lppgrz.com [dc1]    Status: RUNNING   Ver: 11gR2.1.2.123
            07-11-12 15:47:14:660 CEST INFO [admin1] Verify:        Admin [admin2]          Status: RUNNING
            07-11-12 15:47:14:671 CEST INFO [admin1] Verify:        Rep Node [rg1-rn2]      Status: RUNNING,REPLICA at sequence number: 625 haPort: 5026
            07-11-12 15:47:14:671 CEST INFO [admin1] Verify: == checking storage node sn3 ==
            07-11-12 15:47:14:683 CEST INFO [admin1] Verify: Storage Node [sn3] on pglddb06.pgdev.lppgrz.com:5040    Datacenter: PERSO pglddb06.pgdev.lppgrz.com [dc1]    Status: RUNNING   Ver: 11gR2.1.2.123
            07-11-12 15:47:14:731 CEST INFO [admin1] Verify:        Admin [admin3]          Status: RUNNING
            07-11-12 15:47:14:732 CEST INFO [admin1] Verify:        Rep Node [rg1-rn3]      Status: UNREACHABLE
            07-11-12 15:47:14:733 CEST INFO [admin1] Verify: finished, configuration valid.
            It seems that it was removed from monitoring via the web GUI. As we are still "playing" with this KV Store, I am not sure whether I did this by accident or a colleague tried something out. (Since authentication is missing in the current version, it's hard to keep this under control, except by using firewalls.)

            However, if I start the repnode, the admin starts, but rg1-rn3 does not come up and nothing is logged.

            FINALLY I found a plan "StopRepNodesPlan", which explains this situation. Now I am trying to bring it up again - just have to look into the documentation :-)
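
            For reference, I expect the plan history and the restart to look roughly like this from the admin CLI (just my guess, not yet verified against our 1.2.123 install -- exact command names and arguments may differ):

            kv-> show plans
            kv-> plan start-repnodes ...        (arguments as per the CLI help)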

            Thanks, you really helped me,
            Frank