This discussion is archived
2 Replies Latest reply: Jul 20, 2012 7:31 AM by puchtec RSS

Status UNREACHABLE on RG=1 RN=3 -- How to bring it back into RUNNING ?

puchtec Newbie
Currently Being Moderated
Hi,

I found that status of RepNode group 1, node 3 became UNREACHABLE, while Admin Instance on this node works fine.

Restart of this node also ended up in status UNREACHABLE. Content of logfile (rg1-rn3_0.log):
07-11-12 15:08:01:275 CEST INFO [rg1-rn3] Starting service process: rg1-rn3, Java command line arguments: [-XX:+UseConcMarkSweepGC, -XX:+UseParNewGC, -XX:+DisableExplicitGC, -XX:NewRatio=18, -XX:SurvivorRatio=4, -XX:MaxTenuringThreshold=10, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseLargePages]
07-11-12 15:08:01:317 CEST INFO [rg1-rn3] RepNodeStateUpdateThread started
07-11-12 15:08:01:344 CEST INFO [rg1-rn3] Partition db cache mode: EVICT_LN
07-11-12 15:08:01:402 CEST INFO [rg1-rn3] JVM Runtime maxMemory (bytes): 131006464
07-11-12 15:08:01:403 CEST INFO [rg1-rn3] Non-default JE properties for environment: {je.rep.vlsn.logCacheSize=128, je.rep.envUnknownStateTimeout=10 s, je.tree.maxDelta=10, je.log.faultReadSize=4096, je.rep.vlsn.distance=1000000, je.env.runEvictor=true, je.cleaner.minUtilization=40, je.rep.replicaAckTimeout=5 s, je.cleaner.threads=2, je.evictor.lruOnly=false, je.rep.replayMaxOpenDbHandles=100, je.log.writeQueueSize=2097152, je.evictor.criticalPercentage=105, je.log.iteratorReadSize=1048576, je.log.numBuffers=16, je.nodeMaxEntries=128, je.evictor.nodesPerScan=200, je.rep.preHeartbeatTimeoutMs=5000000000, je.checkpointer.bytesInterval=200000000, je.cleaner.lazyMigration=false, je.txn.durability=write_no_sync,write_no_sync,simple_majority, je.cleaner.readSize=1048576, je.rep.txnRollbackLimit=10, je.log.fileCacheSize=2000, je.log.fileMax=1073741824}
07-11-12 15:08:01:404 CEST INFO [rg1-rn3] Created new environment dir: /oracle/DEVPGN06/NODE03/PERSO/PERSO/sn3/rg1-rn3/env
07-11-12 15:08:01:433 CEST INFO [rg1-rn3] Starting RepNodeService
07-11-12 15:08:01:497 CEST INFO [rg1-rn3] Starting MonitorAgent
07-11-12 15:08:01:515 CEST INFO [rg1-rn3] RepNodeAdmin registered
07-11-12 15:08:01:753 CEST INFO [rg1-rn3] JE recovery: POPULATE_UTILIZATION_PROFILE
07-11-12 15:08:01:779 CEST INFO [rg1-rn3] JE recovery: REMOVE_TEMP_DBS
07-11-12 15:08:01:803 CEST INFO [rg1-rn3] JE recovery: CKPT
07-11-12 15:08:01:832 CEST INFO [rg1-rn3] JE recovery: RECOVERY_FINISHED
07-11-12 15:08:01:862 CEST INFO [rg1-rn3] JE: Started ServiceDispatcher. HostPort=pglddb06.pgdev.lppgrz.com:5046
07-11-12 15:08:01:911 CEST INFO [rg1-rn3] JE: Current group size: 0
07-11-12 15:08:01:911 CEST INFO [rg1-rn3] JE: New node rg1-rn3(-1) unknown to rep group
07-11-12 15:08:01:919 CEST INFO [rg1-rn3] JE: Master changed to rg1-rn1
07-11-12 15:08:01:920 CEST INFO [rg1-rn3] JE: New node rg1-rn3 located master: rg1-rn1(1)
07-11-12 15:08:01:923 CEST INFO [rg1-rn3] JE recovery: BECOME_CONSISTENT
07-11-12 15:08:01:924 CEST INFO [rg1-rn3] JE: Node rg1-rn3 started
07-11-12 15:08:01:924 CEST INFO [rg1-rn3] JE: Replica loop started with master: rg1-rn1(1)
07-11-12 15:08:01:934 CEST INFO [rg1-rn3] JE: Replica-feeder handshake start
07-11-12 15:08:01:965 CEST INFO [rg1-rn3] JE: Replica-feeder rg1-rn1 handshake completed.
07-11-12 15:08:01:968 CEST INFO [rg1-rn3] JE: Replica-feeder rg1-rn1 syncup started. Replica range: first=-1 last=-1 sync=-1 txnEnd=-1
07-11-12 15:08:01:968 CEST INFO [rg1-rn3] JE syncup: FIND_MATCHPOINT 0/-1
07-11-12 15:08:01:976 CEST INFO [rg1-rn3] JE syncup: CHECK_FOR_ROLLBACK 0/-1
07-11-12 15:08:01:977 CEST INFO [rg1-rn3] JE: Rollback to matchpoint -1 at 0x0/0x0 status=No active txns, nothing to rollback
07-11-12 15:08:01:977 CEST INFO [rg1-rn3] JE: Replica-feeder rg1-rn1 start stream at VLSN: 1
07-11-12 15:08:01:978 CEST INFO [rg1-rn3] JE: Replica-feeder rg1-rn1 syncup ended. Elapsed time: 10ms
07-11-12 15:08:01:978 CEST INFO [rg1-rn3] JE syncup: END 0/-1
07-11-12 15:08:01:983 CEST INFO [rg1-rn3] JE: Replica initialization completed. Replica VLSN: -1  Heartbeat master commit VLSN: 623 VLSN delta: 624
07-11-12 15:08:02:34 CEST INFO [rg1-rn3] JE: Joined group as a replica.  join consistencyPolicy=com.sleepycat.je.rep.NoConsistencyRequiredPolicy@1f first=1 last=48 sync=47 txnEnd=47
07-11-12 15:08:02:36 CEST INFO [rg1-rn3] JE: Refreshed 0 monitors.
07-11-12 15:08:02:36 CEST INFO [rg1-rn3] State change event: Wed Jul 11 15:08:01 CEST 2012, State: REPLICA, Master: rg1-rn1
07-11-12 15:08:02:37 CEST INFO [rg1-rn3] Replicated environment handle established. Cache size: 78603878, State: REPLICA
07-11-12 15:08:02:82 CEST INFO [rg1-rn3] Environment does not contain topology.
07-11-12 15:08:02:82 CEST INFO [rg1-rn3] Topology needs to be pushed to this node.
07-11-12 15:08:02:107 CEST INFO [rg1-rn3] Service status changed from STARTING to RUNNING
07-11-12 15:08:02:107 CEST INFO [rg1-rn3] Started RepNodeService
07-11-12 15:08:03:141 CEST INFO [rg1-rn3] Topology updated from seq#: 0 to 308
07-11-12 15:08:03:143 CEST INFO [rg1-rn3] Establishing partition database handles:300
07-11-12 15:08:03:228 CEST WARNING [rg1-rn3] Number of partitions/rep group:300 configured handles:100 Use policy command to update the replayMaxOpenDbHandles property
07-11-12 15:08:03:228 CEST INFO [rg1-rn3] Established partition database handles: 300
07-11-12 15:08:03:585 CEST INFO [rg1-rn3] Topology stored seq#: 308
07-11-12 15:08:03:778 CEST INFO [rg1-rn3] newParameters: refreshing parameters
07-11-12 15:47:06:877 CEST INFO [rg1-rn3] Service status changed from RUNNING to STOPPING
07-11-12 15:47:06:879 CEST INFO [rg1-rn3] RepNodeStateUpdateThread shutdown
07-11-12 15:47:06:880 CEST INFO [rg1-rn3] Dispatcher shutdown
07-11-12 15:47:06:880 CEST INFO [rg1-rn3] Shutting down RepNode
07-11-12 15:47:06:887 CEST INFO [rg1-rn3] JE: Shutting down node rg1-rn3(3)
07-11-12 15:47:06:888 CEST INFO [rg1-rn3] JE: Refreshed 0 monitors.
07-11-12 15:47:06:888 CEST INFO [rg1-rn3] JE: Elections shutdown initiated
07-11-12 15:47:06:921 CEST INFO [rg1-rn3] JE: Elections shutdown completed
07-11-12 15:47:06:922 CEST INFO [rg1-rn3] JE: Replica IO exception: Expected bytes: 6 read bytes: 0
java.io.IOException: Expected bytes: 6 read bytes: 0
        at com.sleepycat.je.rep.utilint.BinaryProtocol.fillBuffer(BinaryProtocol.java:367)
        at com.sleepycat.je.rep.utilint.BinaryProtocol.read(BinaryProtocol.java:388)
        at com.sleepycat.je.rep.impl.node.Replica.doRunReplicaLoopInternalWork(Replica.java:443)
        at com.sleepycat.je.rep.impl.node.Replica.runReplicaLoopInternal(Replica.java:374)
        at com.sleepycat.je.rep.impl.node.Replica.runReplicaLoop(Replica.java:309)
        at com.sleepycat.je.rep.impl.node.RepNode.run(RepNode.java:1318)

07-11-12 15:47:06:923 CEST INFO [rg1-rn3] JE: Exiting inner Replica loop.
07-11-12 15:47:06:925 CEST INFO [rg1-rn3] JE: Replica stats - Lag waits: 0 Lag wait time: 0ms.  VLSN waits: 0 Lag wait time: 0ms.
07-11-12 15:47:06:925 CEST INFO [rg1-rn3] JE: RepNode main thread shutting down.
07-11-12 15:47:06:927 CEST INFO [rg1-rn3] State change event: Wed Jul 11 15:47:06 CEST 2012, State: DETACHED, Master: none
07-11-12 15:47:06:928 CEST INFO [rg1-rn3] JE: RepNode main thread: DETACHED rg1-rn3(3) exited.
07-11-12 15:47:06:928 CEST INFO [rg1-rn3] JE: ServiceDispatcher shutdown starting. HostPort=pglddb06.pgdev.lppgrz.com:5046 Registered services: []
07-11-12 15:47:06:929 CEST INFO [rg1-rn3] JE: ServiceDispatcher shutdown completed. HostPort=pglddb06.pgdev.lppgrz.com:5046
07-11-12 15:47:06:929 CEST INFO [rg1-rn3] JE: rg1-rn3(3) shutdown completed.
07-11-12 15:47:06:942 CEST INFO [rg1-rn3] RepNodeAdmin stopping
07-11-12 15:47:06:942 CEST INFO [rg1-rn3] Service status changed from STOPPING to STOPPED
07-11-12 15:47:06:942 CEST INFO [rg1-rn3] MonitorAgent stopping, 21 items remain.
07-11-12 15:47:06:943 CEST INFO [rg1-rn3] Saving untransmitted monitor info at shutdown.
07-11-12 15:47:06:944 CEST INFO [rg1-rn3] Service status changed from RUNNING to STOPPING
07-11-12 15:47:06:944 CEST INFO [rg1-rn3] Service status: STOPPING 07-11-12 15:47:06
07-11-12 15:47:06:945 CEST INFO [rg1-rn3] RepNodeStateUpdateThread shutdown
07-11-12 15:47:06:945 CEST INFO [rg1-rn3] Dispatcher shutdown
07-11-12 15:47:06:945 CEST INFO [rg1-rn3] Shutting down RepNode
07-11-12 15:47:06:945 CEST INFO [rg1-rn3] JE: Shutting down node rg1-rn3(3)
07-11-12 15:47:06:945 CEST INFO [rg1-rn3] JE: Refreshed 0 monitors.
07-11-12 15:47:06:945 CEST INFO [rg1-rn3] JE: Elections shutdown initiated
07-11-12 15:47:06:945 CEST INFO [rg1-rn3] JE: Elections shutdown completed
07-11-12 15:47:06:945 CEST INFO [rg1-rn3] JE: Replica IO exception: Expected bytes: 6 read bytes: 0
java.io.IOException: Expected bytes: 6 read bytes: 0
        at com.sleepycat.je.rep.utilint.BinaryProtocol.fillBuffer(BinaryProtocol.java:367)
        at com.sleepycat.je.rep.utilint.BinaryProtocol.read(BinaryProtocol.java:388)
        at com.sleepycat.je.rep.impl.node.Replica.doRunReplicaLoopInternalWork(Replica.java:443)
        at com.sleepycat.je.rep.impl.node.Replica.runReplicaLoopInternal(Replica.java:374)
        at com.sleepycat.je.rep.impl.node.Replica.runReplicaLoop(Replica.java:309)
        at com.sleepycat.je.rep.impl.node.RepNode.run(RepNode.java:1318)

07-11-12 15:47:06:946 CEST INFO [rg1-rn3] JE: Exiting inner Replica loop.
07-11-12 15:47:06:946 CEST INFO [rg1-rn3] JE: Replica stats - Lag waits: 0 Lag wait time: 0ms.  VLSN waits: 0 Lag wait time: 0ms.
07-11-12 15:47:06:946 CEST INFO [rg1-rn3] JE: RepNode main thread shutting down.
07-11-12 15:47:06:946 CEST INFO [rg1-rn3] State change event: Wed Jul 11 15:47:06 CEST 2012, State: DETACHED, Master: none
07-11-12 15:47:06:946 CEST INFO [rg1-rn3] JE: RepNode main thread: DETACHED rg1-rn3(3) exited.
07-11-12 15:47:06:946 CEST INFO [rg1-rn3] JE: ServiceDispatcher shutdown starting. HostPort=pglddb06.pgdev.lppgrz.com:5046 Registered services: []
07-11-12 15:47:06:946 CEST INFO [rg1-rn3] JE: ServiceDispatcher shutdown completed. HostPort=pglddb06.pgdev.lppgrz.com:5046
07-11-12 15:47:06:946 CEST INFO [rg1-rn3] JE: rg1-rn3(3) shutdown completed.
07-11-12 15:47:06:949 CEST INFO [rg1-rn3] RepNodeAdmin stopping
07-11-12 15:47:06:950 CEST INFO [rg1-rn3] Service status changed from STOPPING to STOPPED
07-11-12 15:47:06:950 CEST INFO [rg1-rn3] Service status: STOPPED 07-11-12 15:47:06
07-11-12 15:47:06:950 CEST INFO [rg1-rn3] MonitorAgent stopping, 21 items remain.
Any idea how I can bring it back?


Thanks in advance,
Frank
  • 1. Re: Status UNREACHABLE on RG=1 RN=3 -- How to bring it back into RUNNING ?
    Linda Lee Journeyer
    Currently Being Moderated
    Frank,

    In your log, there is a gap between 15:08 and 15:47, when presumably the node was up, and available to service operations.

    07-11-12 15:08:03:585 CEST INFO [rg1-rn3] Topology stored seq#: 308
    07-11-12 15:08:03:778 CEST INFO [rg1-rn3] newParameters: refreshing parameters
    07-11-12 15:47:06:877 CEST INFO [rg1-rn3] Service status changed from RUNNING to STOPPING

    The STOPPING/STOPPED service status means that a shutdown was issued to this node. This could come about if an expliciit shutdown-repnodes command is issued from the KVAdmin command line, or the web GUI. A shutdown can also be be executed when a change-params plan is run, though in that case, the node should be shutdown and then restarted automatically, transparently to the user.

    Can you investigate the NoSQL DB logs and your own application history to see if there are hints as to what might have happened around 15:47 to start a shutdown? In particular, you may want to look in the admin.log, and/or the <yourStoreName>.log in the logs directory.

    If the repnode stays down, it seems that its status has been changed so it is disabled -- which means that the user intentionally wishes this service to stay down. Otherwise, the node should restart on its own. Try the show-topology and verify commands, from the command line or web GUI, to look at the status of the node, to see if is marked as disabled.

    If it is disabled, you can restart it with the start-repnodes command.
    Linda
  • 2. Re: Status UNREACHABLE on RG=1 RN=3 -- How to bring it back into RUNNING ?
    puchtec Newbie
    Currently Being Moderated
    Linda,

    thanks for the quick response.

    As the KV Store is currently not used by our application, so no application history is available. The KV Store was created and is just up (wihtout any data).

    Thanks for the hint about the remaining "NoSQL DB logs".

    I took a look into the rg1-rn1_0.log and found those lines:
    07-11-12 15:08:03:706 CEST INFO [rg1-rn1] newParameters: refreshing parameters
    07-11-12 15:47:06:923 CEST INFO [rg1-rn1] JE: Shutting down feeder for replica rg1-rn3 Reason: Expected bytes: 6 read bytes: 0 write time:  120ms Avg write time: 41us
    07-11-12 15:47:07:27 CEST INFO [rg1-rn1] JE: Feeder output for replica rg1-rn3 shutdown. feeder VLSN: 626 currentTxnEndVLSN: 625
    07-19-12 17:11:42:899 CEST INFO [rg1-rn1] Service status changed from RUNNING to STOPPING
    07-19-12 17:11:42:916 CEST INFO [rg1-rn1] RepNodeStateUpdateThread shutdown
    At 19/07/2012 the shutdown was initiated by an kvctl stop.

    I also took a look into the admin.log where I could found:
    07-11-12 15:47:06:862 CEST INFO [admin1] Plan 11[Plan-11]:  plan started, 1 tasks
    07-11-12 15:47:06:866 CEST INFO [admin1] Plan 11[Plan-11]: started task 1: StopRepNode
    07-11-12 15:47:06:916 CEST INFO [admin1] New admin web client at 10.51.21.48, session id 13su4sydo5ku51t55zv1v1efb7
    07-11-12 15:47:07:340 CEST INFO [admin1] Removing rg1-rn3 from monitoring
    07-11-12 15:47:07:346 CEST INFO [admin1] Plan 11[Plan-11]: ended task 1, state= SUCCEEDED
    07-11-12 15:47:07:349 CEST INFO [admin1] Plan 11[Plan-11]: plan ended with state SUCCEEDED
    07-11-12 15:47:07:356 CEST INFO [admin1] [sn3] rg1-rn3: Service status: STOPPED 07-11-12 15:47:07
    07-11-12 15:47:14:599 CEST INFO [admin1] Verify: starting for PERSO based upon topology sequence #308, 300 partitions and 3 Storage Nodes, version:11gR2.1.2.123
    07-11-12 15:47:14:600 CEST INFO [admin1] Verify: == checking storage node sn1 ==
    07-11-12 15:47:14:617 CEST INFO [admin1] Verify: Storage Node [sn1] on pglddb06.pgdev.lppgrz.com:5000    Datacenter: PERSO pglddb06.pgdev.lppgrz.com [dc1]    Status: RUNNING   Ver: 11gR2.1.2.123
    07-11-12 15:47:14:621 CEST INFO [admin1] Verify:        Admin [admin1]          Status: RUNNING
    07-11-12 15:47:14:633 CEST INFO [admin1] Verify:        Rep Node [rg1-rn1]      Status: RUNNING,MASTER at sequence number: 625 haPort: 5006
    07-11-12 15:47:14:633 CEST INFO [admin1] Verify: == checking storage node sn2 ==
    07-11-12 15:47:14:645 CEST INFO [admin1] Verify: Storage Node [sn2] on pglddb06.pgdev.lppgrz.com:5020    Datacenter: PERSO pglddb06.pgdev.lppgrz.com [dc1]    Status: RUNNING   Ver: 11gR2.1.2.123
    07-11-12 15:47:14:660 CEST INFO [admin1] Verify:        Admin [admin2]          Status: RUNNING
    07-11-12 15:47:14:671 CEST INFO [admin1] Verify:        Rep Node [rg1-rn2]      Status: RUNNING,REPLICA at sequence number: 625 haPort: 5026
    07-11-12 15:47:14:671 CEST INFO [admin1] Verify: == checking storage node sn3 ==
    07-11-12 15:47:14:683 CEST INFO [admin1] Verify: Storage Node [sn3] on pglddb06.pgdev.lppgrz.com:5040    Datacenter: PERSO pglddb06.pgdev.lppgrz.com [dc1]    Status: RUNNING   Ver: 11gR2.1.2.123
    07-11-12 15:47:14:731 CEST INFO [admin1] Verify:        Admin [admin3]          Status: RUNNING
    07-11-12 15:47:14:732 CEST INFO [admin1] Verify:        Rep Node [rg1-rn3]      Status: UNREACHABLE
    07-11-12 15:47:14:733 CEST INFO [admin1] Verify: finished, configuration valid.
    Seems like that it was removed from monitoring by the web GUI. As we are still "playing" with this KV Store I am not sure if I did this per accident or if a colleague tried out something. (Hence auth is missing in the current version, it's hard to keep this under control, except by using firewalls)

    However, if I start the repnode, the admin will start, but rg1-rn3 will not come up and nothing is going to be logged.

    FINALLY I found a plan "StopRepoNodesPlan" which explains this situation. Now I am trying to bring it up again - just have to look into the documentation :-)

    Thanks, you really helped me,
    Frank

Legend

  • Correct Answers - 10 points
  • Helpful Answers - 5 points