
Explain Behavior when killing some processes...

chpruvos Newbie
Hi all

My config

kv-> sho topology
store=mystore numPartitions=100 sequence=107
dc=[dc1] name=MyDC repFactor=1

sn=[sn1] dc=dc1 PRUVOST-PC.fr.oracle.com:5000 capacity=1 RUNNING
[rg1-rn1] RUNNING
No performance info available
sn=[sn2] dc=dc1 PRUVOST-PC.fr.oracle.com:5020 capacity=1 RUNNING
[rg2-rn1] RUNNING
No performance info available

shard=[rg1] num partitions=50
[rg1-rn1] sn=sn1
shard=[rg2] num partitions=50
[rg2-rn1] sn=sn2

On Windows, I kill the console from which I launched the SNA --> java -jar ..\lib\kvstore-2.0.23.jar start -root root2

Then, with my client in Eclipse, I got an error when trying to do a get:

Exception in thread "main" oracle.kv.RequestTimeoutException: Request dispatcher:c-7176710171132897515, dispatch timed out after 0 retries. Target:rg2-rn1 Timeout:5000ms (11.2.2.0.23)
Fault class name: oracle.kv.RequestTimeoutException
     at oracle.kv.impl.api.RequestDispatcherImpl.execute(RequestDispatcherImpl.java:545)
     at oracle.kv.impl.api.RequestDispatcherImpl.execute(RequestDispatcherImpl.java:939)
     at oracle.kv.impl.api.KVStoreImpl.executeRequest(KVStoreImpl.java:1223)
     at oracle.kv.impl.api.KVStoreImpl.get(KVStoreImpl.java:250)
     at oracle.kv.impl.api.KVStoreImpl.get(KVStoreImpl.java:235)
     at test.NoSQLTest.getTest(NoSQLTest.java:114)
     at test.NoSQLTest.main(NoSQLTest.java:64)
Caused by: java.lang.IllegalStateException: Could not establish handle to rg2-rn1
     at oracle.kv.impl.api.RequestDispatcherImpl.execute(RequestDispatcherImpl.java:492)
     ... 6 more

My initialization code is below:

import oracle.kv.Consistency;
import oracle.kv.Durability;
import oracle.kv.KVStore;
import oracle.kv.KVStoreConfig;
import oracle.kv.KVStoreFactory;

private String[] hhosts = {"localhost:5000", "localhost:5020"};
KVStoreConfig storeConfig = null;
KVStore store = null;

public NoSQLTest(String consistency, String durability) {
     storeConfig = new KVStoreConfig("mystore", hhosts);
     Durability defaultDurability = null;

     // Default read consistency for every operation on this handle.
     if (consistency.equals("None")) {
          storeConfig.setConsistency(Consistency.NONE_REQUIRED);
     }
     else if (consistency.equals("Absolute")) {
          storeConfig.setConsistency(Consistency.ABSOLUTE);
     }
     else {
          System.out.println("ERROR - Define a valid consistency");
          System.exit(0);
     }

     // Default write durability for every operation on this handle.
     if (durability.equals("No_Sync")) {
          defaultDurability = new Durability(
               Durability.SyncPolicy.NO_SYNC,       // master sync
               Durability.SyncPolicy.NO_SYNC,       // replica sync
               Durability.ReplicaAckPolicy.NONE);
          storeConfig.setDurability(defaultDurability);
     }
     else if (durability.equals("Sync")) {
          defaultDurability = new Durability(
               Durability.SyncPolicy.SYNC,          // master sync
               Durability.SyncPolicy.SYNC,          // replica sync
               Durability.ReplicaAckPolicy.NONE);
          storeConfig.setDurability(defaultDurability);
     }
     else {
          System.out.println("ERROR - Define a valid durability");
          System.exit(0);
     }

     store = KVStoreFactory.getStore(storeConfig);
}
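
For reference, the constructor is invoked along these lines (a sketch only; the real main() at NoSQLTest.java:64 is not shown here):

public static void main(String[] args) {
     // "None" and "No_Sync" select NONE_REQUIRED consistency and NO_SYNC
     // durability in the constructor above ("Absolute"/"Sync" are the others).
     NoSQLTest test = new NoSQLTest("None", "No_Sync");
     // test.getTest();  // the get() at NoSQLTest.java:114 that timed out
}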
  • 1. Re: Explain Behavior when killing some processes...
    gmfeinberg Journeyer
    Hi,

    This is somewhat related to your recent post on your other thread about the processes present and should also explain the 2 extra processes.

    Your configuration has 2 storage nodes/SNAs. Each SNA manages a single RepNode, which is the process that serves client requests. Because your replication factor is 1, the keys you store are hashed across 2 shards, each hosted by one of the RepNodes. You cannot directly control which shard gets a particular record.

    When you did this:
    "On windows I kill the console where I launch the SNA --> java -jar ..\lib\kvstore-2.0.23.jar start -root root2"
    You also killed the RepNode on that SNA, which is why your client can no longer reach it. The SNA (Storage Node Agent) process "manages" the RepNode(s) on the storage node as well as any admin processes. This means it will automatically restart them if they fail. It also means that if you kill the SNA, it will also kill its managed processes. The SNA processes are the "extra" processes you mentioned in your other thread.

    I assume you intended to do this to test availability, but the problem is that you also effectively eliminated the ability to access 1/2 of your keys, including the one in your test client. That explains the client failure.

    In order to get true availability in the face of RepNode/SNA failure, you need a replication factor of at least 3 to ensure that there is a replica available to take over if one fails.

    Regards,

    George
  • 2. Re: Explain Behavior when killing some processes...
    chpruvos Newbie
    Ok ...

    1) So does it mean that if we are in a situation where Oracle NoSQL cannot access 100% of the data, then a client cannot connect?

    Is this an implementation choice? Another choice could be to allow the client to connect but let it work with only 50% of the data.

    2) I am not sure I understand why a replication factor of 3 is the solution for availability. If we lose a shard = one storage node + all replica nodes associated with this storage node (maybe 3) [is this a shard?], do we still have access to 100% of the data? Maybe when data arrives on a storage node on one server, the replicated data can go to a replica node on another server, or on the same server?

    Thank you for your very interesting explanation...

    Christophe.
  • 3. Re: Explain Behavior when killing some processes...
    gmfeinberg Journeyer
    Christophe,

    First, it will help if you read this section of the admin guide that describes shards and replication.
    http://docs.oracle.com/cd/NOSQL/html/AdminGuide/introduction.html#kvstore
    >
    1) So does it mean that if we are in a situation where Oracle NoSQL cannot access 100% of the data, then a client cannot connect?
    No, it just cannot connect to the specific node that is hosting the data to be accessed. Based on the stack trace, you did not post your entire program. Your stack had this in it:
    at oracle.kv.impl.api.KVStoreImpl.get(KVStoreImpl.java:250)
    at oracle.kv.impl.api.KVStoreImpl.get(KVStoreImpl.java:235)
    at test.NoSQLTest.getTest(NoSQLTest.java:114)
    at test.NoSQLTest.main(NoSQLTest.java:64)

    This implies you had already connected and obtained a KVStore object. The operation that failed was a KVStore.get(), which is not shown in your code. That get() is presumably attempting to fetch a key that lives in one of the partitions of the shard you killed. If you had attempted to get a record from the running shard, it would have worked.
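
    For illustration, here is the kind of get() that fails in that way. This is a sketch only: the key "/user/42" and the error handling are assumptions, since getTest() was not posted.

    import oracle.kv.Key;
    import oracle.kv.RequestTimeoutException;
    import oracle.kv.ValueVersion;

    // Whether "/user/42" hashes to a partition in rg1 or rg2 is decided
    // by the store's hash function, not by the client.
    Key key = Key.fromString("/user/42");
    try {
        ValueVersion vv = store.get(key); // routed to the shard owning the key's partition
        if (vv != null) {
            System.out.println(new String(vv.getValue().getValue()));
        }
    } catch (RequestTimeoutException rte) {
        // Thrown when the RepNode hosting that partition (here rg2-rn1) is
        // down and, with replication factor 1, no replica can take over.
        System.err.println("Shard unavailable: " + rte.getMessage());
    }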

    >
    Is this an implementation choice? Another choice could be to allow the client to connect but let it work with only 50% of the data.
    It will access the other data, as I mentioned. But you cannot really control the hash function that determines which shard hosts a particular key. The client is "smart" in that it will route a given request directly to the shard that is hosting the data, which is why your get() failed in that manner. Even if the request had gotten to the other shard it still would have failed because the node hosting the data is down and there is no replica.

    >
    2) I am not sure I understand why a replication factor of 3 is the solution for availability. If we lose a shard = one storage node + all replica nodes associated with this storage node (maybe 3) [is this a shard?], do we still have access to 100% of the data? Maybe when data arrives on a storage node on one server, the replicated data can go to a replica node on another server, or on the same server?
    As explained in the documentation, at a high level a store is split into a number of partitions, determined when you create the store (in your case, 100 partitions). Based on the number and capacity of storage nodes you've provisioned, as well as the replication factor (RF), the store is split into a number of shards. A single shard will generally handle Npartitions/Nshards partitions. In your simple case, each shard hosts 50 partitions.

    A shard comprises one or more of what we call "replication nodes" (RepNodes), which are the actual server processes that handle client requests. In the case of replication factor 1 (your case) a shard has a single RepNode process. This means it is not particularly available: if it fails, you simply can't access the data it hosts. If you want a highly available store you must use a higher replication factor. This means 3 (I don't want to get into RF 2 right now -- briefly, in that case a failure means you can still read but not write). RF 3 means you have 3 RepNodes serving a given shard -- one master and 2 replicas. In this case, if one of them fails, another node is available to take over traffic. You generally will not see this failover -- it's transparent to the client.
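
    For example, with RF 3 a read that does not demand ABSOLUTE consistency can be served by a surviving replica while the shard elects a new master. A sketch (the key and the 5-second timeout are arbitrary):

    import java.util.concurrent.TimeUnit;
    import oracle.kv.Consistency;
    import oracle.kv.Key;
    import oracle.kv.ValueVersion;

    // With NONE_REQUIRED, the request dispatcher may send the read to any
    // of the shard's 3 RepNodes, so the loss of one is invisible here.
    ValueVersion vv = store.get(Key.fromString("/user/42"),
                                Consistency.NONE_REQUIRED,
                                5000, TimeUnit.MILLISECONDS);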

    Note that if you want RF 3 you'll need additional storage nodes (a multiple of 3). In a deployed store that means more machines, but in your artificial test environment you can fake it with additional SNs on your machine. We test in that configuration all the time.
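
    For example, three SNs on one machine can be bootstrapped roughly like this (a sketch; the ports are arbitrary, and the makebootconfig options should be checked against the Admin Guide for your release):

    java -jar ..\lib\kvstore-2.0.23.jar makebootconfig -root root1 -port 5000 -admin 5001 -host localhost -harange 5010,5019
    java -jar ..\lib\kvstore-2.0.23.jar makebootconfig -root root2 -port 5020 -host localhost -harange 5030,5039
    java -jar ..\lib\kvstore-2.0.23.jar makebootconfig -root root3 -port 5040 -host localhost -harange 5050,5059
    java -jar ..\lib\kvstore-2.0.23.jar start -root root1
    java -jar ..\lib\kvstore-2.0.23.jar start -root root2
    java -jar ..\lib\kvstore-2.0.23.jar start -root root3

    You would then deploy a topology with repFactor=3 across the three SNs from the admin CLI.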

    I hope this helps,

    Regards,

    George
  • 4. Re: Explain Behavior when killing some processes...
    chpruvos Newbie
    I read the doc before, but I did not understand it all...

    Thank you very much for your explanation...
