The administration guide defines "balanced" storage nodes as storage nodes with exactly one replication group, and recommends keeping the entire data store balanced.
I have a workload with an almost 50/50 update/read ratio. Each record is read exactly once before it's updated again. I think my data set takes about 20G, so 60G with a replication factor of 3. With 16G RAM servers, I'm thinking of getting 9 machines to store my data set and allow some room for growth.
To have a balanced configuration, should I create 3 replication groups of 3 servers each? Given my read/write ratio, this would leave most of the servers underused.
Wouldn't 9 replication groups of 3, with each server hosting 3 replication nodes, one of which can perform writes, make more sense? Is there a way to make sure I'm running with 1 master per server?
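The arithmetic in the post above can be written out as a small Python sketch. The function name and the idea of comparing the replicated data size against total RAM are assumptions made for illustration; real sizing also has to account for JVM, cache, and operating system overhead, which this ignores.

```python
import math

def sizing(data_gb, replication_factor, ram_per_server_gb, servers):
    """Rough capacity check; ignores JVM, cache and OS overhead."""
    replicated_gb = data_gb * replication_factor
    # Fewest servers whose combined RAM could hold every copy of the data.
    min_servers_by_ram = math.ceil(replicated_gb / ram_per_server_gb)
    headroom_gb = ram_per_server_gb * servers - replicated_gb
    return {
        "replicated_gb": replicated_gb,
        "min_servers_by_ram": min_servers_by_ram,
        "headroom_gb": headroom_gb,
    }

# Figures from the post: 20G of data, replication factor 3, nine 16G servers.
plan = sizing(data_gb=20, replication_factor=3, ram_per_server_gb=16, servers=9)
print(plan)
```

With these inputs the replicated data set is 60G, which leaves room for growth on nine 16G machines under the stated simplification.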
I assume you have gone through the sizing exercise in the Admin Guide and worked with the spreadsheet.
Whatever number of Rep Groups you end up having, NoSQL Database will evenly distribute the records across all of them. I'm not sure if your comment about many of the nodes going underutilized was in reference to that or not. Assuming a uniform workload across the keyspace, it will be a uniform workload across the Rep Groups.
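As a rough illustration of why a uniform keyspace gives a uniform load per Rep Group, here is a minimal Python sketch that hashes keys onto groups. The MD5-modulo scheme, the key format, and the function name are assumptions made for illustration; this thread does not describe how NoSQL Database actually maps keys to Rep Groups.

```python
import hashlib
from collections import Counter

def group_for_key(key, num_groups):
    """Map a key to a group index by hashing it (illustrative scheme only)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_groups

# Spread 9000 hypothetical record keys over 3 Rep Groups and count them.
counts = Counter(group_for_key(f"record/{i}", 3) for i in range(9000))
for group in sorted(counts):
    print(group, counts[group])
```

Each group should end up with close to a third of the keys, which is the sense in which the workload is spread evenly.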
All writes will go to the master Rep Node in the relevant Rep Group. Presently we do not have much support for 'master affinity', but we may in a future release.
Depending on the Consistency parameter passed to the API call, a read may or may not go to a replica. It sounds like you are doing read-modify-write (RMW), so it may be perfectly reasonable to use a Consistency policy which allows reading from a replica. The subsequent write would go to the master in any case.
One clarification: with a replication factor of 3, I have one node serving writes and two serving reads in each group. With all the writes going to the master, I have 50% of the workload going to a third of the servers, and the other 50% going to the remaining two thirds. That's the lack of balance I'm worried about.
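The imbalance described above can be put into numbers with a short Python sketch. It assumes one Rep Node per server, that every read is served by a replica, and that every operation costs the same; the function name is hypothetical.

```python
from fractions import Fraction

def per_node_load(groups, rf, write_fraction):
    """Share of all operations handled by each master and each replica.

    Assumes writes hit only masters and reads hit only replicas.
    """
    masters = groups
    replicas = groups * (rf - 1)
    master_load = write_fraction / masters
    replica_load = (1 - write_fraction) / replicas
    return master_load, replica_load

# 3 Rep Groups, replication factor 3, half of all operations are writes.
master, replica = per_node_load(groups=3, rf=3, write_fraction=Fraction(1, 2))
print(master, replica, master / replica)
```

Under these assumptions each master handles twice as many client operations as each replica. The model counts client requests only; replicas also have to apply the replicated writes, so the real gap in work per server is likely smaller.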
The sizing exercise addresses disk space, and was helpful in that regard. It did not address throughput (or I missed something significant). In theory, I'd love to place one master on each storage node, to maximize the number of servers that serve writes.
You can have multiple Rep Nodes per Storage Node, but you have to be careful of resource contention (i.e. you should have each Rep Node on a separate spindle, and you need to make sure that the Rep Nodes on a Storage Node are from different Rep Groups). Also, we don't give you very good control of master affinity, so it would be possible for multiple masters to end up on the same Storage Node. Better support for this may be available in the future.
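The layout being discussed (9 Rep Groups over 9 Storage Nodes, 3 Rep Nodes per Storage Node, all from different groups) can be sketched in Python as a ring. This is only a model of the desired placement: the function and the "master" labels are hypothetical, and as noted above the product does not currently let you pin masters, so the one-master-per-node outcome is an ideal rather than a guarantee.

```python
def ring_layout(servers, rf):
    """Place group g on servers g, g+1, ..., g+rf-1 (mod servers).

    The first server of each group is marked as the intended master.
    """
    layout = {s: [] for s in range(servers)}
    for group in range(servers):
        for k in range(rf):
            server = (group + k) % servers
            role = "master" if k == 0 else "replica"
            layout[server].append((group, role))
    return layout

layout = ring_layout(servers=9, rf=3)
for server, nodes in layout.items():
    print(server, nodes)
```

In this model every Storage Node hosts three Rep Nodes from three different Rep Groups, and exactly one of them is the intended master. A failover would move a master elsewhere and break that symmetry.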