We have a number of clusters (each typically comprising 8 boxes) in different data centres, all running fine. Recently we were asked to build a new one in a new location. The usual procedure was followed: building the boxes, putting them on the same switch, assigning a dedicated multicast address, and running the datagram/multicast tests. Everything looks all right. The code deployed to the new cluster is the same as on the existing ones.
Now, if I start storage nodes on two of those boxes, statusHA reports MACHINE-SAFE. But the moment I start nodes on any other machine, the new nodes join the cluster, statusHA drops to NODE-SAFE, and it stays there. Some logs show "deferring the distribution due to ... pending configuration update".
We think this is to do with the infrastructure, but we have been told that the boxes have the same build, that multicast is running OK, and so on.
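One check we can run ourselves is to look at which machine identity each member reports, since MACHINE-SAFE depends on backups landing on a different machine than the primary. Below is a minimal sketch of that check over JMX. It assumes Coherence management is enabled on the JVM it runs in, that the MBeans are registered under the default `Coherence` domain, and that the node MBean exposes `MachineName` and the service MBean exposes `StatusHA`; these names should be verified against the Coherence version in use. The class name `MachineIdentityCheck` and its helper methods are hypothetical, made up for this illustration.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

public class MachineIdentityCheck {

    // Reads nodeId -> reported machine name from the node MBeans.
    // Assumption: MBeans live under "Coherence:type=Node,nodeId=N" and
    // expose a "MachineName" attribute; verify for your version.
    static Map<Integer, String> readMachineNames(MBeanServerConnection mbs) throws Exception {
        Map<Integer, String> result = new TreeMap<>();
        for (ObjectName name : mbs.queryNames(new ObjectName("Coherence:type=Node,*"), null)) {
            int nodeId = Integer.parseInt(name.getKeyProperty("nodeId"));
            Object machine = mbs.getAttribute(name, "MachineName");
            result.put(nodeId, String.valueOf(machine));
        }
        return result;
    }

    // Groups node ids by machine name so that boxes sharing an identity
    // (or reporting none) stand out immediately.
    static Map<String, List<Integer>> groupByMachine(Map<Integer, String> machineByNode) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<Integer, String> e : new TreeMap<>(machineByNode).entrySet()) {
            grouped.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        return grouped;
    }

    public static void main(String[] args) throws Exception {
        // Uses the local platform MBean server; a remote JMX connection
        // would work the same way through MBeanServerConnection.
        MBeanServerConnection mbs = ManagementFactory.getPlatformMBeanServer();

        groupByMachine(readMachineNames(mbs))
                .forEach((machine, nodes) -> System.out.println(machine + " -> nodes " + nodes));

        // Print the HA status reported by each service instance.
        for (ObjectName svc : mbs.queryNames(new ObjectName("Coherence:type=Service,*"), null)) {
            System.out.println(svc + " StatusHA=" + mbs.getAttribute(svc, "StatusHA"));
        }
    }
}
```

If two physical boxes show up under the same machine name, or storage nodes are spread very unevenly across machine names, that would be worth ruling out before looking deeper at the network.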
How can we take the investigation further and identify the root cause?