I have a 4-server cluster on 2.2. I am having problems with the servers randomly rebooting and sometimes just flat out locking up. I have tried increasing and decreasing the timeouts with "service o2cb configure" with no luck. The VM, heartbeat, and management traffic all currently share the same network. I would like to separate these out to see if that could be my problem. The servers are Dell R710s with a Dell MD3200 attached by SAS.
So, to start with just the heartbeat traffic: I have 3 free NICs on each server. Say I grab another switch to keep this simple and create a new network. Besides setting up the NICs, is the /etc/ocfs2/cluster.conf file on each node the only place I need to change this?
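To make the question concrete, this is the kind of edit I have in mind. It is only a sketch with made-up values: the node name, the cluster name, and the 10.0.0.x address on the new heartbeat network are placeholders, not my real configuration.

```
# Hypothetical node stanza; every value below is a placeholder.
# ip_address is the line I would point at the new heartbeat NIC.
node:
	ip_port = 7777
	ip_address = 10.0.0.1
	number = 0
	name = node1
	cluster = ocfs2
```

I assume each node's stanza would need its new address, the file would have to match on all four nodes, and the o2cb stack would need a restart to pick it up. The comment lines are only annotation; I am not sure the parser accepts them, so I would leave them out of the real file.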
My search skills seem to be lacking. The only references I can find say that people have done it, but give no specifics. Perhaps I missed this in one of the many Oracle docs I have covered.
So today I moved all the VMs across xenbridges 1, 2, and 3. I have none on xenbridge 0, which carries all of my management, heartbeat, and migration traffic. I still have the same issue with hosts rebooting.
One example is when creating a Windows or Linux VM: when the guest gets to the part where it formats its disk, the host it's on reboots every time. This is very frustrating and unfortunate. I've done all I know to do with what I've found in the documentation.
Any ideas would be appreciated.
Yeah, I'm worried about the storage being the issue as well.
I just came across this in /var/log/messages:
modprobe: FATAL: Module ocfs2_stackglue not found.
That's the only error I see in that log, and the times I see it are right when the host reboots...
Well, I caught an error on the console of the physical host. Typically, by the time I made it down to the datacenter, the host had already rebooted. The error was: "PCI-DMA: Out of SW-IOMMU space for...". I found a note on Oracle support that says to add swiotlb=128 to the module line in grub.conf. That seems to have solved my issue for now. We shall see. Thanks for your replies.
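For anyone who lands here later, the change looks roughly like this. It is a sketch rather than my exact file: the title, the file names, the `<version>` placeholders, and the root= argument are illustrative. I am also assuming the parameter belongs on the module line that loads the dom0 kernel, not the one that loads the initrd.

```
# Hypothetical grub.conf entry; names and <version> are placeholders.
title Oracle VM Server
	root (hd0,0)
	kernel /xen.gz
	# swiotlb=128 appended to the dom0 kernel's module line
	module /vmlinuz-<version> ro root=LABEL=/ swiotlb=128
	module /initrd-<version>.img
```

The host has to be rebooted for the setting to apply. Afterwards, checking /proc/cmdline on the host should show whether swiotlb=128 was actually passed to the kernel.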