4 Replies. Latest reply: Jan 22, 2013 6:59 AM by user7757478

    VM Server rebooting for unknown reason

    user7757478
      Hi folks,
      I have a test system with two OVM servers (3.0.3) hosting four RACs, so each VM server runs four RAC VMs. This configuration worked well for three months. Although nothing has changed, we are now facing the problem that the second VM server reboots. I checked the messages file and it seems that there are some problems with the SAN storage (NetApp via FC). However, only the second node is affected, and on the SAN side (storage and FC switches) there is nothing unusual. What I have found out so far is that if I don't start any of the VMs the server survives, but as soon as I start even one VM, the server reboots after some time.
      Here is the messages log:
      ...
      Jan 8 18:30:29 bdtzlp02 multipathd: 360a98000646e4f4e674a6b662f4d2f7a: sdev - directio checker reports path is down
      Jan 8 18:30:29 bdtzlp02 multipathd: checker failed path 129:112 in map 360a98000646e4f4e674a6b662f4d2f7a
      Jan 8 18:30:29 bdtzlp02 multipathd: 360a98000646e4f4e674a6b662f4d2f7a: remaining active paths: 5
      Jan 8 18:30:29 bdtzlp02 kernel: [14753.710463] device-mapper: multipath: Failing path 129:112.
      Jan 8 18:30:29 bdtzlp02 multipathd: 360a98000646e4f4e674a6b665a78314a: sdeo - directio checker reports path is down
      Jan 8 18:30:29 bdtzlp02 multipathd: checker failed path 129:0 in map 360a98000646e4f4e674a6b665a78314a
      Jan 8 18:30:29 bdtzlp02 multipathd: 360a98000646e4f4e674a6b665a78314a: remaining active paths: 5
      Jan 8 18:30:29 bdtzlp02 kernel: [14753.714442] device-mapper: multipath: Failing path 129:0.
      Jan 8 18:30:29 bdtzlp02 multipathd: 360a98000646e4f4e674a6b662f47346e: sdfy - directio checker reports path is down
      Jan 8 18:30:29 bdtzlp02 multipathd: checker failed path 131:64 in map 360a98000646e4f4e674a6b662f47346e
      Jan 8 18:30:29 bdtzlp02 kernel: [14753.722444] device-mapper: multipath: Failing path 131:64.
      Jan 8 18:30:29 bdtzlp02 multipathd: 360a98000646e4f4e674a6b662f47346e: remaining active paths: 4
      Jan 8 18:30:29 bdtzlp02 multipathd: dm-28: add map (uevent)
      ...
      Jan 8 18:30:50 bdtzlp02 o2hbmonitor: Last ping 55812 msecs ago on /dev/dm-1, 0004FB0000050000A514E628F76D9293
      Jan 8 18:30:52 bdtzlp02 o2hbmonitor: Last ping 57816 msecs ago on /dev/dm-1, 0004FB0000050000A514E628F76D9293
      Jan 8 18:30:53 bdtzlp02 multipathd: 360a98000646e4f4e674a6b346578555a: sdec - directio checker reports path is down
      Jan 8 18:30:53 bdtzlp02 multipathd: 360a98000646e4f4e674a6b34665a6d6d: sdef - directio checker reports path is down
      Jan 8 18:30:53 bdtzlp02 multipathd: 360a98000646e4f4e674a6b346666484a: sdeg - directio checker reports path is down
      Jan 8 18:30:53 bdtzlp02 multipathd: 360a98000646e4f4e674a6b662f417434: sder - directio checker reports path is down
      Jan 8 18:30:53 bdtzlp02 multipathd: 360a98000646e4f4e674a6c7977423655: sdge - directio checker reports path is down
      Jan 8 18:30:54 bdtzlp02 o2hbmonitor: Last ping 59820 msecs ago on /dev/dm-1, 0004FB0000050000A514E628F76D9293
      Jan 8 18:30:54 bdtzlp02 multipathd: 360a98000646e4f4e674a6b665a685774: sdek - directio checker reports path is down
      Jan 8 18:30:54 bdtzlp02 multipathd: 360a98000646e4f4e674a6b665a59506e: sdei - directio checker reports path is down
      Jan 8 18:30:54 bdtzlp02 multipathd: 360a98000646e4f4e674a6b662f47346e: sdet - directio checker reports path is down
      Jan 8 19:38:33 bdtzlp02 syslogd 1.4.1: restart.

      Hint: system time is different from ntp time, hence there is a 'time jump' in the messages.

      It seems to me that the server runs into problems as soon as I start a VM and SAN access increases.

      Which files should I check? Is there a command to check the status of the FC HBA? Any other ideas?

      Thanks for your help & regards

      Axel D.
        • 1. Re: VM Server rebooting for unknown reason
          user12273962
          Not that I think you have a time issue, but the time should not be different. Also, did you use the RAC templates for your hosts or did you build them yourselves?

          If you built them yourselves, there are changes that you have to make to the VM guests for timing issues. The RAC templates already have this. Check out this best practices document.

          http://www.oracle.com/technetwork/database/clustering/oracle-rac-in-oracle-vm-environment-131948.pdf

          Also notice there are a lot more changes that need to be made if you're running 11.1 or earlier versus using an 11.2 database.

          1. Are you using virtual disks from a FC repository or are you using direct access?

          2. Is there a reason you're using 3.0.3 instead of 3.1.1?

          3. Have you recently customized your multipath config? Are you still using the native "multipather"? Does multipath -ll show any path down?

          4. Not knowing what all you have connected.... the log indicates you go from 5 paths to 4 paths. Does that mean you had 6 paths to start with and both paths to one device went completely offline? This would be why the server rebooted.
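
          To make the path counts in question 4 easier to read, here is a minimal Python sketch that groups the multipathd messages from the posted log by map (WWID). The regular expressions are derived only from the lines quoted in this thread, and the function name summarize_paths is made up for illustration. In the excerpt, the counts 5 and 4 are reported for different maps, which such a summary makes visible.

```python
import re
from collections import defaultdict

# Patterns derived from the log lines quoted in this thread (assumption:
# other multipathd versions may word these messages differently).
FAILED = re.compile(
    r"multipathd: checker failed path (?P<devt>\d+:\d+) in map (?P<wwid>[0-9a-f]+)"
)
REMAINING = re.compile(
    r"multipathd: (?P<wwid>[0-9a-f]+): remaining active paths: (?P<n>\d+)"
)


def summarize_paths(lines):
    """Return {wwid: {"failed": [major:minor, ...], "remaining": last count}}."""
    summary = defaultdict(lambda: {"failed": [], "remaining": None})
    for line in lines:
        m = FAILED.search(line)
        if m:
            summary[m["wwid"]]["failed"].append(m["devt"])
            continue
        m = REMAINING.search(line)
        if m:
            summary[m["wwid"]]["remaining"] = int(m["n"])
    return dict(summary)


# Lines copied from the original post.
SAMPLE = [
    "Jan 8 18:30:29 bdtzlp02 multipathd: checker failed path 129:112 in map 360a98000646e4f4e674a6b662f4d2f7a",
    "Jan 8 18:30:29 bdtzlp02 multipathd: 360a98000646e4f4e674a6b662f4d2f7a: remaining active paths: 5",
    "Jan 8 18:30:29 bdtzlp02 multipathd: checker failed path 129:0 in map 360a98000646e4f4e674a6b665a78314a",
    "Jan 8 18:30:29 bdtzlp02 multipathd: 360a98000646e4f4e674a6b665a78314a: remaining active paths: 5",
    "Jan 8 18:30:29 bdtzlp02 multipathd: checker failed path 131:64 in map 360a98000646e4f4e674a6b662f47346e",
    "Jan 8 18:30:29 bdtzlp02 multipathd: 360a98000646e4f4e674a6b662f47346e: remaining active paths: 4",
]

if __name__ == "__main__":
    for wwid, info in summarize_paths(SAMPLE).items():
        print(wwid, "failed:", info["failed"], "remaining:", info["remaining"])
```

          Running it against the full /var/log/messages (for example by passing open("/var/log/messages") instead of SAMPLE) would show whether any single map ever drops to zero remaining paths.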
          • 2. Re: VM Server rebooting for unknown reason
            user7757478
            Hi,

            Yes, I used the RAC templates and so far I have not faced any problems. I don't know exactly what you mean by FC repository. We have NetApp and IBM storage, from which we map the corresponding LUNs (the OVS repository and the pool, as well as the five LUNs per VM/RAC) to the VM server.
            We are still using 3.0.3 because we have a production system on 3.0.3 and have not yet started our update activities (testing, approval etc.).
            The multipath configuration is unchanged and we are using the driver that comes with the VM server.
            Currently, with the VM server up and running (I have not started any VM so far), all paths are 'active ready running', but I'm afraid that starting even one VM would bring it down again (that has happened three times, so I guess we can call it reproducible ;-).

            Thanks & regards

            Axel D.
            • 3. Re: VM Server rebooting for unknown reason
              Dave Smulsky
              How's the health of the LUN you're using for your POOLFS? Any obvious issues? Are you using NFS for anything?

              Jan 8 18:30:50 bdtzlp02 o2hbmonitor: Last ping 55812 msecs ago on /dev/dm-1, 0004FB0000050000A514E628F76D9293
              Jan 8 18:30:52 bdtzlp02 o2hbmonitor: Last ping 57816 msecs ago on /dev/dm-1, 0004FB0000050000A514E628F76D9293
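
              As a rough way to see how long the heartbeat device went without a successful write, the o2hbmonitor lines can be pulled out of the messages file with a short Python sketch like the one below. The pattern is based only on the lines quoted in this thread, and the function names are hypothetical. What gap is tolerated before a node fences itself depends on the O2CB heartbeat configuration of the server, so no threshold is assumed here.

```python
import re

# Pattern derived from the o2hbmonitor lines quoted in this thread
# (assumption: the message format is the same across the whole log).
PING = re.compile(
    r"o2hbmonitor: Last ping (?P<ms>\d+) msecs ago on (?P<dev>[^\s,]+), (?P<region>[0-9A-F]+)"
)


def heartbeat_gaps(lines):
    """Return a list of (device, milliseconds since last ping) tuples."""
    gaps = []
    for line in lines:
        m = PING.search(line)
        if m:
            gaps.append((m["dev"], int(m["ms"])))
    return gaps


def worst_gap_ms(lines):
    """Return the largest reported gap in milliseconds, or None if none found."""
    values = [ms for _, ms in heartbeat_gaps(lines)]
    return max(values) if values else None


# Lines copied from the original post.
SAMPLE = [
    "Jan 8 18:30:50 bdtzlp02 o2hbmonitor: Last ping 55812 msecs ago on /dev/dm-1, 0004FB0000050000A514E628F76D9293",
    "Jan 8 18:30:52 bdtzlp02 o2hbmonitor: Last ping 57816 msecs ago on /dev/dm-1, 0004FB0000050000A514E628F76D9293",
    "Jan 8 18:30:54 bdtzlp02 o2hbmonitor: Last ping 59820 msecs ago on /dev/dm-1, 0004FB0000050000A514E628F76D9293",
]

if __name__ == "__main__":
    for dev, ms in heartbeat_gaps(SAMPLE):
        print(f"{dev}: {ms / 1000:.1f} s since last ping")
    print("worst gap (ms):", worst_gap_ms(SAMPLE))
```

              In the posted excerpt the gap on /dev/dm-1 grows by about two seconds with every message, which suggests the heartbeat writes were not completing at all during that window.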
              • 4. Re: VM Server rebooting for unknown reason
                user7757478
                Hello,

                I think I found the culprit: it was one of the two FC HBAs that we have installed in our server. There wasn't any message that indicated a defect, but when I pulled the FC cables out of that HBA (we still have a second one which handles the traffic), the server didn't restart any more.

                Thanks & regards

                AxelD.
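
                For readers who arrive here with the earlier question about checking the FC HBA status: on Linux systems that expose the FC transport class in sysfs, each HBA port appears under /sys/class/fc_host. The following Python sketch reads a few attributes from there. It assumes that this directory exists on the OVM server in question and that the driver populates port_name, port_state and speed; the function name fc_host_states is made up for illustration. Note that in this thread the faulty HBA did not log any defect, so a port reporting a healthy state does not rule out a bad adapter.

```python
import os


def fc_host_states(base="/sys/class/fc_host"):
    """Return {host: {attribute: value}} for every FC host found under base.

    Attributes that cannot be read are reported as None.
    """
    states = {}
    if not os.path.isdir(base):
        return states
    for host in sorted(os.listdir(base)):
        info = {}
        for attr in ("port_name", "port_state", "speed"):
            path = os.path.join(base, host, attr)
            try:
                with open(path) as fh:
                    info[attr] = fh.read().strip()
            except OSError:
                info[attr] = None
        states[host] = info
    return states


if __name__ == "__main__":
    hosts = fc_host_states()
    if not hosts:
        print("no FC hosts found under /sys/class/fc_host")
    for host, info in hosts.items():
        print(host, info)
```

                Comparing this output with multipath -ll while a VM is starting may help to tie failing paths to one specific HBA port.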