9 Replies Latest reply on Oct 4, 2019 4:46 PM by 2842182

    Exadata X3-2 crashing with Oracle VM 3.4.6

    2842182

      Hello all,

       

      I've been using Oracle VM for some time now, and until the last patch everything ran fine without issues.

       

      We have several OVM pools for different types of servers. When we started our latest patch round (we do quarterly patch cycles), some of our test hypervisors started crashing hard, without any information in the server logs.

       

      We fully patched Oracle VM to the 3.4.6 release running the kernel 4.1.12-124.30.1.el6uek.x86_64.

       

      When the servers started to crash, I ran the extended hardware test from the ILOM and got some errors on the CPUs.

      I procured replacement CPUs for one of the servers and ran the extended hardware test again; this time it finished without errors.

       

      The problem is that the crashes are still happening.

      I configured kdump to get more details from the OS about the reason for the crashes, but no dump is generated when the system crashes.

       

      I've also found multiple references to the issue reported at https://docs.oracle.com/cd/E64076_01/E76173/html/vmrns-bugs-3.3.2-swiotlb-buffer-errors-jumbo-frames.html . Because I'm using iSCSI disks over multiple 10Gb Ethernet links with jumbo frames and link aggregation, I changed the value on the kernel command line as described there, but it did not solve my issue.

      Note that the same storage is used by other pools that were not updated, and those have had no issues so far.

       

      The only lead I have is in the ILOM of the server with the replaced CPUs, which still shows the following messages when the server crashes:

       

      Wed Sep 4 21:01:29 2019  IPMI  Log  minor
      ID = 8bb : 09/04/2019 : 21:01:29 : System Firmware Progress : SMI Handler : PCI resource configuration : Asserted

      2274  Wed Sep 4 21:01:29 2019  IPMI  Log  minor
      ID = 8ba : 09/04/2019 : 21:01:29 : System Firmware Progress : SMI Handler : PCI resource configuration : Asserted

      2273  Wed Sep 4 21:01:21 2019  IPMI  Log  minor
      ID = 8b9 : 09/04/2019 : 21:01:21 : Processor : BIOS : SM BIOS Uncorrectable CPU-complex Error Node 1 Bank = 3 OVER = 1 UC = 1 EN = 0 MISCV = 0 ADDRV = 0 PCC = 1 S = 0 AR = 0 : Asserted

      2272  Wed Sep 4 21:01:21 2019  IPMI  Log  minor
      ID = 8b8 : 09/04/2019 : 21:01:21 : System Firmware Progress : SMI Handler : Secondary CPU Initialization : Asserted

      2271  Wed Sep 4 21:01:21 2019  IPMI  Log  minor
      ID = 8b7 : 09/04/2019 : 21:01:20 : System Firmware Progress : SMI Handler : Primary CPU initialization : Asserted

      2270  Wed Sep 4 21:00:53 2019  IPMI  Log  minor
      ID = 8b6 : 09/04/2019 : 21:00:53 : System Firmware Progress : SMI Handler : Memory initialization : Asserted

      2269  Wed Sep 4 21:00:51 2019  IPMI  Log  minor
      ID = 8b5 : 09/04/2019 : 21:00:51 : Processor : BIOS : SM BIOS Uncorrectable CPU-complex Error Node 1 Bank = 17 OVER = 0 UC = 1 EN = 1 MISCV = 1 ADDRV = 1 PCC = 1 S = 1 AR = 0 : Asserted

      2268  Wed Sep 4 21:00:50 2019  IPMI  Log  minor
      ID = 8b4 : 09/04/2019 : 21:00:50 : Processor : BIOS : SM BIOS Uncorrectable CPU-complex Error Node 1 Bank = 17 OVER = 0 UC = 1 EN = 1 MISCV = 1 ADDRV = 1 PCC = 1 S = 1 AR = 0 : Asserted

      2267  Wed Sep 4 21:00:50 2019  IPMI  Log  minor
      ID = 8b3 : 09/04/2019 : 21:00:50 : Processor : BIOS : SM BIOS Uncorrectable CPU-complex Error Node 0 Bank = 5 OVER = 0 UC = 1 EN = 1 MISCV = 1 ADDRV = 1 PCC = 1 S = 0 AR = 0 : Asserted

      2266  Wed Sep 4 21:00:49 2019  IPMI  Log  minor
      ID = 8b2 : 09/04/2019 : 21:00:49 : Processor : BIOS : SM BIOS Uncorrectable CPU-complex Error Node 0 Bank = 5 OVER = 0 UC = 1 EN = 1 MISCV = 1 ADDRV = 1 PCC = 1 S = 0 AR = 0 : Asserted

      2265  Wed Sep 4 21:00:49 2019  IPMI  Log  minor
      ID = 8b1 : 09/04/2019 : 21:00:49 : Processor : BIOS : SM BIOS Uncorrectable CPU-complex Error Node 0 Bank = 20 OVER = 0 UC = 1 EN = 1 MISCV = 1 ADDRV = 1 PCC = 1 S = 0 AR = 0 : Asserted

      2264  Wed Sep 4 21:00:47 2019  Fault  Fault  critical
      Fault detected at time = Wed Sep 4 21:00:47 2019. The suspect component: /SYS/MB/P0 has fault.cpu.intel.quickpath.home_agent with probability=100. Refer to http://support.oracle.com/msg/SPX86-8003-CR for details.

      2263  Wed Sep 4 21:00:47 2019  IPMI  Log  minor
      ID = 8b0 : 09/04/2019 : 21:00:47 : Processor : BIOS : SM BIOS Uncorrectable CPU-complex Error Node 0 Bank = 20 OVER = 0 UC = 1 EN = 1 MISCV = 1 ADDRV = 1 PCC = 1 S = 0 AR = 0 : Asserted

      2262  Wed Sep 4 21:00:44 2019  IPMI  Log  minor
      ID = 8af : 09/04/2019 : 21:00:44 : System Firmware Progress : SMI Handler : Management controller initialization : Asserted

      2261  Wed Sep 4 21:00:44 2019  IPMI  Log  minor
      ID = 8ae : 09/04/2019 : 21:00:44 : Processor : System Management Software : IERR : Asserted

      2260  Wed Sep 4 17:16:31 2019  IPMI  Log  minor
      ID = 8ad : 09/04/2019 : 17:16:31 : System Firmware Progress : SMI Handler : System boot initiated : Asserted
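For anyone comparing a log like this across several servers, here is a minimal sketch (not part of the original post) that pulls the node, bank and status flags out of the "Uncorrectable CPU-complex Error" lines and counts events per (node, bank). The function names `parse_mce` and `count_by_bank` are hypothetical, and the regular expression assumes the exact wording shown in the ILOM text above. The flag names look like the usual machine-check status bits (for example PCC, processor context corrupt), but that reading is an assumption, not something the ILOM output states.

```python
import re
from collections import Counter

# Matches the ILOM wording shown in the log above (assumed to be stable).
MCE_RE = re.compile(
    r"Uncorrectable CPU-complex Error Node (\d+) Bank = (\d+)"
    r"((?: [A-Z]+ = [01])+)"
)

def parse_mce(line):
    """Return node, bank and flags for one event line, or None if it is not an MCE line."""
    m = MCE_RE.search(line)
    if m is None:
        return None
    # Turn " OVER = 1 UC = 1 ..." into {"OVER": 1, "UC": 1, ...}.
    flags = {name: int(val) for name, val in re.findall(r"([A-Z]+) = ([01])", m.group(3))}
    return {"node": int(m.group(1)), "bank": int(m.group(2)), "flags": flags}

def count_by_bank(lines):
    """Count uncorrectable-error events per (node, bank) pair."""
    counts = Counter()
    for line in lines:
        event = parse_mce(line)
        if event is not None:
            counts[(event["node"], event["bank"])] += 1
    return dict(counts)
```

Feeding it the lines of an exported event log, one string per line, gives a quick view of whether the same banks keep reporting errors after a CPU swap.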

       

      I've also updated the firmware on the server to the latest available level:

       

      Integrated Lights Out Manager v4.0.4.22.a

       

      Settings

      17160200
      Legacy
      Ok

       

      Do any of you have a similar issue?

       

      I can't open a support ticket with Oracle, as I'm not the original buyer of the servers.

       

      Any help would be really appreciated.

       

      Thank you.

       

      Kind regards,

      Jorge

        • 1. Re: Exadata X3-2 crashing with Oracle VM 3.4.6
          Harshitajain-Oracle

          Hi,

           

          Do you have enough space (minimum size of the VM) in the directory /var/crash (generally the path where the vmcore is generated)?

           

          Regards,

          Harshita.

          • 2. Re: Exadata X3-2 crashing with Oracle VM 3.4.6
            2842182

            Hello Harshita,

             

            Thank you for your interest in helping me.

             

            /var/crash is on the root file system, which has 43 GB of free space.

            The Dom0 has less than 6 GB of RAM, and all VMs have at most 4 GB.

            The server has 256 GB of RAM, but Oracle VM Manager is reporting a usage of 41.86 GB.

             

            Even with all this, I don't see any dump attempts or errors in the system logs. I even configured syslog to write the kernel log to a file, and nothing is logged by the OS.

             

            My problem is that the server just "dies" and there is no trace of any kind in the logs.

             

            I'm trying to get hold of a serial port so I can connect to the server and have the console logs dumped to it, to see if I get any more information.

             

            Kind regards,

            Jorge

            • 3. Re: Exadata X3-2 crashing with Oracle VM 3.4.6
              2842182

              Hello again,

               

              Just a quick update on the issue.

               

              A new kernel was released for Oracle VM and I've now installed it, to see if it helps in any way.

               

              The new kernel is 4.1.12-124.31.1 and the errata reads as follows:

               

              - dm bufio: fix deadlock with loop device (Junxiao Bi) [Orabug: 29964645]

              - dm bufio: don't take the lock in dm_bufio_shrink_count (Mikulas Patocka) [Orabug: 29964645]

              - rds: rds-info shows IPv4 address as '0.0.0.0' (aru kolappan) [Orabug: 30022915]

              - restore cond_resched() in shrink_dcache_parent() (Al Viro) [Orabug: 30101895]

              - retpoline: Move retpoline_mode_selected() out of .init.text section (Alejandro Jimenez) [Orabug: 30250332]

               

              I'm still trying to find the descriptions of the bugs, but the dm bugs seem like good candidates to match my issue.

               

              Thank you.

               

              Kind regards,

              Jorge

              • 4. Re: Exadata X3-2 crashing with Oracle VM 3.4.6
                2842182

                Hello again,

                 

                Even with the new kernel, I just experienced another crash on the system.

                 

                So no luck on the system update.

                 

                Kind regards,

                Jorge

                • 5. Re: Exadata X3-2 crashing with Oracle VM 3.4.6
                  Harshitajain-Oracle

                  Hi,

                   

                  Oh. Was a crash dump generated?

                   

                  Regards,

                  Harshita.

                  • 6. Re: Exadata X3-2 crashing with Oracle VM 3.4.6
                    2842182

                    Hello again,

                     

                    No luck. Still no crash dump. But I'm now much more convinced that it is a hardware issue: I deployed a different server to run some of the VMs that were running on the crashing servers, and it has two and a half days of uptime with no signs of issues.

                     

                    So I think it really is a hardware issue, and not an Oracle VM issue.

                     

                    Either way I will try to get to the bottom of the issue and will report my findings here.

                     

                    Kind regards,

                    Jorge

                    • 7. Re: Exadata X3-2 crashing with Oracle VM 3.4.6
                      2842182

                      Hello,

                       

                      This problem is definitely not an OVM issue. I've tested the servers with memtest from a live CD, and the server shows the same behaviour after a long memtest run, crashing without any alert.

                       

                      As this is not OVM related, I'm closing this thread.

                       

                      Thank you Harshita for your interest.

                       

                      Kind regards,

                      Jorge

                      • 8. Re: Exadata X3-2 crashing with Oracle VM 3.4.6
                        2842182

                        Hello again,

                         

                        I have a further update on this issue.

                         

                        I have some other servers running on almost the same hardware, and definitely the same CPU.

                         

                        I updated them last Friday, and I started seeing the same issue on those servers. They are running a completely different OS and kernel: CentOS 7 with a mainline kernel (5.3.x).

                         

                        They behave similarly to these servers with some crashes without any traces.

                         

                        I've correlated this to the microcode update released for these cpus.

                         

                        The only thing I can think of that is the same across all servers is the microcode: the CentOS 7 servers started showing the symptoms right after the update that installed the same microcode as on the Oracle VM servers.

                         

                        Because I've updated the BIOS on the Oracle VM servers, I'm stuck with the new microcode there until a fix is out. On the other hand, I was able to downgrade the microcode on the servers running CentOS 7, as I didn't update the BIOS on those servers.

                         

                        With this in mind, the new microcode revision reported by the kernel is 0x718 for the CPU Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz.

                        On CentOS 7 I was able to downgrade the microcode to revision 0x714. I will update this thread in a week to let everyone know whether this is in fact a microcode issue or not.
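To check which revision a Linux host is actually running, here is a minimal sketch (not from the original post) that reads the `microcode` field the kernel exposes in /proc/cpuinfo on x86. The helper names `microcode_revisions` and `runs_suspect_revision` are hypothetical, and 0x718 is only the revision suspected in this thread, not a confirmed bad release. Under a hypervisor the value seen from a guest or Dom0 may not reflect what is loaded on the physical CPU, so treat the result as a hint.

```python
SUSPECT_REVISION = 0x718   # revision suspected in this thread (unconfirmed)
DOWNGRADE_REVISION = 0x714  # revision the CentOS 7 hosts were rolled back to

def microcode_revisions(cpuinfo_text):
    """Return the set of microcode revisions listed in /proc/cpuinfo-style text."""
    revisions = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Each logical CPU has a line such as "microcode\t: 0x718".
        if sep and key.strip() == "microcode":
            revisions.add(int(value.strip(), 16))
    return revisions

def runs_suspect_revision(cpuinfo_text):
    """True if any logical CPU reports the suspected revision."""
    return SUSPECT_REVISION in microcode_revisions(cpuinfo_text)

def read_host_cpuinfo(path="/proc/cpuinfo"):
    """Read the cpuinfo text from the local host (Linux only)."""
    with open(path) as handle:
        return handle.read()
```

On a Linux host, `runs_suspect_revision(read_host_cpuinfo())` reports whether the suspected revision is loaded. Running it before and after a downgrade confirms that the rollback actually took effect.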

                         

                        I've also opened an issue on the Intel microcode GitHub repository, which can be seen at https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files/issues/15

                         

                        Kind regards,

                        Jorge

                        • 9. Re: Exadata X3-2 crashing with Oracle VM 3.4.6
                          2842182

                          Hello again,

                           

                          This might be related to the microcode issue after all.

                           

                          You can see on the page from my last comment that Intel and Oracle are working on an issue that might be related to this.

                           

                          Kind regards,

                          Jorge