1 Reply Latest reply on Aug 20, 2018 7:28 PM by Todd Vierling-Oracle

    UEK5 4.14.35-1818.1.6 seems to be buggy / unstable under Xen

    dff8de88-5c60-478b-91d5-96b176d69e9c

      We tried to run the UEK5 kernel 4.14.35-1818.1.6 as a guest under a Xen (Citrix XenServer) host. While UEK4 runs fine, UEK5 is completely unstable: under load, the machine reboots at random several times a day, with nothing to trace. So far we have not been able to diagnose anything, because setting panic and other debugging parameters does not help; as far as we can tell, it is just a sporadic machine reboot. There is nothing in dmesg or in any logs. The environment is UEK5 on a CentOS 7 distro, running Apache + haproxy with a decent amount of PHP load.

       

      While investigating, we noticed something strange in the contents of /proc/interrupts. We booted UEK4 and UEK5 on the same virtual machine; here is the relevant part of the cat /proc/interrupts output for each:

       

      1. UEK4 4.1.12-124.18.6.el7uek.x86_64

       

      126:       2327       4951     390812       7137      27065      34535          0          0       2034      26056          0          0   xen-dyn-event     eth0-q0-tx

      127:        789          0          0          0          0          0          0          0          0          0     308020          0   xen-dyn-event     eth0-q0-rx

      128:        238          0          0     422139       1753      21947          0          0          0          0          0      31922   xen-dyn-event     eth0-q1-tx

      129:     304956          0          0          0          0          0          0          0          0          0          0          0   xen-dyn-event     eth0-q1-rx

      130:        247      36307      10385       4564          0     376159          0          0          0          0          0          0   xen-dyn-event     eth0-q2-tx

      131:        395          0      14542          0          0          0          0          0          0          0          0     299347   xen-dyn-event     eth0-q2-rx

      132:        265       3711          0      56982          0          0          0     466002          0          0          0          0   xen-dyn-event     eth0-q3-tx

      133:        412       1220          0          0      33300          0          0          0          0     229686          0          0   xen-dyn-event     eth0-q3-rx

      134:         42      11563        741       1473      10777      12719       7901          0       6021       5203          0          0   xen-dyn-event     eth1-q0-tx

      135:       1083       9307       7185        492       9292       2996      13060       2507       3449          0        986       1140   xen-dyn-event     eth1-q0-rx

      136:         20       2071          0        805          0       4752      36652       2079        323       2296          0        807   xen-dyn-event     eth1-q1-tx

      137:         38       4338       1220        317       5194       1163          0          0      37544          0          0          0   xen-dyn-event     eth1-q1-rx

      138:         12       2430       3804      11079       5096       6457       7169        141      13129       1663          0          0   xen-dyn-event     eth1-q2-tx

      139:         41       3488       1045       1706      11638        622      10475      19493       2812          0       2394          0   xen-dyn-event     eth1-q2-rx

      140:       1077       4628       8442       1669       1468       1757       4265        980      18423          0          0       7954   xen-dyn-event     eth1-q3-tx

      141:       1939      16314        542       9944      13519        300       3586          0       5919       2254          0          0   xen-dyn-event     eth1-q3-rx

       

      As you can see, everything looks normal for Xen and the interrupts are distributed properly.

       

      2. UEK5 4.14.35-1818.1.6.el7uek.x86_64

       

      126:   30243365          0          0          0          0          0          0          0          0          0          0          0   xen-dyn    -event     eth%d-q0-tx

      127:   28819600          0          0          0          0          0          0          0          0          0          0          0   xen-dyn    -event     eth%d-q0-rx

      128:   25702907          0          0          0          0          0          0          0          0          0          0          0   xen-dyn    -event     eth%d-q1-tx

      129:   21968191          0          0          0          0          0          0          0          0          0          0          0   xen-dyn    -event     eth%d-q1-rx

      130:   26463888          0          0          0          0          0          0          0          0          0          0          0   xen-dyn    -event     eth%d-q2-tx

      131:   23639256          0          0          0          0          0          0          0          0          0          0          0   xen-dyn    -event     eth%d-q2-rx

      132:   26334251          0          0          0          0          0          0          0          0          0          0          0   xen-dyn    -event     eth%d-q3-tx

      133:   24364986          0          0          0          0          0          0          0          0          0          0          0   xen-dyn    -event     eth%d-q3-rx

      134:   10540611          0          0          0          0          0          0          0          0          0          0          0   xen-dyn    -event     eth%d-q0-tx

      135:   10971404          0          0          0          0          0          0          0          0          0          0          0   xen-dyn    -event     eth%d-q0-rx

      136:   11638510          0          0          0          0          0          0          0          0          0          0          0   xen-dyn    -event     eth%d-q1-tx

      137:   10337372          0          0          0          0          0          0          0          0          0          0          0   xen-dyn    -event     eth%d-q1-rx

      138:   10443763          0          0          0          0          0          0          0          0          0          0          0   xen-dyn    -event     eth%d-q2-tx

      139:   10057395          0          0          0          0          0          0          0          0          0          0          0   xen-dyn    -event     eth%d-q2-rx

      140:   10378811          0          0          0          0          0          0          0          0          0          0          0   xen-dyn    -event     eth%d-q3-tx

      141:   10211251          0          0          0          0          0          0          0          0          0          0          0   xen-dyn    -event     eth%d-q3-rx

       

      The tabs between "dyn" and "-event" really do exist in the output; this is not a forum copy-paste issue.

       

      Basically, the interrupt names are borked somehow and, more importantly, these interrupts all land on CPU0, so something in the kernel is preventing their distribution. Of course, we are not sure how this relates to the instability we see, but it definitely looks like some weird bug...

       

      Non-Xen interrupt names are borked as well, but their distribution seems to be ok.

       

        0:        137          0   IO-APIC   2-edge      timer

      14:          0          0   IO-APIC  14-edge      ata_piix

      15:     122502     123107   IO-APIC  15-edge      ata_piix
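
      The two symptoms above (all Xen event-channel interrupts landing on CPU0, and names containing a literal "%d") can be checked with a short script instead of by eye. This is only a minimal sketch: it assumes the usual /proc/interrupts layout (a header row of CPU names, then one row per IRQ), and the helper names parse_interrupts, busy_cpus, unexpanded_names and report are made up for illustration. Note that a single IRQ staying on one CPU is not abnormal by itself; the pattern of interest is every queue sharing the same CPU.

```python
def parse_interrupts(text):
    """Parse /proc/interrupts text into {irq: (per_cpu_counts, description)}."""
    lines = text.splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    rows = {}
    for line in lines[1:]:
        if ":" not in line:
            continue
        irq, rest = line.split(":", 1)
        fields = rest.split()
        counts = fields[:ncpu]
        # Summary rows such as ERR/MIS carry fewer columns; skip them.
        if len(counts) < ncpu or not all(c.isdigit() for c in counts):
            continue
        rows[irq.strip()] = ([int(c) for c in counts], " ".join(fields[ncpu:]))
    return rows


def busy_cpus(rows, name_filter):
    """Set of CPU indices that serviced any IRQ whose description contains name_filter."""
    cpus = set()
    for counts, desc in rows.values():
        if name_filter in desc:
            cpus.update(i for i, c in enumerate(counts) if c > 0)
    return cpus


def unexpanded_names(rows):
    """IRQs whose description still contains a literal '%d' format specifier."""
    return [irq for irq, (_counts, desc) in rows.items() if "%d" in desc]


def report(path="/proc/interrupts"):
    """Print both checks for the running system (path is the standard procfs file)."""
    with open(path) as f:
        rows = parse_interrupts(f.read())
    print("CPUs servicing xen-dyn IRQs:", sorted(busy_cpus(rows, "xen-dyn")))
    print("IRQs with unexpanded names:", unexpanded_names(rows))
```

      Calling report() on the UEK5 guest shown above would be expected to list only CPU 0 for the xen-dyn interrupts, while the UEK4 guest would list most of the CPUs.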

        • 1. Re: UEK5 4.14.35-1818.1.6 seems to be buggy / unstable under Xen
          Todd Vierling-Oracle

          dff8de88-5c60-478b-91d5-96b176d69e9c wrote:

           

          We tried to run the UEK5 kernel 4.14.35-1818.1.6 as a guest under a Xen (Citrix XenServer) host. While UEK4 runs fine, UEK5 is completely unstable: under load, the machine reboots at random several times a day, with nothing to trace. So far we have not been able to diagnose anything, because setting panic and other debugging parameters does not help; as far as we can tell, it is just a sporadic machine reboot. There is nothing in dmesg or in any logs. The environment is UEK5 on a CentOS 7 distro, running Apache + haproxy with a decent amount of PHP load.

           

          While investigating, we noticed something strange in the contents of /proc/interrupts. We booted UEK4 and UEK5 on the same virtual machine; here is the relevant part of the cat /proc/interrupts output for each:

           

          2. UEK5 4.14.35-1818.1.6.el7uek.x86_64

           

          126: 30243365 0 0 0 0 0 0 0 0 0 0 0 xen-dyn -event eth%d-q0-tx

           

          Thanks for the report. The issue with "eth%d" naming is covered by a bug open internally, which we believe should be fixed fairly soon. Watch for changelog lines with summary text matching these mainline commits.

           

          That leaves the reboots and IRQ weirdness.

           

          The former are more important to address first, certainly:

          1. Are these actual crashes (are you able to get a core with on_crash="coredump-restart"), or spontaneous reboots that don't trigger Xen's on_crash logic?
          2. If you can't get a coredump, can you capture the text "xm console" output (after ensuring that virtual-serial console is configured)?
          3. Are you using HV, PV, or PVH? (Note that pure PV is no longer supported on OL7, so it would have to be HV or PVH. Internally we specifically test HV, with the Xen PV drivers as primary net/disk.)