      • 15. Re: Slow network connection possibly causing a slave to crash.
        524761
        Although you say "different symptoms," I assume you just mean that the error reporting looks different, because it looks to me as if the same thing is happening, around the same place?

        Can you post a bit more of the master's verbose output, around the time corresponding to the LSN of the message whose buffer triggered the crash? We can see the LSN in the dump of those 36 bytes we've been looking at, in the second line. For example, from the first case you posted, we see this in the second line:
        0x8198a50: 0x00 0x00 0x00 0xe1 0x00 0x34 0xb7 0xa8
        The first 4 bytes are the 32-bit "file" portion of the LSN, and the last 4 bytes are the 32-bit "offset":
        225,3454888 (if I calculated correctly).
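        In case it helps, here is a minimal sketch of that decoding in plain C (the helper name print_lsn is just for illustration); it assumes the two 32-bit fields are big-endian, as above:

        #include <stdint.h>
        #include <stdio.h>

        /* Illustrative helper: decode the 8 bytes of the control dump that hold
         * the LSN.  The first 4 bytes are the 32-bit "file", the last 4 bytes
         * the 32-bit "offset", both in network byte order. */
        static void print_lsn(const unsigned char *p)
        {
            uint32_t file   = ((uint32_t)p[0] << 24) | (p[1] << 16) | (p[2] << 8) | p[3];
            uint32_t offset = ((uint32_t)p[4] << 24) | (p[5] << 16) | (p[6] << 8) | p[7];
            printf("[%u][%u]\n", file, offset);
        }

        int main(void)
        {
            /* Second line of the dump from the first case above:
             * 0x00 0x00 0x00 0xe1 0x00 0x34 0xb7 0xa8 */
            const unsigned char lsn_bytes[8] =
                { 0x00, 0x00, 0x00, 0xe1, 0x00, 0x34, 0xb7, 0xa8 };
            print_lsn(lsn_bytes);   /* prints [225][3454888] */
            return 0;
        }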

        Also, I assume the client isn't reporting any sort of connection failure around the time of the crash?

        Alan Bram
        Oracle
        • 16. Re: Slow network connection possibly causing a slave to crash.
          676400
          Actually, you are probably correct about the location of the crash. It does look very similar to the HK JVM dumps, except that glibc on the SUSE Linux UK box gives me the "*** glibc detected *** double free or corruption (out): 0xb7600680 ***" message, which I don't get on the Red Hat HK machine.

          I will re-enable the verbose replication logging on the master side and get gdb installed on the UK slave.

          No, the client isn't reporting any sort of connection failure around the time of the crash.

          Thanks,

          Neil
          • 17. Re: Slow network connection possibly causing a slave to crash.
            676400
            Actually,

            Further testing on the UK slave yielded a different crash this afternoon.

            *** glibc detected *** corrupted double-linked list: 0x8b3a9568 ***
            *** glibc detected *** double free or corruption (out): 0x8b3a9570 ***

            at...

            C [libc.so.6+0x614de]
            C [libc.so.6+0x63117]
            C [libc.so.6+0x64a14] malloc+0x94
            C [libdb_java-4.7.so+0x1606a6] __os_malloc+0x57
            C [libdb_java-4.7.so+0x52a53]
            C [libdb_java-4.7.so+0x52447] __bam_split+0x1eb
            C [libdb_java-4.7.so+0x3e473]
            C [libdb_java-4.7.so+0xfd878] __dbc_put+0x102d
            C [libdb_java-4.7.so+0xee947] __db_put+0x1d6
            C [libdb_java-4.7.so+0xb6460] __rep_apply+0x869
            C [libdb_java-4.7.so+0xadcf6] __rep_log+0x6a
            C [libdb_java-4.7.so+0xb40ca] __rep_process_message+0x1512
            C [libdb_java-4.7.so+0xc1ec2]
            C [libdb_java-4.7.so+0xc1e00]
            C [libdb_java-4.7.so+0xc1d74] __repmgr_msg_thread+0x24
            C [libpthread.so.0+0x5cb7]

            The JVM dump is reporting a free space of only 126k, so this could simply have been an allocation failure. I will try enabling the diagnostics on this slave once again and see if I can catch anything. Unfortunately, gdb is not yet on the box.

            Thanks,

            Neil
            • 18. Re: Slow network connection possibly causing a slave to crash.
              676400
              OK, we have had gdb installed on the UK machine.

               I have rerun the heavy load scenario with a debug build only, as enabling diagnostics and/or setting the MALLOC_CHECK_ environment variable seems to mask the problem.

               The UK machine is giving me a little more insight, as I am now getting the following glibc message:

              *** glibc detected *** double free or corruption (out): 0xb7600698 ***

              I went through the usual exercise of getting the replication message information from the thread that seemed to raise the SIGSEGV.

              The backtrace...

              #9 <signal handler called>
               #10 0x4009a6a6 in _int_free () from /lib/tls/libc.so.6
              #11 0x4009ab2b in free () from /lib/tls/libc.so.6
              #12 0x8c0e1843 in __os_free (env=0x0, ptr=0xb76005e0) at os_alloc.c:405
              #13 0x8c042e45 in message_loop (env=0x8b5f55f8) at repmgr_msg.c:49
              #14 0x8c042d74 in __repmgr_msg_thread (args=0x8b5f55f8) at repmgr_msg.c:29
              #15 0x4002acb7 in start_thread () from /lib/tls/libpthread.so.0
              #16 0x400f221e in clone () from /lib/tls/libc.so.6
              #17 0xafc8ebb0 in ?? ()

              (gdb) print msg
              $1 = (REPMGR_MESSAGE *) 0xb76005e0
              (gdb) print msg->control
              $2 = {data = 0xb7600620, size = 36, ulen = 0, dlen = 0, doff = 0, app_data = 0x0, flags = 0}
              (gdb) x/36xb msg->control.data
              0xb7600620: 0x00 0x00 0x00 0x05 0x00 0x00 0x00 0x0e
              0xb7600628: 0x00 0x00 0x00 0x12 0x00 0x84 0x62 0x58
              0xb7600630: 0x00 0x00 0x00 0x0b 0x00 0x00 0x00 0x01
              0xb7600638: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
              0xb7600640: 0x00 0x00 0x00 0x64

               I believe this corresponds to LSN [18][8675928].

              I also took a shot in the dark and inspected the address glibc is complaining about.

              (gdb) print ((REPMGR_MESSAGE *) 0xb7600698)->control
              $4 = {data = 0xb76006d8, size = 36, ulen = 0, dlen = 0, doff = 0, app_data = 0x0, flags = 0}

              That seemed sane, so...

              (gdb) x/36xb ((REPMGR_MESSAGE *) 0xb7600698)->control.data
              0xb76006d8: 0x00 0x00 0x00 0x05 0x00 0x00 0x00 0x0e
              0xb76006e0: 0x00 0x00 0x00 0x12 0x00 0x84 0x62 0xe0
              0xb76006e8: 0x00 0x00 0x00 0x0b 0x00 0x00 0x00 0x01
              0xb76006f0: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
              0xb76006f8: 0x00 0x00 0x00 0x44

               This would correspond to LSN [18][8676064].

               I understand that the address glibc is complaining about may well have been reused by the time the signal was raised, so I am not putting too much faith in that.

              I took a look at the other replication threads, but that address doesn't correspond to any replication message structures or message data areas currently in flight.

              The crash on the slave occurred at 21:52:02 GMT which would be 2010-09-13 16:52 EST on our master instance.

               So I took a look at the verbose replication logging output on our master, but unfortunately I am not seeing anything terribly remarkable (to me) at that time, except that both [18][8675928] and [18][8676064] occur very close to each other in the log output, and that the time of the crash coincides with when both of these were sent, along with many others.

               The sequence of events from the master's perspective during that run follows…
              .
              .
              .
              2010-09-13 16:52:01,897 [Thread-5] INFO : MASTER: /path/to/environment/directory rep_send_message: msgv = 5 logv 14 gen = 1 eid 1, type log, LSN [18][8675928] nobuf resend
              2010-09-13 16:52:01,897 [Thread-5] INFO : MASTER: /path/to/environment/directory rep_send_message: msgv = 5 logv 14 gen = 1 eid 1, type log, LSN [18][8676016] nobuf resend
              2010-09-13 16:52:01,897 [Thread-5] INFO : MASTER: /path/to/environment/directory rep_send_message: msgv = 5 logv 14 gen = 1 eid 1, type log, LSN [18][8676064] nobuf resend
              .
              .
              .
              2010-09-13 16:52:03,508 [Thread-5] INFO : MASTER: msg to site XXX.XX.XXX.XX:XXXXX to be queued
              2010-09-13 16:52:03,508 [Thread-5] INFO : MASTER: block msg thread, await queue space
              2010-09-13 16:52:03,508 [Thread-5] INFO : MASTER: will await drain
              2010-09-13 16:52:13,514 [Thread-5] INFO : MASTER: drain returned 0 (0,10)
              2010-09-13 16:52:13,514 [Thread-5] INFO : MASTER: queue limit exceeded
              2010-09-13 16:52:13,515 [Thread-5] INFO : MASTER: rep_send_function returned: -30888

               If you require any more information, I am letting gdb sit on the slave process until I hear otherwise.

              Thanks Again,

              Neil

              Edited by: vannote on Sep 15, 2010 12:08 PM (added the actual backtrace)
              • 19. Re: Slow network connection possibly causing a slave to crash.
                524761
                Hi Neil,

                A colleague wondered if there might be mutex bugs on the Linux
                platform. Although it seems quite a long shot, it does seem as if it
                could perhaps lead to this crash.

                Another colleague found this information about known/fixed bugs:

                https://rhn.redhat.com/errata/RHBA-2007-0471.html

                Is your platform on a 64-bit machine? And, do you know whether it is
                up to date with the bug fix mentioned on that list ("pthread_mutex_timedlock
                fix on x86_64")?

                If we think that might be a possibility, I might be able to give you a
                diagnostic patch, to add some information that each thread could
                maintain about its recent history of memory buffers. (We could hope
                that this change perturbs the timing less than turning on verbose
                and/or DIAGNOSTIC, which we've seen can mask the problem.) We would
                then have to dump out the contents of those memory buffers in a
                debugger after the point of this crash.
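
                 To give a rough idea of what I mean (this is only a sketch, not the actual patch; the names history_record and record_buffer are made up), each message thread could keep a small ring of the buffers it has recently handled:

                 #include <stddef.h>

                 #define HISTORY_SLOTS 16

                 struct history_record {
                     void  *buf;     /* address of the message buffer handled */
                     size_t size;    /* its size at the time it was handled */
                 };

                 /* One ring per message thread, so no locking is needed; __thread is
                  * the GCC/NPTL thread-local storage keyword available on these
                  * Linux platforms. */
                 static __thread struct history_record history[HISTORY_SLOTS];
                 static __thread unsigned history_next;

                 /* Would be called wherever a thread allocates or frees a message buffer. */
                 void record_buffer(void *buf, size_t size)
                 {
                     struct history_record *r = &history[history_next++ % HISTORY_SLOTS];
                     r->buf = buf;
                     r->size = size;
                 }

                 After a crash you could then dump that ring from gdb in the thread that took the signal (e.g. "print history") and compare the recorded addresses against the one glibc complains about.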

                If you're interested in pursuing that, please send me e-mail. (My
                e-mail address follows the usual "firstname.lastname@oracle.com"
                convention.)

                - Alan
                • 20. Re: Slow network connection possibly causing a slave to crash.
                  676400
                   The UK slave is a dual quad-core Intel Xeon E5440 @ 2.83GHz running the following 32-bit distribution…

                  SUSE LINUX Enterprise Server 9 (i586)
                  VERSION = 9
                  PATCHLEVEL = 3
                  LSB_VERSION="core-2.0-noarch:core-3.0-noarch:core-2.0-ia32:core-3.0-ia32"
                  libc:glibc 2.3.5 NPTL 2.3.5

                   The HK slave is a dual dual-core Intel Xeon 5150 @ 2.66GHz running the following 32-bit distribution.

                  Red Hat Enterprise Linux ES release 3 (Taroon Update 8)
                  LSB_VERSION="1.3"
                  libc:glibc 2.3.2 NPTL 0.60

                   So, yes, they are both 64-bit hardware, but both are running a 32-bit OS. It wouldn't surprise me if the Red Hat machine could use some updating; it looks like Red Hat has RPMs for a 32-bit ES release 3. I am unsure if these fixes also apply to the SUSE machine, but the symptoms are the same, so it could be the case.

                   Your thoughts on the matter? I will also discuss with the SAs here...

                  Thanks,

                  Neil
                  • 21. Re: Slow network connection possibly causing a slave to crash.
                    524761
                    Here's a simple experiment that might be interesting:

                    Change the client so that it runs with just one message processing
                    thread (by passing "1" to repmgr_start()). If the crash still
                    happens, then it rules out the mutex bug theory. If the crash does
                    not still happen, I wouldn't call it strong support for the theory;
                    but it would intrigue me enough to suggest trying the diag patch we
                    were contemplating.
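
                     For concreteness, on the C API side the change is just the second argument to repmgr_start(); a minimal sketch, with the wrapper name start_client made up and all environment setup and error handling omitted:

                     #include <db.h>

                     /* Start repmgr as a client with a single message-processing thread,
                      * instead of the several threads the application normally uses. */
                     int start_client(DB_ENV *dbenv)
                     {
                         return dbenv->repmgr_start(dbenv, 1 /* nthreads */, DB_REP_CLIENT);
                     }

                     Your application goes through the Java API, but I believe the same thread-count argument is exposed there as well.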
                    • 22. Re: Slow network connection possibly causing a slave to crash.
                      676400
                      Well,

                      I reran the test with a single replication thread and I am still getting the SIGSEGV, although this time I didn't get the "double free or corruption" message that I normally get out of glibc on this SLES 9 machine...

                      (gdb) thread 5
                      (gdb) bt
                      .
                      .
                      .
                      #9 <signal handler called>
                       #10 0x4009a6a6 in _int_free () from /lib/tls/libc.so.6
                      #11 0x4009ab2b in free () from /lib/tls/libc.so.6
                      #12 0x8c07a843 in __os_free (env=0x0, ptr=0xb7500530) at os_alloc.c:405
                      #13 0x8bfdbe45 in message_loop (env=0x8bdb1fb8) at repmgr_msg.c:49
                      #14 0x8bfdbd74 in __repmgr_msg_thread (args=0x8bdb1fb8) at repmgr_msg.c:29
                      #15 0x4002acb7 in start_thread () from /lib/tls/libpthread.so.0
                      #16 0x400f221e in clone () from /lib/tls/libc.so.6
                      #17 0xafc06bb0 in ?? ()

                      (gdb) select-frame 13
                      (gdb) print msg

                      $1 = (REPMGR_MESSAGE *) 0xb7500530

                      (gdb) print msg->control

                      $2 = {data = 0xb7500570, size = 36, ulen = 0, dlen = 0, doff = 0, app_data = 0x0, flags = 0}

                      (gdb) x/36xb msg->control.data
                      0xb7500570: 0x00 0x00 0x00 0x05 0x00 0x00 0x00 0x0e
                      0xb7500578: 0x00 0x00 0x00 0x27 0x00 0x6a 0x62 0x2e
                      0xb7500580: 0x00 0x00 0x00 0x0b 0x00 0x00 0x00 0x01
                      0xb7500588: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
                      0xb7500590: 0x00 0x00 0x00 0x64

                       This corresponds to LSN [39][6971950]; again, nothing remarkable in the master's log.

                      Do you think it would still be worth looking into the diagnostic patch? I would be willing to apply it if you feel it would be helpful towards resolving this issue. I am fairly confident that these machines have never had glibc patched.

                      Thanks,

                      Neil
                      • 23. Re: Slow network connection possibly causing a slave to crash.
                        676400
                        I ran it a second time with the single replication thread and I did get the glibc diagnostics on this run…

                        *** glibc detected *** corrupted double-linked list: 0x8be09450 ***
                        *** glibc detected *** double free or corruption (out): 0x8be09458 ***

                        I also got a different back trace!

                        #9 <signal handler called>
                         #10 0x4009b388 in _int_malloc () from /lib/tls/libc.so.6
                        #11 0x4009ca14 in malloc () from /lib/tls/libc.so.6
                        #12 0x40061763 in qsort () from /lib/tls/libc.so.6
                        #13 0x8c0383e2 in __rep_process_txn (env=0x8b5ca290, rec=0x84576dc) at rep_record.c:1452
                        #14 0x8c03943f in __rep_process_rec (env=0x8b5ca290, ip=0x0, rp=0xafc6d2e0, rec=0x84576dc, ret_tsp=0xafc6d118, ret_lsnp=0xafc6d140) at rep_record.c:1940
                        #15 0x8c037110 in __rep_apply (env=0x8b5ca290, ip=0x0, rp=0xafc6d2e0, rec=0x84576dc, ret_lsnp=0xafc6d3c8, is_dupp=0xafc6d208, last_lsnp=0xafc6d220)
                        at rep_record.c:1052
                        #16 0x8c02ecf6 in __rep_log (env=0x8b5ca290, ip=0x0, rp=0xafc6d2e0, rec=0x84576dc, savetime=1284755753, ret_lsnp=0xafc6d3c8) at rep_log.c:237
                        #17 0x8c0350ca in __rep_process_message (dbenv=0x8b5c9f00, control=0x84576c0, rec=0x84576dc, eid=0, ret_lsnp=0xafc6d3c8) at rep_record.c:575
                        #18 0x8c042ec2 in process_message (env=0x8b5ca290, control=0x84576c0, rec=0x84576dc, eid=0) at repmgr_msg.c:78
                        #19 0x8c042e00 in message_loop (env=0x8b5ca290) at repmgr_msg.c:44
                        #20 0x8c042d74 in __repmgr_msg_thread (args=0x8b5ca290) at repmgr_msg.c:29
                        #21 0x4002acb7 in start_thread () from /lib/tls/libpthread.so.0
                        #22 0x400f221e in clone () from /lib/tls/libc.so.6
                        #23 0xafc6dbb0 in ?? ()

                        (gdb) print msg
                        $1 = (REPMGR_MESSAGE *) 0x84576b8
                        (gdb) print msg->control
                        $2 = {data = 0x84576f8, size = 36, ulen = 0, dlen = 0, doff = 0, app_data = 0x0, flags = 0}
                        (gdb) x/36xb msg->control.data
                        0x84576f8: 0x00 0x00 0x00 0x05 0x00 0x00 0x00 0x0e
                        0x8457700: 0x00 0x00 0x00 0x30 0x00 0x7f 0xb7 0x19
                        0x8457708: 0x00 0x00 0x00 0x0b 0x00 0x00 0x00 0x01
                        0x8457710: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
                        0x8457718: 0x00 0x00 0x00 0x64

                         This corresponds to LSN [48][8369945]; nothing remarkable in the master's log.

                        The address at 0x8be09458 that glibc complained about certainly does not appear to be a replication message.

                        Something is starting to feel very "off" here.

                        Thanks,

                        Neil
                        • 24. Re: Slow network connection possibly causing a slave to crash.
                          524761
                          I've been corresponding with Neil "off-line", but for the benefit of the reading public, this problem turns out to have been a bug in Berkeley DB that caused a commit log record at the master to be formatted incorrectly, leading the client to overwrite memory outside the bounds of the malloc'ed buffer. It was a known bug, #18154, and was fixed in Berkeley DB version 5.0.