5 Replies Latest reply on Jul 2, 2010 6:41 PM by 807557

    Core Dump on Solaris 10 (Signal 10 - Bus Error), but not on Solaris 8?

    807557
      Hi,

      We just moved our product from Solaris 8 to Solaris 10. It ran for months on Solaris 8 without any problems, but it dumped core after running for about 2 weeks on Solaris 10.

      Any clue as to what could be wrong is appreciated.

      pam
        • 1. Re: Core Dump on Solaris 10 (Signal 10 - Bus Error), but not on Solaris 8?
          800381
          While it's impossible to be certain with absolutely no details, the most likely cause is a latent bug in your application that is exposed when it runs in a different environment.

          If you want more than that, you'll have to post some details.
          • 2. Re: Core Dump on Solaris 10 (Signal 10 - Bus Error), but not on Solaris 8?
            807557
            Hi Andrew,

            Appreciate your answer very much. I am very new to Solaris and UNIX in general. Would you please let me know what kind of info would help diagnose the problem? I have the stack pointer, the frame pointer, the output of "where" from gdb, etc.

            pam

            ===================
            GNU gdb 6.3
            Copyright 2004 Free Software Foundation, Inc.
            GDB is free software, covered by the GNU General Public License, and you are
            welcome to change it and/or distribute copies of it under certain conditions.
            Type "show copying" to see the conditions.
            There is absolutely no warranty for GDB. Type "show warranty" for details.
            This GDB was configured as "sparc-sun-solaris2.8"...(no debugging symbols found)

            Core was generated by `./warnsrvr'.
            Program terminated with signal 10, Bus error.
            #0 0x001a3ca8 in __1cSWPReferenceManagerMInputUtcInfo6MrknIUTC_INFO_khki_nGRESULT__ ()
            (gdb) where
            #0 0x001a3ca8 in __1cSWPReferenceManagerMInputUtcInfo6MrknIUTC_INFO_khki_nGRESULT__ ()
            #1 0x001a3ca8 in __1cSWPReferenceManagerMInputUtcInfo6MrknIUTC_INFO_khki_nGRESULT__ ()
            Previous frame identical to this frame (corrupt stack?)

            ======================================

            (gdb) disassemble 0x001a3ca8
            Dump of assembler code for function __1cSWPReferenceManagerMInputUtcInfo6MrknIUTC_INFO_khki_nGRESULT__:
            0x001a3c24 <__1cSWPReferenceManagerMInputUtcInfo6MrknIUTC_INFO_khki_nGRESULT__+0>: cmp %o0, 1
            0x001a3c28 <__1cSWPReferenceManagerMInputUtcInfo6MrknIUTC_INFO_khki_nGRESULT__+4>: be,pn %icc, 0x1a3c38 <__1cSWPReferenceManagerMInputUtcInfo6MrknIUTC_INFO_khki_nGRESULT__+20>
            0x001a3c2c <__1cSWPReferenceManagerMInputUtcInfo6MrknIUTC_INFO_khki_nGRESULT__+8>: sethi %hi(0x572000), %l6
            0x001a3c30 <__1cSWPReferenceManagerMInputUtcInfo6MrknIUTC_INFO_khki_nGRESULT__+12>: ret
            0x001a3c34 <__1cSWPReferenceManagerMInputUtcInfo6MrknIUTC_INFO_khki_nGRESULT__+16>: restore %g0, 0, %o0
            0x001a3c38 <__1cSWPReferenceManagerMInputUtcInfo6MrknIUTC_INFO_khki_nGRESULT__+20>: ld [ %l6 + 0x358 ], %l5
            0x001a3c3c <__1cSWPReferenceManagerMInputUtcInfo6MrknIUTC_INFO_khki_nGRESULT__+24>: cmp %l5, 0
            0x001a3c40 <__1cSWPReferenceManagerMInputUtcInfo6MrknIUTC_INFO_khki_nGRESULT__+28>: bne,pn %icc, 0x1a3c50 <__1cSWPReferenceManagerMInputUtcInfo6MrknIUTC_INFO_khki_nGRESULT__+44>
            0x001a3c44 <__1cSWPReferenceManagerMInputUtcInfo6MrknIUTC_INFO_khki_nGRESULT__+32>: cmp %l5, 1
            0x001a3c48 <__1cSWPReferenceManagerMInputUtcInfo6MrknIUTC_INFO_khki_nGRESULT__+36>: ret
            0x001a3c4c <__1cSWPReferenceManagerMInputUtcInfo6MrknIUTC_INFO_khki_nGRESULT__+40>: restore %g0, 1, %o0
            0x001a3c50 <__1cSWPReferenceManagerMInputUtcInfo6MrknIUTC_INFO_khki_nGRESULT__+44>: bne,pn %icc, 0x1a3cb0 <__1cSWPReferenceManagerOInputIonoModel6MrknKIONO_MODEL_khki_nGRESULT__+4>
            0x001a3c54 <__1cSWPReferenceManagerMInputUtcInfo6MrknIUTC_INFO_khki_nGRESULT__+48>: sethi %hi(0x1a3c00), %l7
            0x001a3c58 <__1cSWPReferenceManagerMInputUtcInfo6MrknIUTC_INFO_khki_nGRESULT__+52>: mov -1, %i1
            0x001a3c5c <__1cSWPReferenceManagerMInputUtcInfo6MrknIUTC_INFO_khki_nGRESULT__+56>: sth %i1, [ %fp + -1864 ]
            0x001a3c60 <__1cSWPReferenceManagerMInputUtcInfo6MrknIUTC_INFO_khki_nGRESULT__+60>: add %fp, -1824, %o0
            0x001a3c64 <__1cSWPReferenceManagerMInputUtcInfo6MrknIUTC_INFO_khki_nGRESULT__+64>: ldd [ %l7 + 8 ], %f0
            0x001a3c68 <__1cSWPReferenceManagerMInputUtcInfo6MrknIUTC_INFO_khki_nGRESULT__+68>: call 0x1c87f0 <___const_seg_900001301+16>
            0x001a3c6c <__1cSWPReferenceManagerMInputUtcInfo6MrknIUTC_INFO_khki_nGRESULT__+72>: std %f0, [ %fp + -1856 ]
            0x001a3c70 <__1cSWPReferenceManagerMInputUtcInfo6MrknIUTC_INFO_khki_nGRESULT__+76>: call 0x1cf888 <__1cKIono2Ascii6FrknKIONO_MODEL_pcki_v_+100>
            0x001a3c74 <__1cSWPReferenceManagerMInputUtcInfo6MrknIUTC_INFO_khki_nGRESULT__+80>: add %fp, -1864, %o0
            0x001a3c78 <__1cSWPReferenceManagerMInputUtcInfo6MrknIUTC_INFO_khki_nGRESULT__+84>: sllx %i2, 0x30, %o1
            0x001a3c7c <__1cSWPReferenceManagerMInputUtcInfo6MrknIUTC_INFO_khki_nGRESULT__+88>: mov %i0, %o0
            0x001a3c80 <__1cSWPReferenceManagerMInputUtcInfo6MrknIUTC_INFO_khki_nGRESULT__+92>: srax %o1, 0x30, %o1
            0x001a3c84 <__1cSWPReferenceManagerMInputUtcInfo6MrknIUTC_INFO_khki_nGRESULT__+96>: call 0x182cb0 <__1cTGenReferenceManagerOGetCorrections6MrknKCLSGpsTime_rknICLSCoord_rnOCORRECTION_SET__nGRESULT__+3504>
            0x001a3c88 <__1cSWPReferenceManagerMInputUtcInfo6MrknIUTC_INFO_khki_nGRESULT__+100>: add %fp, -1864, %o2
            0x001a3c8c <__1cSWPReferenceManagerMInputUtcInfo6MrknIUTC_INFO_khki_nGRESULT__+104>: cmp %o0, 1
            0x001a3c90 <__1cSWPReferenceManagerMInputUtcInfo6MrknIUTC_INFO_khki_nGRESULT__+108>: be,pn %icc, 0x1a3ca0 <__1cSWPReferenceManagerMInputUtcInfo6MrknIUTC_INFO_khki_nGRESULT__+124>
            0x001a3c94 <__1cSWPReferenceManagerMInputUtcInfo6MrknIUTC_INFO_khki_nGRESULT__+112>: mov %i0, %o0
            0x001a3c98 <__1cSWPReferenceManagerMInputUtcInfo6MrknIUTC_INFO_khki_nGRESULT__+116>: ret
            0x001a3c9c <__1cSWPReferenceManagerMInputUtcInfo6MrknIUTC_INFO_khki_nGRESULT__+120>: restore %g0, 0, %o0
            0x001a3ca0 <__1cSWPReferenceManagerMInputUtcInfo6MrknIUTC_INFO_khki_nGRESULT__+124>: call 0x1a7068 <__1cSWPReferenceManagerPAdjustXTRATimes6MpnZCLSGnssSatellitePredictor_khrnKCLSGpsTime_rd_nGRESULT__+3584>
            0x001a3ca4 <__1cSWPReferenceManagerMInputUtcInfo6MrknIUTC_INFO_khki_nGRESULT__+128>: add %fp, -1864, %o1
            0x001a3ca8 <__1cSWPReferenceManagerMInputUtcInfo6MrknIUTC_INFO_khki_nGRESULT__+132>: ret
            End of assembler dump.

            =====================================
            (gdb) info registers
            g0 0x0 0
            g1 0xfd77ecb8 -42472264
            g2 0x11 17
            g3 0xecf0 60656
            g4 0xfd77e304 -42474748
            g5 0xfc 252
            g6 0x0 0
            g7 0xfeda4200 -19250688
            o0 0x1 1
            o1 0x20 32
            o2 0x693500 6894848
            o3 0xecf0 60656
            o4 0x6a21f0 6955504
            o5 0x1 1
            sp 0xfd77ec38 0xfd77ec38
            o7 0x1a3ca0 1719456
            l0 0x1b000021 452984865
            l1 0x2ca2a40 46803520
            l2 0x40173076 1075261558
            l3 0x57d22e16 1473392150
            l4 0x3fa80492 1067975826
            l5 0x3c06fe49 1007091273
            l6 0x4072b4a5 1081259173
            l7 0x410d711d 1091399965
            i0 0x421c0000 1109131264
            i1 0xffffffff -1
            i2 0x1d000018 486539288
            i3 0x2ca2808 46802952
            i4 0x40138e4a 1075023434
            i5 0x8b122e16 -1961742826
            fp 0xbfb17c3b 0xbfb17c3b
            i7 0xa817f0db -1474826021
            y 0x3 3
            psr 0xfe401007 -29356025
            wim 0x0 0
            tbr 0x0 0
            pc 0x1a3ca8 0x1a3ca8 <__1cSWPReferenceManagerMInputUtcInfo6MrknIUTC_INFO_khki_nGRESULT__+132>
            npc 0x1a3cac 0x1a3cac <__1cSWPReferenceManagerOInputIonoModel6MrknKIONO_MODEL_khki_nGRESULT__>
            fsr 0x400420 4195360
            csr 0x0 0
            • 3. Re: Core Dump on Solaris 10 (Signal 10 - Bus Error), but not on Solaris 8?
              807557
              Continued from the last message, since it exceeded the maximum allowed number of characters.

              =========================
              (gdb) x /32xw $sp
              0xfd77ec38: 0x1b000021 0x02ca2a40 0x40173076 0x57d22e16
              0xfd77ec48: 0x3fa80492 0x3c06fe49 0x4072b4a5 0x410d711d
              0xfd77ec58: 0x421c0000 0xffffffff 0x1d000018 0x02ca2808
              0xfd77ec68: 0x40138e4a 0x8b122e16 0xbfb17c3b 0xa817f0db
              0xfd77ec78: 0x4074f394 0x411189e8 0x42380000 0x00860000
              0xfd77ec88: 0x00001000 0xffffffff 0xffffffff 0xffffffff
              0xfd77ec98: 0x003b6196 0x00000000 0x0633ffff 0xfd77f940
              0xfd77eca8: 0x411d0e20 0x00000000 0x00000001 0x00000002
              ================================
              (gdb) x $fp
              0xbfb17c3b: Cannot access memory at address 0xbfb17c3b
              • 4. Re: Core Dump on Solaris 10 (Signal 10 - Bus Error), but not on Solaris 8?
                800381
                It looks like something overwrote the return address of the method the app is in when it failed. Since the return address is stored on the stack, it's almost certainly a local variable being accessed incorrectly. In my experience the most likely cause is incorrect string manipulation with calls like strcpy() or sprintf() putting too many characters into a fixed-length string, or a boundary-condition failure on some array.
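
                As an illustration of the string case, here is a minimal sketch of a bounded copy into a fixed-length buffer. The helper name copy_bounded() is hypothetical and not from the application in this thread; it only shows how snprintf() can refuse to write past the end of a local array where strcpy() or sprintf() would keep going into the rest of the stack frame.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not from the application discussed here):
 * copy src into a fixed-size buffer without writing past
 * dst[dst_size - 1].
 * Returns 0 if the whole string fit, -1 on truncation or bad arguments. */
int copy_bounded(char *dst, size_t dst_size, const char *src)
{
    int n;

    if (dst == NULL || src == NULL || dst_size == 0)
        return -1;

    /* snprintf() writes at most dst_size bytes, including the NUL. */
    n = snprintf(dst, dst_size, "%s", src);
    if (n < 0)
        return -1;              /* output error */
    if ((size_t)n >= dst_size)
        return -1;              /* truncated, but dst is NUL-terminated */
    return 0;
}
```

                A caller would pass sizeof on the destination array, for example copy_bounded(name, sizeof name, input), and treat -1 as an error instead of silently continuing with a truncated value.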

                Hopefully that's enough for you to find the problem. Maybe you have file or directory names that are longer on the new Solaris 10 server? Maybe some environment variable that you use has gotten longer? Maybe you're running with a limit on your stack size and something in Solaris 10 takes up more stack space than running under Solaris 8 did?
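
                To compare the stack size limit between the two servers, something like the following sketch could be compiled and run on each one. It uses the standard getrlimit() call with RLIMIT_STACK; the function names are made up for the example. Note that this limit governs the main thread's stack, while stacks of other threads are sized when the thread is created, so it may not be the whole story for a threaded server.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Print one limit value, handling the "unlimited" case. */
static void print_one(const char *label, rlim_t value)
{
    if (value == RLIM_INFINITY)
        printf("%s stack limit: unlimited\n", label);
    else
        printf("%s stack limit: %llu bytes\n", label,
               (unsigned long long)value);
}

/* Hypothetical helper: report the soft and hard stack-size limits of
 * the calling process. Returns 0 on success, -1 if getrlimit() fails. */
int print_stack_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_STACK, &rl) != 0) {
        perror("getrlimit");
        return -1;
    }
    print_one("soft", rl.rlim_cur);
    print_one("hard", rl.rlim_max);
    return 0;
}
```

                Calling print_stack_limit() from main() in the same shell environment that starts the server shows what the process actually inherits. The soft value is the same limit that the shell's "ulimit -s" reports, though the shell usually shows it in kilobytes.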

                If you can't find it manually, you can use some memory-checking tools. It's going to be hard to find using the tools that come with Solaris and Sun Studio, however. All the memory-checking tools there are focused on heap memory that is managed with malloc()/free() or new/delete, not on local variables and stack memory. IBM's (originally Rational) Purify does a great job of finding these kinds of problems, but it's not an easy tool to use effectively. In fact, it can find so many errors that I've seen programmers ignore its findings because they believed there was no way their code could be that bad. FWIW, it was that bad.
                • 5. Re: Core Dump on Solaris 10 (Signal 10 - Bus Error), but not on Solaris 8?
                  807557
                  It turns out that it does not have much to do with Solaris 10. One of the arrays was being overrun. After we built the release version with debug symbols, we identified the exact array. Appreciate your help very much.