Master node in 3-node BDB JE HA cluster unexpectedly fails with: Halted log file reading at file 0x2c51 offset 0x34b793 offset(decimal)=3454867 prev=0x34b4f3

940250

    Hi,

    As per subject the master node in 3-node BDB JE HA group had failed with the following:

     

    2019-01-14 19:07:19,296 ERROR [Feeder Output for node_3] (c.s.j.r.i.RepImpl) - [node_2] Halted log file reading at file 0x2c51 offset 0x34b793 offset(decimal)=3454867 prev=0x34b4f3:

    entry=INS_LN_TXtype=33,version=8)

    prev=0x34b4f3

    size=3883

    Next entry should be at 0x34c6d4

    com.sleepycat.je.EnvironmentFailureException: (JE 5.0.104) want to read 151,849,824 but reader at 151,849,829 UNEXPECTED_STATE: Unexpected internal state, may have side effects.

            at com.sleepycat.je.EnvironmentFailureException.unexpectedState(EnvironmentFailureException.java:426)

            at com.sleepycat.je.rep.stream.FeederReader.checkForPassingTarget(FeederReader.java:288)

            at com.sleepycat.je.rep.stream.FeederReader.isTargetEntry(FeederReader.java:309)

            at com.sleepycat.je.log.FileReader.readNextEntryAllowExceptions(FileReader.java:297)

            at com.sleepycat.je.log.FileReader.readNextEntry(FileReader.java:229)

            at com.sleepycat.je.rep.stream.FeederReader.scanForwards(FeederReader.java:272)

            at com.sleepycat.je.rep.stream.MasterFeederSource.getWireRecord(MasterFeederSource.java:64)

            at com.sleepycat.je.rep.impl.node.Feeder$OutputThread.run(Feeder.java:761)

     

    2019-01-14 19:07:19,298 ERROR [Feeder Output for node_3] (c.s.j.r.i.n.Feeder) - [node_2] Unexpected exception: (JE 5.0.104) want to read 151,849,824 but reader at 151,849,829 UNEXPECTED_STATE: Unexpected internal state, may have side effects. MasterFeederSource fetching vlsn=151,849,824 waitTime=1000com.sleepycat.je.EnvironmentFailureException: (JE 5.0.104) want to read 151,849,824 but reader at 151,849,829 UNEXPECTED_STATE: Unexpected internal state, may have side effects. MasterFeederSource fetching vlsn=151,849,824 waitTime=1000

            at com.sleepycat.je.EnvironmentFailureException.unexpectedState(EnvironmentFailureException.java:426)

            at com.sleepycat.je.rep.stream.FeederReader.checkForPassingTarget(FeederReader.java:288)

            at com.sleepycat.je.rep.stream.FeederReader.isTargetEntry(FeederReader.java:309)

            at com.sleepycat.je.log.FileReader.readNextEntryAllowExceptions(FileReader.java:297)

            at com.sleepycat.je.log.FileReader.readNextEntry(FileReader.java:229)

            at com.sleepycat.je.rep.stream.FeederReader.scanForwards(FeederReader.java:272)

            at com.sleepycat.je.rep.stream.MasterFeederSource.getWireRecord(MasterFeederSource.java:64)

            at com.sleepycat.je.rep.impl.node.Feeder$OutputThread.run(Feeder.java:761)

     

    2019-01-14 19:07:19,299 ERROR [Feeder Output for node_3] (c.s.j.r.i.n.Feeder) - [node_2] Uncaught exception in feeder thread Thread[Feeder Output for node_3,5,main](JE 5.0.104) want to read 151,849,824 but reader at 151,849,829 UNEXPECTED_STATE: Unexpected internal state, may have side effects. MasterFeederSource fetching vlsn=151,849,824 waitTime=1000com.sleepycat.je.EnvironmentFailureException: (JE 5.0.104) want to read 151,849,824 but reader at 151,849,829 UNEXPECTED_STATE: Unexpected internal state, may have side effects. MasterFeederSource fetching vlsn=151,849,824 waitTime=1000

            at com.sleepycat.je.EnvironmentFailureException.unexpectedState(EnvironmentFailureException.java:426)

            at com.sleepycat.je.rep.stream.FeederReader.checkForPassingTarget(FeederReader.java:288)

            at com.sleepycat.je.rep.stream.FeederReader.isTargetEntry(FeederReader.java:309)

            at com.sleepycat.je.log.FileReader.readNextEntryAllowExceptions(FileReader.java:297)

            at com.sleepycat.je.log.FileReader.readNextEntry(FileReader.java:229)

            at com.sleepycat.je.rep.stream.FeederReader.scanForwards(FeederReader.java:272)

            at com.sleepycat.je.rep.stream.MasterFeederSource.getWireRecord(MasterFeederSource.java:64)

            at com.sleepycat.je.rep.impl.node.Feeder$OutputThread.run(Feeder.java:761)

     

     

    It looks like master node "node_2"  failed to read data from the disk in order to send it over to "node_3".

     

    The master node detached after that. The impacted environment was restarted without any issue and was able to operate as Replica.

     

    As per logs the BDB JE version in use is 5.0.104.

     

    What could be root cause of the issue with unexpected read offset ( want to read 151,849,824 but reader at 151,849,829)? Is it an existing defect or some sort of environmental issue?

    If it is a defect, would upgrade to the latest 7.x version can solve the problem?

     

    Kind Regards,

    Alex

     

    Message was edited by: 940250

      • 1. Re: Master node in 3-node BDB JE HA cluster unexpectedly fails with: Halted log file reading at file 0x2c51 offset 0x34b793 offset(decimal)=3454867 prev=0x34b4f3
        3886863

        We are running on 7.4.5 version of JE and still saw this issue in our production environment on Friday. I double if version upgrade is gonna solve this issue.

        • 2. Re: Master node in 3-node BDB JE HA cluster unexpectedly fails with: Halted log file reading at file 0x2c51 offset 0x34b793 offset(decimal)=3454867 prev=0x34b4f3
          Greybird-Oracle

          70b2bf78-1fd8-45d0-85ca-769a33cd0c35, could you please post the full stack trace that you're seeing in JE 7.4.5? It may provide a clue if we compare it to the stack trace of the OP.

           

          I've searched our bug reports and elsewhere, and these two reports are the only reports we've seen of this particular problem.

           

          --mark

          • 3. Re: Master node in 3-node BDB JE HA cluster unexpectedly fails with: Halted log file reading at file 0x2c51 offset 0x34b793 offset(decimal)=3454867 prev=0x34b4f3
            Greybird-Oracle

            Alex / 940250,

             

            If you want to keep pursuing this, there is some information you could dump from the jdb files where the problem occurred, that might help us to narrow this down. Do you still have the 0x2c51 file (00002c51.jdb)?

             

            If so, please run the following command and direct output to a file (it will be large).

             

            java -jar DbPrintLog -h <DIR-WITH-JDB-FILE> -s 0x2c51 -e 0x2c52 > outfile

             

            In the outfile there will be lines that look like this:

             

            <entry lsn="0x2c51/0x115" ...lots of data here.../>

             

            The lsn is the file number / offset of the entry. Look for an entry with offset 0x34b793, or at close to that as you can find, and please post several entries before and after it.

             

            You'll see VLSNs in the output as well and we'll be looking for the 151,849,824 and 151,849,829 VLSN values. One of these VLSNs should be in the entries that I asked for above.

             

            Looking at this should tell us whether the information in the file itself is corrupted, or the problem is elsewhere.

             

            Thanks,

            --mark

            • 4. Re: Master node in 3-node BDB JE HA cluster unexpectedly fails with: Halted log file reading at file 0x2c51 offset 0x34b793 offset(decimal)=3454867 prev=0x34b4f3
              940250

              Hi Mark,

              The jdb file has gone.

              I will try to collect the jdb files on next issue re-occurrence. It happens once or twice per month.

               

              Kind Regards,

              Alex

              • 5. Re: Master node in 3-node BDB JE HA cluster unexpectedly fails with: Halted log file reading at file 0x2c51 offset 0x34b793 offset(decimal)=3454867 prev=0x34b4f3
                3886863

                Hello Mark,

                Below is he entire stack trace of the server. I wanted to try the steps you mentioned above but the JDB files are gone for me as well. I will do that next time it happens.

                2019-02-08 15:43:27.774 UTC SEVERE [tds5_B] Halted log file reading at file 0x6028b9 offset 0x7fb73e offset(decimal)=8369982 prev=0x7fa621:

                entry=UPD_LN_TXtype=35,version=14)

                prev=0x7fa621

                size=223

                Next entry should be at 0x7fb833

                com.sleepycat.je.EnvironmentFailureException: (JE 7.4.5) want to read 1,658,315,645 but reader at 1,658,315,647 UNEXPECTED_STATE: Unexpected internal state, may have side effects.

                        at com.sleepycat.je.EnvironmentFailureException.unexpectedState(EnvironmentFailureException.java:428)

                        at com.sleepycat.je.rep.stream.FeederReader.checkForPassingTarget(FeederReader.java:297)

                        at com.sleepycat.je.rep.stream.FeederReader.isTargetEntry(FeederReader.java:317)

                        at com.sleepycat.je.log.FileReader.readNextEntryAllowExceptions(FileReader.java:332)

                        at com.sleepycat.je.log.FileReader.readNextEntry(FileReader.java:245)

                        at com.sleepycat.je.rep.stream.FeederReader.scanForwards(FeederReader.java:280)

                        at com.sleepycat.je.rep.stream.MasterFeederSource.getWireRecord(MasterFeederSource.java:70)

                        at com.sleepycat.je.rep.impl.node.Feeder$OutputThread.writeAvailableEntries(Feeder.java:1266)

                        at com.sleepycat.je.rep.impl.node.Feeder$OutputThread.run(Feeder.java:1144)

                 

                2019-02-08 15:43:27.822 UTC SEVERE [tds5_B] checkpointId=1658750

                com.sleepycat.je.EnvironmentFailureException: (JE 7.4.5) Environment must be closed, caused by: com.sleepycat.je.EnvironmentFailureException: Environment invalid because of previous exception: (JE 7.4.5) tds5_B(6): com.sleepycat.je.EnvironmentFailureException: (JE 7.4.5) want to read 1,658,315,645 but reader at 1,658,315,647 UNEXPECTED_STATE: Unexpected internal state, may have side effects. MasterFeederSource fetching vlsn=1,658,315,645 waitTime=1000 Uncaught exception in feeder thread:Thread[Feeder Output for tds5_P,5,main] Originally thrown by HA thread: MASTER tds5_B(6) UNCAUGHT_EXCEPTION: Uncaught Exception in internal thread, unable to continue. Environment is invalid and must be closed.

                        at com.sleepycat.je.EnvironmentFailureException.wrapSelf(EnvironmentFailureException.java:228)

                        at com.sleepycat.je.dbi.EnvironmentImpl.checkIfInvalid(EnvironmentImpl.java:1766)

                        at com.sleepycat.je.cleaner.Cleaner.deleteSafeToDeleteFiles(Cleaner.java:669)

                        at com.sleepycat.je.cleaner.Cleaner.updateFilesAtCheckpointEnd(Cleaner.java:1022)

                        at com.sleepycat.je.recovery.Checkpointer.doCheckpoint(Checkpointer.java:910)

                        at com.sleepycat.je.recovery.Checkpointer.onWakeup(Checkpointer.java:559)

                        at com.sleepycat.je.utilint.DaemonThread.run(DaemonThread.java:154)

                        at java.lang.Thread.run(Thread.java:748)

                Caused by: com.sleepycat.je.EnvironmentFailureException: Environment invalid because of previous exception: (JE 7.4.5) tds5_B(6):tds5_B/bdb com.sleepycat.je.EnvironmentFailureException: (JE 7.4.5) want to read 1,658,315,645 but reader at 1,658,315,647 UNEXPECTED_STATE: Unexpected internal state, may have side effects. MasterFeederSource fetching vlsn=1,658,315,645 waitTime=1000 Uncaught exception in feeder thread:Thread[Feeder Output for tds5_P,5,main] Originally thrown by HA thread: MASTER tds5_B(6) UNCAUGHT_EXCEPTION: Uncaught Exception in internal thread, unable to continue. Environment is invalid and must be closed.

                        at com.sleepycat.je.utilint.StoppableThread.handleUncaughtException(StoppableThread.java:243)

                        at com.sleepycat.je.utilint.StoppableThread$UncaughtHandler.uncaughtException(StoppableThread.java:274)

                        at java.lang.Thread.dispatchUncaughtException(Thread.java:1959)

                Caused by: com.sleepycat.je.EnvironmentFailureException: (JE 7.4.5) want to read 1,658,315,645 but reader at 1,658,315,647 UNEXPECTED_STATE: Unexpected internal state, may have side effects. MasterFeederSource fetching vlsn=1,658,315,645 waitTime=1000 Uncaught exception in feeder thread:Thread[Feeder Output for tds5_P,5,main] Originally thrown by HA thread: MASTER tds5_B(6)

                        at com.sleepycat.je.EnvironmentFailureException.unexpectedState(EnvironmentFailureException.java:428)

                        at com.sleepycat.je.rep.stream.FeederReader.checkForPassingTarget(FeederReader.java:297)

                        at com.sleepycat.je.rep.stream.FeederReader.isTargetEntry(FeederReader.java:317)

                        at com.sleepycat.je.log.FileReader.readNextEntryAllowExceptions(FileReader.java:332)

                        at com.sleepycat.je.log.FileReader.readNextEntry(FileReader.java:245)

                        at com.sleepycat.je.rep.stream.FeederReader.scanForwards(FeederReader.java:280)

                        at com.sleepycat.je.rep.stream.MasterFeederSource.getWireRecord(MasterFeederSource.java:70)

                        at com.sleepycat.je.rep.impl.node.Feeder$OutputThread.writeAvailableEntries(Feeder.java:1266)

                        at com.sleepycat.je.rep.impl.node.Feeder$OutputThread.run(Feeder.java:1144)

                • 6. Re: Master node in 3-node BDB JE HA cluster unexpectedly fails with: Halted log file reading at file 0x2c51 offset 0x34b793 offset(decimal)=3454867 prev=0x34b4f3
                  Greybird-Oracle

                  Thanks for this additional info. It is interesting that the VLSNs are only 2 apart, rather then 4 apart in the other case.