5 Replies Latest reply on Dec 7, 2016 11:56 AM by ChrisJenkins-Oracle

    TT recovery failed

    user5106620

      Hello,

       

      TimesTen 11.2.2.8.13 x64 on Windows R2.

       

      TT failed because the filesystem filled up: TT had created a lot of transaction *.log files (I do not know why).

      After that I disconnected all clients, set all TT policies to manual, and then ran:

      ttdaemonadmin -stop

       

      Then I increased the filesystem size and restarted the daemon:

       

      ttdaemonadmin -start

       

      ttstatus:

       

      Data store c:\timesten\datastore\ttlogic2\data

      There are no connections to the data store

      RAM residence policy: Manual

      Data store is manually unloaded from RAM

      Replication policy  : Manual

      Cache Agent policy  : Manual

      PL/SQL enabled.

       

      I am trying to load the data store into memory:

       

      ttadmin -ramLoad dsn=ttlogic2

       

      but I get the following fragment repeated in errors.log:

       

      15:42:49.69 Warn:    :  2108: 3936/00000000009CC430: RECOVERY: Recovery triggered by fuzzy checkpoint.

      15:44:04.00 Warn:    :  2108: 3936/00000000009CC430: RECOVERY: Recovery started

      15:46:32.43 Err :    :  2108: 3936/00000000009CC430: Assertion failed: (!(((sbHBHdr_p) bufp)->flags & (((sbHBFlag_t) 0x1)|((sbHBFlag_t) 0x2)))) [heap.c:/st_timesten_11.2.2.8/7:sbLCBHpCoalesce:9802] PID 3936 (timestensubd1122) CONN 175 (Manager) 2016-11-21 15:46:32.437

      15:46:32.49 Err :    :  2108: 3936/00000000009CC430: Recovery failed during the redo operation of sbLRHpCoalesce at LSN=2433.322931872

      15:46:32.51 Err :    :  2108: 3936/00000000009CC430: Data store marked invalid [heap.c:/st_timesten_11.2.2.8/7:sbLCBHpCoalesce:9802] PID 3936 (timestensubd1122) CONN 175 (Manager) Context 0x9cc430

      15:46:32.71 Warn:    :  2108: 3936/00000000009CC430: RECOVERY: sbDbRedo: Redo phase encountered the following errors/warnings:

      15:46:32.71 Warn:    :  2108: 3936/00000000009CC430: TT0994: Data store connection terminated. Please reconnect. -- file "heap.c", lineno 9802, procedure "sbLCBHpCoalesce"

      15:46:32.71 Warn:    :  2108: 3936/00000000009CC430: TT4053: Internal error: Assertion failed: (!(((sbHBHdr_p) bufp)->flags & (((sbHBFlag_t) 0x1)|((sbHBFlag_t) 0x2)))) [heap.c:/st_timesten_11.2.2.8/7:sbLCBHpCoalesce:9802] PID 3936 (timestensubd1122) CONN 175 (Manager) 2016-11-21 15:46:32.437 -- file "util.c", lineno 707, procedure "sbUtAssertionReport"

      15:46:32.71 Err :    :  2108: 3936/00000000009CC430: RECOVERY: recovery failed on redo of LSN 2433.322931872

      15:46:32.71 Warn:    :  2108: 3936/00000000009CC430: Giving up on file 0 because of recovery failure.

      15:46:32.71 Err :    :  2108: 3936/00000000009CC430: Errors/warnings from previous recovery attempt follow.

      15:46:32.71 Err :    :  2108: 3936/00000000009CC430: TT0994: Data store connection terminated. Please reconnect. -- file "heap.c", lineno 9802, procedure "sbLCBHpCoalesce"

      15:46:32.71 Err :    :  2108: 3936/00000000009CC430: TT4053: Internal error: Assertion failed: (!(((sbHBHdr_p) bufp)->flags & (((sbHBFlag_t) 0x1)|((sbHBFlag_t) 0x2)))) [heap.c:/st_timesten_11.2.2.8/7:sbLCBHpCoalesce:9802] PID 3936 (timestensubd1122) CONN 175 (Manager) 2016-11-21 15:46:32.437 -- file "util.c", lineno 707, procedure "sbUtAssertionReport"

      15:46:33.45 Err :    :  3936: subd: Error identified in [sub.c: line 2490]

      15:46:33.45 Err :    :  3936: subd: (Error 994): TT0994: Data store connection terminated. Please reconnect. -- file "heap.c", lineno 9802, procedure "sbLCBHpCoalesce"

      15:46:33.45 Err :    :  3936:  -- file "heap.c", lineno 9802, procedure "sbLCBHpCoalesce"

      15:46:33.45 Err :    :  3936: subd: (Error 4053): TT4053: Internal error: Assertion failed: (!(((sbHBHdr_p) bufp)->flags & (((sbHBFlag_t) 0x1)|((sbHBFlag_t) 0x2)))) [heap.c:/st_timesten_11.2.2.8/7:sbLCBHpCoalesce:9802] PID 3936 (timestensubd1122) CONN 175 (Manager) 2016-11-21 15:46:32.437 -- file "util.c", lineno 707, procedure "sbUtAssertionReport"

      15:46:33.45 Err :    :  3936:  -- file "util.c", lineno 707, procedure "sbUtAssertionReport"

      15:46:33.45 Err :    :  3936: subd: (Error 848): TT0848: Recovery failed on 1 set(s) of data store files; the TimesTen user error log has more information -- file "db.c", lineno 10535, procedure "sbDbConnect"

      15:46:33.45 Err :    :  3936:  -- file "db.c", lineno 10535, procedure "sbDbConnect"

      15:46:33.45 Warn:    :  3936: subd: connect trouble, rc 1, reason 994

      15:46:33.45 Err :    :  3936: Err  994: TT0994: Data store connection terminated. Please reconnect. -- file "heap.c", lineno 9802, procedure "sbLCBHpCoalesce"

      15:46:33.45 Err :    :  3936: Err  4053: TT4053: Internal error: Assertion failed: (!(((sbHBHdr_p) bufp)->flags & (((sbHBFlag_t) 0x1)|((sbHBFlag_t) 0x2)))) [heap.c:/st_timesten_11.2.2.8/7:sbLCBHpCoalesce:9802] PID 3936 (timestensubd1122) CONN 175 (Manager) 2016-11-21 15:46:32.437 -- file "util.c", lineno 707, procedure "sbUtAssertionReport"

      15:46:33.45 Err :    :  3936: Err  848: TT0848: Recovery failed on 1 set(s) of data store files; the TimesTen user error log has more information -- file "db.c", lineno 10535, procedure "sbDbConnect"

      15:46:33.45 Err :    :  2108: TT14000: TimesTen daemon internal error: Could not send 'manage' request to subdaemon rc 400 err1 703 err2 994

      15:46:33.45 Err :    :  2108: TT14000: TimesTen daemon internal error: Could not manage data store c:\timesten\datastore\ttlogic2\data as required by policy.  Return code 1 Error 703 err2 994 message 'Failed in connect'

      15:46:33.99 Warn:    :  2108: 3936 ------------------: subdaemon process exited

       

       

      How can I restore TT?

       

      Thanks in advance,

      Andrey

        • 1. Re: TT recovery failed
          ChrisJenkins-Oracle

          Transaction log files contain log records (redo and undo) that record changes to persistent data to allow for rollback (undo), recovery, replication, XLA, AWT caching etc. If a lot of files were being produced then this would be because of heavy write activity on the TimesTen database due to application activity, cache auto refresh activity, replication etc. Transaction log files are automatically purged by the checkpoint mechanism when they are no longer required. In some circumstances transaction log files may accumulate. Reasons for this might include:

           

          1.   Incorrect checkpointing parameter configuration.

           

          2.   Applications issuing very large transactions (for example updating hundreds of thousands or millions of rows in a single transaction).

           

          3.   Incorrect application logic that results in the application not issuing a commit call, leaving a 'long running' transaction open.

           

          4.   Incorrect usage of the XLA or JMS/XLA APIs.

           

          5.   Incorrect operation of the replication or AWT cache features.

           

          You can determine what is holding onto transaction log files using the ttLogHolds built-in procedure.

           

          Your recovery problem is due to an assertion occurring during recovery. This may be due to a bug or it may be due to invalid data in a checkpoint or log file on disk. If you have a recent backup of your database you will be able to recover by restoring the database from that backup (of course you will lose any changes made since the backup was created). Alternatively you can log an SR with support and they may be able to help you recover your database.

           

          Chris

          • 2. Re: TT recovery failed
            user5106620

            Thanks for the reply,

             

            I have several cache groups, and one of them caches a relatively big table from the main database. Is it possible that a large transaction on the main database produces a lot of log files on the TimesTen side (through cache updates)?

             

            Andrey

            • 3. Re: TT recovery failed
              ChrisJenkins-Oracle

              Hi Andrey,

               

              Large updates on the Oracle database, especially if they are done as very large individual transactions, could certainly be a factor. You need to check/tune various things:

               

              -   Cache group auto refresh interval

              -   Size of transactions against Oracle database

              -   Checkpointing parameters in TimesTen

               

              and of course you need to make sure there is enough disk space for the largest possible amount of transaction log that will ever be generated in TimesTen, plus some extra for headroom.
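              As a rough illustration of that sizing rule, the following Python sketch multiplies a peak log generation rate by the longest time a log hold might stay in place, then adds headroom. The function name, its parameters, and the sample numbers are all hypothetical; the real figures have to come from measuring your own workload.

```python
def log_space_needed_mb(peak_log_mb_per_s, worst_case_hold_s, headroom=0.25):
    """Estimate the disk space (in MB) to reserve for transaction log files.

    peak_log_mb_per_s : assumed peak rate at which log is generated (MB/s)
    worst_case_hold_s : assumed longest time log files may stay pinned,
                        e.g. while propagation or a checkpoint is stalled (s)
    headroom          : extra fraction added on top of the raw estimate
    """
    if peak_log_mb_per_s < 0 or worst_case_hold_s < 0 or headroom < 0:
        raise ValueError("all inputs must be non-negative")
    return peak_log_mb_per_s * worst_case_hold_s * (1.0 + headroom)


# Hypothetical workload: 5 MB/s of log, holds lasting up to 2 hours.
estimate = log_space_needed_mb(5, 2 * 3600)
print(f"Reserve about {estimate:.0f} MB for transaction logs")
```

              The estimate is only as good as the two inputs, so it is worth revisiting after any change to the refresh interval, transaction sizes, or checkpoint settings.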

               

              Having said all of that, TimesTen should still recover correctly after such a failure so I would encourage you to contact support for them to diagnose the recovery issue.

               

              Chris

              • 4. Re: TT recovery failed
                user5106620

                Hi Chris,

                 

                Is it possible to manually purge some .log files (using some TT tool), and if so, how?

                I have a situation where the number of files keeps growing...

                 

                call ttLogHolds;

                < 698, 516167944, Replication                   , ROUTER3:_ORACLE:1 >

                < 711, 28424192, Checkpoint                    , data.ds1 >

                < 711, 40456192, Checkpoint                    , data.ds0 >

                < 711, 40669448, Replication                   , ROUTER3:_ORACLE:2 >

                < 711, 40675592, Replication                   , ROUTER3:_ORACLE:0 >

                < 711, 40675592, Replication                   , ROUTER3:_ORACLE:3 >

                 

                I have 14 files: 698, 699, ..., 711.

                Something is wrong.

                 

                Andrey

                • 5. Re: TT recovery failed
                  ChrisJenkins-Oracle

                  No, it isn't. Those log files are needed and if you were to purge them then something would break. One can see from the ttLogHolds output that you are using AWT cache groups and one of the tracks is way behind the others. I suggest that you examine the AWTERRS file in the directory with the checkpoint files to see why this may be. Possibly you have some massive transaction, or some repeated error condition that is blocking/slowing AWT propagation. Once the problem is rectified and AWT propagation can progress again then the files will be safely purged by checkpointing. In extremis you could drop the AWT cache group to clear the log hold but of course you would then lose any changes that had not yet propagated to Oracle.
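                  To make the "way behind" observation concrete, here is a minimal Python sketch that parses the ttLogHolds output posted above and reports which hold pins the oldest log file. It assumes the four columns are log file number, offset within that file, hold type, and description. The helper names (parse_log_holds, oldest_hold, files_pinned) are made up for this illustration and are not part of TimesTen.

```python
import re
from typing import NamedTuple


class LogHold(NamedTuple):
    lfn: int            # log file number where the hold starts (assumed meaning)
    lfo: int            # byte offset inside that log file (assumed meaning)
    hold_type: str      # e.g. "Checkpoint" or "Replication"
    description: str    # e.g. checkpoint file name or replication track


# One row looks like: < 698, 516167944, Replication   , ROUTER3:_ORACLE:1 >
ROW = re.compile(r"<\s*(\d+)\s*,\s*(\d+)\s*,\s*([^,]+?)\s*,\s*(.+?)\s*>")


def parse_log_holds(text):
    """Turn ttIsql-style ttLogHolds output into a list of LogHold rows."""
    return [LogHold(int(m[1]), int(m[2]), m[3], m[4]) for m in ROW.finditer(text)]


def oldest_hold(holds):
    """Return the hold that points furthest back in the log."""
    return min(holds, key=lambda h: (h.lfn, h.lfo))


def files_pinned(holds):
    """Count log files between the oldest and the newest hold, inclusive."""
    return max(h.lfn for h in holds) - oldest_hold(holds).lfn + 1


# Output copied from the post above.
SAMPLE = """
< 698, 516167944, Replication                   , ROUTER3:_ORACLE:1 >
< 711, 28424192, Checkpoint                    , data.ds1 >
< 711, 40456192, Checkpoint                    , data.ds0 >
< 711, 40669448, Replication                   , ROUTER3:_ORACLE:2 >
< 711, 40675592, Replication                   , ROUTER3:_ORACLE:0 >
< 711, 40675592, Replication                   , ROUTER3:_ORACLE:3 >
"""

holds = parse_log_holds(SAMPLE)
lagging = oldest_hold(holds)
print(f"Oldest hold: {lagging.hold_type} {lagging.description} at file {lagging.lfn}")
print(f"Log files pinned: {files_pinned(holds)}")
```

                  For the posted output this singles out ROUTER3:_ORACLE:1 at file 698 and counts 14 pinned files, which matches the 14 files (698 to 711) reported on disk.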

                   

                  Chris