24 Replies Latest reply: May 12, 2014 1:05 AM by vladodias

    Transaction and Message handling in High available Active-Active environment

    Sridhar-SOA

      Hi,

       

      The following is our proposed system setup:

      2 nodes (node1 and node2), one shared storage (SAN1) mounting the admin server and soa_cluster related files (as mentioned in the EDG), an Oracle RAC database, and soa_server1 and soa_server2 on node1 and node2 respectively. soa_cluster spans soa_server1 and soa_server2 in Active-Active mode (just like the EDG setup).

       

      Questions :

       

      1. Let's say I deploy a BPEL service (service A) to the soa_cluster. A client calls service A, and an instance (instance 1) is running on soa_server1 (allocated by the cluster based on the load-balancing algorithm).

         soa_server1 crashes. What happens to instance 1?

      a. If whole server migration and service migration are not set up.

      b. If whole server migration is set up.

      Does the transaction continue from where it left off, does it roll back and then restart, or something else?

       

      2. What would the client of the service see in both of the above cases?

       

      3. What if service A is a singleton service? Or what happens if service A consumes a singleton DB Adapter? (Let's say it had read a row from a DB table and was processing it at the time the server crashed.)

       

      4. What happens if Service A consumes a message from a Topic and is processing the message by the time soa_server1 crashes?

       

      If you have further questions, such as what the state of the process was when the server crashed (i.e., whether dehydration had happened or not), please consider both cases and say whether the answer depends on that.

       

      Thanks and regards,

      Sridhar.

        • 1. Re: Transaction and Message handling in High available Active-Active environment
          vladodias

          Hi Sridhar,

           

          As a generic answer you can assume that if the node where the instance is running crashes, the instance will crash. How it will recover, will depend on the way you configured high availability.

           

          If managed servers fail, Node Manager tries to restart them locally. If the whole server migration is configured and repeated restarts fail, the WebLogic Server infrastructure performs server migration to the other node in the cluster, if it is configured. After the server on the other node restarts, Oracle HTTP Server resumes routing any incoming requests to it. The migrated server reads the SOA database, resumes any pending processing, and resumes transactions from the transaction logs in shared storage.
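
          The restart-then-migrate sequence described above can be sketched as a small toy model in Python. This is only an illustration of the decision order, not WebLogic or Node Manager API: the names `ManagedServer`, `recover`, `node_is_healthy` and `max_local_restarts` are hypothetical, and the real restart count and migration targets come from the server and cluster configuration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ManagedServer:
    name: str
    node: str
    running: bool = False


def recover(server: ManagedServer,
            node_is_healthy: Callable[[str], bool],
            cluster_nodes: List[str],
            max_local_restarts: int = 2,      # hypothetical value, for illustration
            migration_enabled: bool = True) -> List[str]:
    """Return the ordered list of actions taken to bring the server back."""
    actions: List[str] = []

    # Step 1: local restarts on the node the server was running on.
    for attempt in range(1, max_local_restarts + 1):
        actions.append(f"restart {server.name} on {server.node} (attempt {attempt})")
        if node_is_healthy(server.node):
            server.running = True
            return actions

    # Step 2: local restarts failed; migrate only if migration is configured.
    server.running = False
    if not migration_enabled:
        actions.append(f"{server.name} stays down until {server.node} is repaired")
        return actions

    for node in cluster_nodes:
        if node != server.node and node_is_healthy(node):
            actions.append(f"migrate {server.name} from {server.node} to {node}")
            server.node = node
            server.running = True
            return actions

    actions.append(f"no healthy node available for {server.name}")
    return actions


if __name__ == "__main__":
    healthy = {"node1": False, "node2": True}   # node1 is gone for good
    soa1 = ManagedServer("soa_server1", "node1")
    for step in recover(soa1, lambda n: healthy[n], ["node1", "node2"]):
        print(step)
```

          With `migration_enabled=False` the same call leaves soa_server1 down, which corresponds to case (a) in the original question.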

           

          In addition, for the handling of failures in the BPEL Engine itself with the WebLogic Server infrastructure, you can define and perform fault recovery actions on BPEL process faults identified as recoverable in Oracle Enterprise Manager Fusion Middleware Control. The recovery actions you perform on faults are based on actions you defined in your fault recovery policy files for BPEL process service components.

           

          Similar to other adapters, an Oracle Database Adapter can also be configured for singleton behavior within an active-passive setup. The Oracle Database Adapter also supports the high availability feature when there is a database failure or restart. The DB adapter picks up again without any message loss.

           

          Check this...

          http://docs.oracle.com/cd/E28280_01/core.1111/e10106/ha_soa.htm

           

          Cheers,

          Vlad

          • 2. Re: Transaction and Message handling in High available Active-Active environment
            Sridhar-SOA
            if the node where the instance is running crashes, the instance will crash. How it will recover, will depend on the way you configured high availability.

            Vlad, thanks for your answer. It helps me understand the process to an extent.

            So does the instance crash if we don't configure server migration? In that case, can you explain the configuration that enables it to recover on the other (healthy) managed server? I assume that by "recover" you mean the old transaction rolls back and a new transaction starts for the same client request.

            • 3. Re: Transaction and Message handling in High available Active-Active environment
              vladodias

              The instance will crash if the node where it is running crashes, whether server migration is configured or not. If the node is able to restart, it will try and resume the instance if possible.

               

              If the node is done for good and you have migration configured, the transaction can potentially be resumed on another node. If you're not able to migrate, the transaction is likely to be rolled back and will have to be restarted from the client side.

               

              There are so many dependencies here... Is the process synchronous or asynchronous? What was the status of the instance when it stopped? What type of activity was it executing when it stopped?

               

              When I say recover, it can mean either starting again from the client side OR resuming, i.e., continuing from where it stopped.

               

              You've got a general idea, but the only way to know for sure is testing it, mate...

               

              Cheers,

              Vlad

              • 4. Re: Transaction and Message handling in High available Active-Active environment
                Sridhar-SOA

                Vlad,

                Thanks very much for the response. I have been trying to test this and have been unsuccessful.

                You said -

                "The instance will crash if the node where it is running crashes, whether server migration is configured or not. If the node is able to restart, it will try and resume the instance if possible."

                 

                The way it resumes an instance that crashed midway is by picking up its entry in the tlogs.

                However, not all crashed instances make their way into the tlogs.

                E.g.: 1. BPEL instance started 2. dehydrated 3. server is brought down before the instance completes 4. server is brought back 5. Nothing happened.

                 

                I expected this instance to be resumed when the server was brought back. The possible reasons the instance was not resumed are (these are my assumptions; let me know if they are right):

                a. It was rolled back.

                b. Only entries in the tlogs are recovered and resumed, and this instance never made it into the tlogs.

                 

                After some reading, I understand that tlog entries are made only when a global transaction is involved and it completes heuristically.

                 

                My questions :

                1. Is server migration only helpful for global transactions that completed heuristically, and not for a normal transaction (as in the example above) that crashed midway?

                2. Why don't I get an entry in the tlogs for this scenario?

                E.g.: 1. instance started 2. database 1 insert 3. nested for loop (that takes 30 secs) 4. database 2 insert 5. instance completed.

                The server was shut down at step 3 above. Tlog entries are still not made. Why?

                What are some cases where I can get these entries? Please help.

                • 5. Re: Transaction and Message handling in High available Active-Active environment
                  vladodias

                  Hi Sridhar,

                   

                  Please check two things:

                   

                  See if you can find the instances on BPEL Recovery Console in EM... You should be able to see the faults there and potentially recover them (manually) ...

                  http://docs.oracle.com/cd/E28280_01/admin.1111/e10226/bp_mang.htm#BABCEIBC

                  and

                  http://docs.oracle.com/cd/E28280_01/admin.1111/e10226/soacompapp_mon.htm#CJHCDCDC

                   

                  Then check, if the auto-recovery is enabled... pay special attention to StartupScheduleConfig...

                  http://docs.oracle.com/cd/E28280_01/admin.1111/e10226/bp_config.htm#CEGIGCDE

                   

                  Cheers,

                  Vlad

                  • 6. Re: Transaction and Message handling in High available Active-Active environment
                    Sridhar-SOA

                    Vlad, recovery is indeed possible manually or automatically using the config that you mentioned above.

                    But my point of interest is failover and server migration. More specifically, I want to understand the need for server migration in a practical way, i.e., how JTA/JMS are not highly available, how server migration is the only way to resolve this, and why failover cannot solve it, using practical SOA examples.

                    Any help is deeply appreciated.

                    • 7. Re: Transaction and Message handling in High available Active-Active environment
                      vladodias

                      I'm probably missing the point of your question, but I think we need to clarify some concepts... High availability: from an external perspective the service is always available, even when part of the infrastructure is unavailable. Failover and server migration are technologies used to provide high availability.

                      http://docs.oracle.com/cd/E28280_01/core.1111/e10106/intro.htm#BABDGDJI


                      > E.g.: 1. BPEL instance started 2. dehydrated 3. server is brought down before the instance completes 4. server is brought back 5. Nothing happened.

                      > I expected this instance to be resumed when the server is brought back

                      I believe the reason it was not resumed is because it was not configured to do so... But this example doesn't have much to do with high availability...

                       

                      > need for server migration in a practical way

                      If managed servers fail, Node Manager tries to restart them locally. If the whole server migration is configured and repeated restarts fail, the WebLogic Server infrastructure performs server migration to the other node in the cluster.

                      Thus, server migration is for cases when a specific node is gone for good and can NOT be restarted, so the server is migrated to another node...

                      http://docs.oracle.com/cd/E28280_01/core.1111/e10106/ha_soa.htm#CHDJGBAA


                      > how jta/jms are not highly available

                      Why do you believe they can't be highly available?

                      • 8. Re: Transaction and Message handling in High available Active-Active environment
                        Sridhar-SOA

                        The JTA Recovery Service and JMS are called pinned services: they are pinned to only one server in the cluster. They are singleton services that cannot be replicated across the cluster. Hence, unlike other services, they are not highly available through clustering alone. The only way to make these services (JMS and the JTA Recovery Service) highly available is whole server migration (WSM), which is the main purpose of introducing the server migration concept.

                        Refer to 2 High Availability Concepts (12g Release 1 (12.1.2)), Section 2.7.1, to understand more.

                        So all I am trying to understand, using examples, is what it means for the JTA recovery service of a server to be unavailable when that server goes down.

                         

                        Regarding the example above (BPEL getting dehydrated, waiting in a loop, and the server being shut down): it is more about understanding how the instance behaves when the server starts up (without any recovery settings). Is this kind of instance stored in the tlog as an in-flight transaction? (I have understood that only global transactions involving more than one resource make an entry in the tlogs.) Only when there is a pending transaction in the tlogs does the server recover it automatically (without any recovery config setup). This behaves the same way with WSM.

                         

                         

                        On the other hand, recoveryConfig (auto or manual) is a completely different thing, which I don't want to get into in this context.

                         

                        In short, I want to understand the "purpose" of enabling whole server migration through practical examples. In which cases do we need to propose this concept to clients? Not every client may need it, etc.

                        • 9. Re: Transaction and Message handling in High available Active-Active environment
                          AbhishekJ

                          Pinned services run on a single managed server in singleton mode. These services actually serve the whole cluster but can run on only one managed server at any time. When that managed server goes down, these services will fail, causing issues across the cluster. This is why migratable servers are set up so that these pinned services automatically migrate to another running managed server when the original server goes down.
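
                          The consequence of a pinned recovery service can be illustrated with a toy Python model: each server owns its own transaction log, and the in-doubt transactions recorded there can only be resolved while that server is running somewhere. The class and function names here are hypothetical and this is not WebLogic API; it is only a sketch of the ownership rule described above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Server:
    name: str
    node: Optional[str]                              # None = not running anywhere
    tlog: List[str] = field(default_factory=list)    # in-doubt transaction ids


def recoverable(servers: Dict[str, Server]) -> List[str]:
    """Transactions whose owning server is currently running on some node."""
    return sorted(tx for s in servers.values() if s.node is not None for tx in s.tlog)


def stuck(servers: Dict[str, Server]) -> List[str]:
    """Transactions whose owning server is down; nobody else may resolve them."""
    return sorted(tx for s in servers.values() if s.node is None for tx in s.tlog)


if __name__ == "__main__":
    cluster = {
        "WLS_SOA1": Server("WLS_SOA1", None, ["tx-1", "tx-3"]),   # crashed
        "WLS_SOA2": Server("WLS_SOA2", "node2", ["tx-2"]),        # healthy
    }
    print("stuck before migration:", stuck(cluster))

    # Whole server migration: WLS_SOA1 is restarted on node2 and reads its
    # own tlog from shared storage, so its transactions become recoverable.
    cluster["WLS_SOA1"].node = "node2"
    print("recoverable after migration:", recoverable(cluster))
```

                          In this model a healthy WLS_SOA2 never touches the log of WLS_SOA1; the log only becomes usable again once WLS_SOA1 itself is running, on its original node or a migrated one.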

                          • 11. Re: Transaction and Message handling in High available Active-Active environment
                            Sridhar-SOA

                            Vlad,

                            Thanks for the link. Do you have an example for JTA recovery service migration? The above link is for JMS service migration.

                            And by the way, service migration is not supported for Oracle SOA and OSB deployed on WebLogic due to some bugs. A note from Oracle is below:

                            "For service level migration support on SOA, there is a enhancement requests (SOA 11g: BUG:13447082) raised and for OSB (OSB 11g: BUG:13446665) as well. There are under development team review and based on the feasibility, this feature will be included in future releases."

                            • 12. Re: Transaction and Message handling in High available Active-Active environment
                              Sridhar-SOA

                              I did the following and am surprised by the results.

                               

                              1. Created a BPEL service that consumes a message from DemoOutQueue and produces it into DemoInQueue. I introduced a delay of roughly 30 seconds before the message is placed into DemoInQueue. Note: I used nested for loops to introduce the delay, as that would not dehydrate the instance. A wait() activity would have dehydrated the instance, which I didn't want. The for loop runs for approximately 30 seconds (which I know based on multiple runs).

                              2. The BPEL process is deployed to the SOA_Cluster. Both servers in my SOA_Cluster, WLS_SOA1 and WLS_SOA2, are up and running.

                              3. Placed 5 messages (with ids 1,2,3,4,5) onto DemoOutQueue.

                              4. After 10 seconds I killed WLS_SOA1.

                              Note: Both DemoOutQueue and DemoInQueue are uniform distributed queues. All are XA enabled.

                               

                              I expected the following:

                              a. Messages placed on the DemoOutQueue of WLS_SOA2 would be processed without any problems. Those messages would arrive in DemoInQueue.

                              Actual outcome: as expected. (Messages 2,4 are in the DemoInQueue of WLS_SOA2.)

                              b. Messages placed on the DemoOutQueue of WLS_SOA1 would NOT be processed, as the server was shut down midway (i.e., after the BPEL process picked them up from DemoOutQueue but before it placed them in DemoInQueue). Those messages would NOT arrive in DemoInQueue.

                              Actual outcome: as expected.

                              ===so far so good===

                              c. WLS_SOA1 is still shut down. Messages 1,3,5 went to the recovery console as undelivered, as expected. I then tried to recover them from the BPEL recovery console of WLS_SOA2. All of them recovered successfully.

                              Expected: None of them would be recovered, as those messages were initially meant for the WLS_SOA1 server. JMS and the JTA recovery service are "pinned" services, and messages/transactions that are in flight on one server cannot be processed by the other.

                               

                              Question: How could messages 1,3,5 be processed successfully by the WLS_SOA2 recovery service, given the understanding that JTA and JMS can't be recovered by the other server, which is why whole server migration is recommended?

                               

                              Help deeply appreciated !!

                              • 13. Re: Transaction and Message handling in High available Active-Active environment
                                vladodias

                                Interesting... I guess if you had killed WLS_SOA1 before the messages were picked up, the results would have been similar to what you expected...

                                 

                                Also, I think it was able to recover because you are using async.persist as the delivery policy on your BPEL...

                                 

                                async.persist (Default)

                                Delivery messages are persisted in the database. With this setting, reliability is obtained with some performance impact on the database. In some cases, overall system performance can be impacted.

                                http://docs.oracle.com/cd/E28280_01/core.1111/e10108/bpel.htm#r2c1-t6

                                 

                                So, once persisted in database the messages are not pinned any more and can be recovered... Well done SOA Engine...

                                 

                                Try repeating the test with these modifications:

                                 

                                1. Insert a delay on message delivery, so the messages stay on the queue for longer... The JMSDeliveryTime header should do... I believe those messages will only be recovered by server migration...

                                http://docs.oracle.com/cd/E28280_01/web.1111/e13727/fund.htm#i1024007

                                 

                                2. Change the delivery policy on your BPEL to sync... I believe this will cause the transaction to be rolled back... Looking forward to seeing the results of trying to recover...

                                 

                                3. Change the delivery policy on your BPEL to async.cache... Possibly this will be the worst case, where messages may be lost even after server migration...
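
                                The expectations for the three delivery policies can be written down as a small Python lookup. It encodes only what is hypothesised in this post (where the message is held while the instance runs, and what a crash is expected to do to it); it is not verified product behaviour, and the names `Policy`, `message_location` and `expected_after_crash` are made up for the sketch.

```python
from enum import Enum


class Policy(Enum):
    ASYNC_PERSIST = "async.persist"   # default
    ASYNC_CACHE = "async.cache"
    SYNC = "sync"


def message_location(policy: Policy) -> str:
    """Where the incoming message is held while the instance is processing."""
    return {
        Policy.ASYNC_PERSIST: "database",
        Policy.ASYNC_CACHE: "server memory",
        Policy.SYNC: "caller's transaction",
    }[policy]


def expected_after_crash(policy: Policy) -> str:
    """Expected fate of the message if the processing server is killed."""
    location = message_location(policy)
    if location == "database":
        # Persisted state is not pinned to the crashed server.
        return "recoverable from another cluster member"
    if location == "server memory":
        # Nothing was written anywhere that survives the JVM.
        return "lost with the server"
    # No separate delivery record: the work belongs to the caller's transaction.
    return "rolled back to the caller"


if __name__ == "__main__":
    for p in Policy:
        print(f"{p.value:14} -> {message_location(p):21} -> {expected_after_crash(p)}")
```

                                For a transactional JMS consumer, "rolled back to the caller" would mean the message goes back to the source queue, so it could be redelivered rather than lost; whether that happens is exactly what the test is meant to show.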

                                 

                                Cheers,

                                Vlad

                                • 14. Re: Transaction and Message handling in High available Active-Active environment
                                  Sridhar-SOA

                                  Vlad,

                                  Thanks for your time. Below are the results for a couple of cases.

                                   

                                   

                                  Case 1 : sync

                                   

                                   

                                  1. Changed the BPEL transaction property to 'sync' and published messages (1,2) to DemoOutQueue.

                                  2. Message 1 went to WLS_SOA1 and message 2 went to WLS_SOA2. While the BPEL process was processing the messages (in the for loop), I killed WLS_SOA2.

                                  3. Message 2 went to state 10 (non-recoverable) in cube_instance. Message 1 was processed successfully by WLS_SOA1, as it was up.

                                   

                                   

                                  Now something unexpected happened again: message 2 was processed by WLS_SOA1 automatically and completed (i.e., it was placed successfully into DemoInQueue).

                                   

                                   

                                  Case 2 : async.cache

                                   

                                   

                                  1. Changed the BPEL transaction property to 'async.cache' and published messages (1,2) to DemoOutQueue.

                                  2. Message 1 went to WLS_SOA1 and message 2 went to WLS_SOA2. While the BPEL process was processing the messages (in the for loop), I killed WLS_SOA2.

                                  3. Message 2 went to state 10 (non-recoverable) in cube_instance. Message 1 was processed successfully by WLS_SOA1, as it was up.

                                  Nothing else happened! It seems the new transaction just started by the BPEL process, which was persisted only in memory, was rolled back, and there is no other trace of it anywhere.

                                  4. Brought WLS_SOA2 back up. Nothing happened (i.e., the message or transaction was not picked up; this seems to be expected).

                                   

                                   

                                   

                                   

                                  Case 3: introduce a delivery time (i.e., a delay after the message is placed before consumers can see it). Unfortunately I could not do this, as my message producer is also a BPEL process and BPEL doesn't provide assignment to this header property yet.

                                  Vlad, were you asking me to shut down the server before the delivery time? And why do you believe that this kind of message (transaction) will always have to be recovered by the same server (or through whole server migration)?

                                   

                                  Still struggling to understand the need for server migration in practice :-(
