This two-part article by Oracle ACE Director Antonis Antoniou compares the fault handling options in 11g and 12c and explores the new error handling and recovery features introduced in Oracle BPM 12c from both a developer’s angle (Part 1) and an administrator’s perspective (Part 2).
- Read Part 1
- Read Part 2
In Oracle BPM 12c Advanced Error Handling and Recovery - Part 1 we explored three new developer-centric error handling and recovery features in Oracle Business Process Management (BPM) 12c:
- Force Commit After Execution: A declarative feature that configures activities, events and gateways to explicitly force the BPM runtime to add a checkpoint in the dehydration store. This checkpoint commits the state of your BPM instance after an activity is executed to avoid re-executing non-idempotent activities on transaction rollback.
- Skip and Back Error Recovery: Another declarative feature for choosing whether to re-execute a faulted flow object or just skip it and move to the next flow object as defined in your process flow.
- Fault Policy Editor: A new graphical editor for creating fault policies.
These developer features are essential, but operations also play an integral part in handling what isn’t handled by the application. In Part 2 of this article, we will put on our administrator’s hat and unveil the important error recovery changes and improvements that Oracle made from an operations and management perspective.
Oracle Enterprise Manager (OEM) Fusion Middleware Control, the main tool for administering SOA Suite, has undergone a number of changes and improvements to help administrators troubleshoot and resolve issues. The SOA Dashboard, to start with, has been completely redesigned to highlight key areas and exceptions so that they can be easily spotted and accessed. It focuses on system health and provides a consolidated view of faults and issues. See Figure 1, below:
Figure 1 – SOA Dashboard
The new SOA Dashboard provides seven key views:
- Key Configuration: Shows important system configuration settings for the SOA profile, instance tracking, default query duration and auto-purge state (see Figure 2, below). Auto-purge is an entirely new feature in 12c and is enabled by default. It schedules automatic purging of the SOA/BPM database using the Enterprise Scheduler Service (ESS), another new component in 12c, which can schedule and run different job types to optimize the runtime environment.
ESS is a powerful scheduler service bundled with Oracle SOA Suite. Its many built-in jobs can be used with various SOA components and can be tied to adapters, such as polling adapters, to activate and deactivate them on a schedule. As we will see in the Fault Alerts section below, ESS is also used by error notification rules and by the Error Hospital to perform scheduled, throttled bulk recoveries.
Note: The auto-purge feature is not available with the Java Database included with the SOA/BPM Developer Install option. Instead, you can use the truncate_soa_javadb.sql script to purge the database.
All configuration settings have a link for more information, and you can change their parameters as appropriate. The default query duration is the time window applied across the various query features, such as System Backlogs and Business Transaction Faults, when instances and faults are retrieved.
Figure 2 - Key Configuration
- SOA Runtime Health: Provides a quick overview of the overall health of the SOA cluster or single-node infrastructure.
- System Backlogs: Shows the number of messages in the queues for various message types (e.g., BPEL Invoke, BPEL Callback, Mediator Parallel Routing and EDN WLS JMS). See Figure 3, below.
Note: By default, no data is populated. This on-request loading is new and improves the responsiveness of Enterprise Manager: data coming from the database is populated only when you click the refresh button.
Figure 3 - System Backlogs
- Business Transaction Faults: Customize the default time period during which to retrieve non-recoverable faults (graphically displayed using a red bar graph), faults requiring recovery (yellow bar graph), recovered faults (green bar graph) and automatically retried system faults (grey bar graph) from the entire SOA Infrastructure or from a specific partition. See Figure 4, below.
Figure 4 - Business Transaction Faults
Data is populated on request when you click on the refresh icon. You can click on the graph to drill down to a specific fault category, where you will be redirected to the Error Hospital. The display filter will be set automatically based on your selection and you can perform bulk recovery and bulk abort.
- Search: Quickly search for instances and bulk recovery jobs using either one of the default searches (e.g., search for all instances or for all faulted instances) or one of your custom saved searches. These searches take you directly to the Flow Instances tab to display the results of your search.
- Composites and Adapters Availability: Get a quick health check on all your composites, endpoints and adapters, filtered by server. Composites that did not start, adapters with connectivity errors and scheduled downtimes for composites and endpoints are clearly marked; you can then expand to debug and hover over the error timestamp to display the fault details. See Figure 6, below:
Figure 6 - Composites and Adapters Availability
- Fault Alerts: New in 12c, you can now view notification alerts for either the entire SOA infrastructure or the individual partitions on which you have permissions (see Figure 7, below). To monitor your Service Level Agreements (SLAs), you can create notification rules based on a variety of criteria (e.g., the composite you want to monitor, the number of errors that occurred during a specified time period, a specific fault name, etc.). Leveraging ESS, you can publish alerts to the SOA dashboard and/or other publication channels (such as email, IM and SMS).
If you surface the alert on the SOA dashboard, you can click the fault link that displays the number of faults to be redirected to the Error Hospital. There, you can perform various actions on grouped instances (e.g., bulk recovery and bulk abort). For further detail, see Bulk Recovery and Bulk Abort, later in this document.
Figure 7 - Fault Alerts
Let’s focus on this new Fault Alert feature and see how you can easily set up your processes for notification alerts.
Each notification rule requires a schedule definition that configures how often the rule is executed. To define one, use the Job Requests > Define Schedules menu item from the Scheduling Service menu (see Figure 8, below):
Figure 8 - Define Schedules
Create a new schedule by using the Create button on the Schedules screen. On the Create Schedule page (see Figure 9, below), specify a name for your schedule, a display name and package; make sure to specify the package of /oracle/apps/ess/custom/soa, or the schedule will not be accessible in the Create or Edit Notification Rule page. Choose a frequency for your schedule from such options as Once, Hourly/Minute, Daily, Weekly, Monthly, Yearly or Custom (to manually add scheduled times).
Note: If you choose any option other than Custom you will also have to specify a start date and an optional end date.
Figure 9 - Create Schedule
Lastly, create the actual notification rule by using the Error Notification Rules option from either the soa-infra context menu or the SOA Infrastructure menu if you are on any of the SOA infrastructure pages. See Figure 10, below:
Figure 10 - Error Notification Rules
From the Error Notification Rules page, click Create to create your notification rule. Specify a name for your notification and, from the Schedule drop-down, select the schedule that you created. (Notice how the page is automatically refreshed to display the schedule description and frequency details.) Use the IF-THEN table to define the fault notification rule conditions.
In Figure 11, below, the notification rule uses two conditions: the default fault occurrence time in hours (by default set to 48) and an additional condition to filter only faults coming from the OrderProcess composite. You can also filter faults using the Fault Name, Fault Code, Fault Type, HTTP Host and other filter criteria. In the example below, I’ve chosen to display the faults on the SOA dashboard and also send myself an email notification (which will leverage the User Messaging Service to send the notification).
Figure 11 - Create Error Notification Rule
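The two conditions in this rule can be pictured as a simple filter over fault records. The following Python sketch is only an illustrative model of the IF part of such a rule, not Oracle's implementation or API: the `Fault` record, its field names and the `matches_rule` and `faults_to_notify` helpers are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Fault:
    """Hypothetical fault record; the field names are illustrative only."""
    composite: str
    fault_name: str
    occurred_at: datetime


def matches_rule(fault, now, window_hours=48, composite="OrderProcess"):
    """IF part: the fault occurred within the last `window_hours`
    AND originates from the given composite."""
    age = now - fault.occurred_at
    recent = timedelta(0) <= age <= timedelta(hours=window_hours)
    return recent and fault.composite == composite


def faults_to_notify(faults, now, **rule):
    """Return the faults for which the THEN actions (dashboard alert,
    email notification) would be triggered."""
    return [f for f in faults if matches_rule(f, now, **rule)]
```

Adding another filter criterion, such as a fault name, would simply mean adding another condition to `matches_rule` in this model.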
Once you have applied and saved your notification rule, go to the SOA dashboard and notice how the faults get surfaced under the Fault Alerts section (see Figure 7, above).
The Error Hospital is yet another entirely new view introduced in 12c; it provides a consolidated view of flow instances based on various criteria (e.g., fault name, fault code, HTTP port, etc.). From the Error Hospital you can perform group recovery or abort in a bulk operation—either immediately or at a later time, thus optimizing resources.
Let’s explore this new view in detail.
Figure 12 - Error Hospital
When you directly access the Error Hospital you should notice that, by default, no errors are displayed. This is the same on-request behavior applied in the SOA dashboard’s System Backlogs and Business Transaction Faults sections to improve the responsiveness of Enterprise Manager. Instead of Enterprise Manager deciding what information to present, you determine what is displayed by using smart filters and searches.
The overall instance search experience and performance have been overhauled, allowing administrators to run finer-grained queries over a wider spectrum of parameters, including sensor values. In fact, the underlying architecture has been enhanced to improve the performance, visibility and traceability of your end-to-end transactions, which now include support for OSB, B2B and MFT.
Clicking the magnifying glass button slides the new search palette into view, in context (see Figure 13, below).
Figure 13 - Instance Search
You can search for faulted instances using a variety of criteria, including time filters (when an instance was created and when a fault occurred), filters on your BPM composite (e.g., partition and composite name) and fault type filters (e.g., all faults or recoverable faults).
Furthermore, the 12c console uses ADF features such as saved and bookmarked searches to offer a more customizable and personalized experience. Saved searches appear in two places: under the green plus icon next to the reports filter title of the search panel (see Figure 14, below) and under the Search section on the SOA dashboard (see Figure 5, above). The bookmark feature, also available in the report filters toolbar (next to the Save button), lets you generate a bookmark URL for the selected search that you can then share. You will find the same search experience on the Flow Instances page.
Figure 14 - Saved Searches
Navigation and Consolidation of Instances
There are other ways of navigating to the Error Hospital, other than directly using the Error Hospital link tab. You can use one of the pre-defined search or custom search options from the Search section on the SOA dashboard (see Figure 5, above), you can search for instances (again using the SOA dashboard’s Search area), or you can click on one of the fault categories in the Business Transaction Faults section (see Figure 4, above). In any of these cases, you will be redirected to the Error Hospital; the selected filter will be applied, displaying instances that meet the criteria.
Figure 15, below, shows the consolidated instances requiring recovery, grouped by fault name, that you see when you navigate to the Error Hospital through the Recovery Required category of Business Transaction Faults:
Figure 15 - Instances Requiring Recovery
Faults are aggregated by one of the fault categories—name, code, type, composite, partition, owner, owner type, JNDI name or HTTP port. The default aggregation is Fault Name.
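Conceptually, this aggregation is a group-and-count over fault records. The sketch below models it in Python with made-up sample data; the dictionary keys, the sample composites and the `aggregate_faults` helper are assumptions for illustration and do not reflect how the Error Hospital is implemented.

```python
from collections import Counter


def aggregate_faults(faults, group_by="fault_name"):
    """Count faults per distinct value of the chosen category.

    The default grouping mirrors the Error Hospital default (fault name).
    """
    return Counter(fault[group_by] for fault in faults)


# Made-up sample data; keys and values are illustrative only.
sample_faults = [
    {"fault_name": "remoteFault", "composite": "OrderProcess", "partition": "default"},
    {"fault_name": "remoteFault", "composite": "Billing", "partition": "default"},
    {"fault_name": "bindingFault", "composite": "OrderProcess", "partition": "default"},
]

by_name = aggregate_faults(sample_faults)                    # default grouping
by_composite = aggregate_faults(sample_faults, "composite")  # alternative grouping
```

Each entry in the resulting counter corresponds to one row of the fault statistics table: a category value and the number of faults that share it.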
Note: The Error Hospital does not show individual faulted instances. To track individual faulted flows, use the search facility in the Flow Instances tab or click on a fault count in the fault statistics table of the Error Hospital. This will redirect you to Flow Instances, where individual instances are displayed based on the selected criteria (see Figure 16, below):
Figure 16 - Tracking Individual Fault Instances
Suspended Instances are Now Visible in EM
Another new feature in the 12c release is that suspended process instances are now visible as Suspended in the OEM flow trace—see Figure 17, below. This can be extremely useful when debugging suspended instances. Furthermore, the Suspended state is now available (under the State drop down of the Search facility) to filter on suspended process instances.
Figure 17 - Suspended Instances
Bulk Recovery and Bulk Abort
One of the greatest enhancements in the 12c release is the ability to perform bulk recovery and bulk abort on your error messages. Even more important, Oracle has tied the recovery to a scheduler (ESS) to enable recovery of error messages at scheduled off-peak times. From the Error Hospital you can use the built-in scheduler screens to create your recovery jobs and define throttling (e.g., recover 5 messages every minute) to control how resources will be used. You can monitor the progress of your scheduled recovery jobs from the Enterprise Scheduler Dashboard.
Clicking on Bulk Recovery will bring up the Recovery Request window, where you can change the default request name and specify a start and end date. Using Start Time you can either run the recovery now or defer it. Using the same window, you can define your throttling properties to avoid flooding the system by trying to recover all instances at the same time. For example, as shown in Figure 18, below, you can choose to recover error messages in batches of 2 every 1 minute.
Figure 18 - Recovery Request
Click Yes to create a batch recovery job in the ESS; you’ll get a link to the Enterprise Scheduler where you can monitor the execution of your recovery job—as demonstrated in Figure 19:
Figure 19 - Bulk Recovery Job Link
Click the link to navigate to the Enterprise Scheduler and you will see your job created and waiting to get started.
Figure 20 - Enterprise Scheduler Job
Based on the recovery job definition, the Enterprise Scheduler will attempt to recover the faulted messages. In my example, I tried to recover 6 faulted messages in batches of 2 every minute. If you have a look at Figure 21, below, you’ll notice that ESS created 3 requests at 1-minute intervals.
Figure 21 - Bulk Recovery Completed
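The arithmetic behind the throttling is straightforward: 6 messages in batches of 2 yield 3 requests, one per interval. The following Python sketch models that batching under stated assumptions; `plan_batches`, the message identifiers and the start time are hypothetical and are not part of ESS or the Error Hospital.

```python
from datetime import datetime, timedelta


def plan_batches(message_ids, batch_size, interval, start):
    """Split messages into batches and assign each batch a start time."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    plan = []
    for offset in range(0, len(message_ids), batch_size):
        batch_number = offset // batch_size
        run_at = start + batch_number * interval
        plan.append((run_at, message_ids[offset:offset + batch_size]))
    return plan


# Hypothetical input: six faulted messages, recovered 2 at a time, 1 minute apart.
start = datetime(2015, 1, 1, 22, 0)
plan = plan_batches(
    ["m1", "m2", "m3", "m4", "m5", "m6"],
    batch_size=2,
    interval=timedelta(minutes=1),
    start=start,
)
for run_at, batch in plan:
    print(run_at.time(), batch)
```

A smaller batch size or a longer interval spreads the same recovery work over more time, which is the trade-off you tune in the Recovery Request window.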
You can further drill down into each request to view the request properties and parameters, the execution trail and status, and the log and output files.
Figure 22 - Recovery Job Request Details
Message and Alarm Recovery Now Available in Flow Trace
Another new feature in 12c is the ability to recover an error message from within the flow trace. In 11g, to recover messages or timers that rolled back, you had to close the flow trace, navigate to the service engine’s home (either BPEL or BPMN Engine Home), go to the Recovery tab and refresh the alarm table.
In 12c, you can perform message and alarm recovery from the same place that you use to inspect and debug your faulted instance, the flow trace. The Error Message panel (which can be accessed by clicking either on the error message link or the recovery link from the faults table) now includes two new buttons, Retry and Abort, to recover or abort your faulted instance, respectively.
Figure 23 - Recovery from Flow Trace
The Oracle SOA Suite platform, on top of which the BPMN Engine runs, provides an automatic recovery feature, configurable in OEM Fusion Middleware Control, that recovers activities that did not complete as well as invoke and callback messages that are unresolved at server restart.
This feature is enabled by default for BPEL instances, but not for BPMN. To enable automatic recovery of BPM instances, go to the bpmn mBean by using the SOA Administration > BPMN Properties menu from the soa-infra home.
Figure 24 - BPMN Properties Menu
From the BPMN Service Engine Properties page, click the More BPMN Configuration Properties link. You’ll be redirected to the bpmn mBean. From there, click the RecoveryConfig attribute to go to the automatic recovery configuration settings.
Figure 25 - RecoveryConfig
To enable automatic recovery, expand the StartupScheduleConfig node and specify a positive integer value in the startupRecoveryDuration element. After a server restart, the server goes into a recovery period during which pending activities and undelivered invoke and callback messages are resubmitted for processing. The default value in BPMN is 0, which means no recovery period and thus no automatic recovery.
Define a startup period in seconds that will not be too long but will give enough time for your faulted instances to get recovered. The default BPEL startup recovery duration is 600 (i.e., ten minutes), which you can also use for BPMN automatic recovery.
Figure 26 - Startup Recovery Duration
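To make the semantics of the setting concrete, the sketch below models the recovery period as a time window that opens at server start. It is a simplified illustration only: `startup_recovery_window` is a hypothetical helper, and rejecting negative values is an assumption of this sketch rather than documented engine behavior.

```python
from datetime import datetime, timedelta


def startup_recovery_window(server_start, startup_recovery_duration=0):
    """Return (start, end) of the recovery period, or None when disabled.

    `startup_recovery_duration` is expressed in seconds; 0 (the BPMN
    default) means no recovery period and thus no automatic recovery.
    """
    if startup_recovery_duration < 0:
        # Assumption of this sketch: negative values are rejected outright.
        raise ValueError("startup_recovery_duration must not be negative")
    if startup_recovery_duration == 0:
        return None
    end = server_start + timedelta(seconds=startup_recovery_duration)
    return server_start, end


restart = datetime(2015, 1, 1, 6, 0)
bpmn_default = startup_recovery_window(restart)        # recovery disabled
bpel_style = startup_recovery_window(restart, 600)     # ten-minute window
```

With the BPEL default of 600 seconds, the window in this model closes ten minutes after the restart.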
Unfortunately, this feature cannot be selectively applied, and there is a high risk of resource contention that chokes your server. Take extra care when using automatic recovery with BPMN, and make sure that there are no side effects before you adopt this feature.
The Force Commit After Execution feature (see Part 1) enables better control in the application design to avoid re-invoking non-idempotent activities. However, any decision to enable automatic (or even batch) recovery must analyze the reason for failure and make certain that the selected recovery action is appropriate.
Error handling and recovery is vital in Business Process Management and in the software development lifecycle in general. Application developers and system administrators have diverse opinions and perceptions about error handling and recovery:
Administrators: “The system is down. The fault is on your side. Why didn’t you use an error handler in your process design?”
Developers: “I’m not supposed to take care of system faults. That’s your responsibility.”
Application users don’t care about the type of error or who’s responsible for fixing it. To them all errors look the same. And the result is the same. The application does not work!
Oracle BPM 12c bridges this gap by enriching and solidifying the error handling and recovery capabilities of both developers and administrators to tackle the various types of errors covered in this two-part article.
I hope I’ve given you enough information to combine these new development-centric and operation-centric features with the existing error handling and recovery arsenal to develop your error-free future-proof business processes.
- Recovering From Faults in the Error Hospital (from Oracle Fusion Middleware Administering Oracle SOA Suite and Oracle Business Process Management Suite)
About the Author
Oracle ACE Director Antonis Antoniou is a Technical Director with eProseed. He is a Fusion Middleware Expert specializing in Enterprise 2.0, Business Process Management, and Service Oriented Architecture, and has earned certifications in Oracle Application Grid, Oracle WebCenter Portal, Oracle WebCenter Content, ADF, Oracle BPM and Oracle SOA. Antonis has extensive experience as a developer, coach, trainer and architect, and has served as project lead on multiple complex Oracle Fusion Middleware projects across Europe and the Middle East, spanning various industries. Antonis is an avid technology evangelist and a regular speaker at various Oracle conferences and events.
Note: This article has been reviewed by the relevant Oracle product team and found to be in compliance with standards and practices for the use of Oracle products.