Imagine you encounter an issue with your database and need to collect the relevant trace and diagnostic files, perhaps even across several nodes. Sound familiar? If so, you probably also know how time-consuming and complex this task can be.
Did you know that there is a tool that can do this for you? It is called Trace File Analyzer Collector, or TFA for short, and it can be used with both RAC and non-RAC databases.
As the systems supporting today's IT environments become larger, more complex and more numerous, it can be difficult and time-consuming for those managing them to know which diagnostics to collect. In addition, systems might be distributed across different data centers. Hence, when an issue occurs, it may take several attempts and iterations before all relevant diagnostic files have been uploaded to Oracle Support for troubleshooting.
Before we go into the details of TFA, consider the traditional diagnostic lifecycle: an issue occurs, diagnostic files are collected and uploaded, Oracle Support asks for further files, and the cycle repeats until everything relevant has been provided.
Trace File Analyzer Collector (TFA) addresses exactly this issue. It provides a simple, efficient and thorough mechanism for collecting all relevant first failure diagnostic data in one pass, as well as useful interactive diagnostic analytics. TFA is started once and collects data from all nodes in the TFA configuration. As long as the approximate time of the problem is known, a single command collects all relevant files from the entire configuration based on that time. The collection includes Oracle Clusterware, Oracle Automatic Storage Management (ASM) and Oracle Database diagnostic data, as well as patch inventory listings and operating system metrics.
Business Drivers for Implementing TFA
Four key business drivers should be considered in a decision to implement Trace File Analyzer Collector.
- Reduced Costs
Any problem that leads, or may lead, to a service interruption costs money and ties up resources. The faster a failed service is restored, the sooner normal operation resumes and IT staff can get back to more productive activities. When diagnostic data has to be collected and uploaded over and over again, because it is unclear what data is needed or because of time pressure, the resolution cycle is lengthened. TFA can help shorten this part of the cycle.
- Reduced Complexity
Collecting the right data is crucial to efficient analysis. In moments of distress, collecting the relevant data is often replaced by collecting everything, including unnecessary diagnostic data. Analyzing unnecessary data means extra cycles just to determine what was provided. TFA helps eliminate those extra cycles by providing a standardized way of collecting the relevant diagnostic data across all Oracle environments, supporting nearly all configurations. TFA also comes with the Support Tools Bundle, which consists of tools often requested by Oracle Support for collecting additional diagnostics. A convenient command line interface is provided for invoking those tools, which eliminates the need for users to download and install them separately.
- Increased Quality of Service
TFA prunes files at collection time, before upload, so Oracle Support receives only the relevant data and can respond more quickly and more effectively. Because TFA transfers one file per host, root cause analysis is accelerated by avoiding repeated exchanges between Oracle Support and the customer.
- Improved Agility
Using Trace File Analyzer Collector means not having to worry about what data to collect. Once set up and configured, it can run either automatically, driven by events, or on demand, leaving IT personnel more time to focus on productive tasks. TFA enables first-level production support personnel to upload diagnostic data just as efficiently and precisely as senior support staff.
TFA Architecture and Configuration Basics
TFA is implemented as a Java Virtual Machine (JVM) that is installed and runs as a daemon on each host in the TFA configuration. Each TFA JVM communicates with its peer TFA JVMs in the configuration through secure sockets. A Berkeley DB (BDB) database on each host stores TFA metadata about the configuration and about the directories and files that TFA monitors on that host.
Supported versions (as of Sept. 2015) of Oracle Clusterware, ASM and Database are 10.2, 11.1, 11.2 and 12.1. TFA is shipped with, and installed in, any 184.108.40.206 and 220.127.116.11 Grid Infrastructure home. For any other version, TFA must be downloaded and installed separately following the instructions in My Oracle Support Note 1513912.1.
Supported platforms (as of Sept. 2015) are Linux, Solaris, AIX and HP-UX.
Supported Engineered Systems (as of Sept. 2015) are Exadata, Zero Data Loss Recovery Appliance (ZDLRA) and the Oracle Database Appliance (ODA). Support on engineered systems includes database and storage servers.
Use the following My Oracle Support (MOS) document to download TFA and to monitor for updates on version and platform support:
- Document 1513912.1 TFA Collector - Tool for Enhanced Diagnostic Gathering
For supported versions other than 18.104.22.168 and 22.214.171.124, Oracle recommends that TFA be downloaded from the My Oracle Support document and installed as part of a standard build. Oracle also recommends that the version distributed with 126.96.36.199 and 188.8.131.52 be upgraded to the latest version downloadable from My Oracle Support.
Note that TFA is also useful for non-RAC, single-instance database servers; it can be downloaded from the My Oracle Support document mentioned above and installed manually.
A simple command line interface (CLI) called tfactl is used to execute the commands supported by TFA. The JVMs accept commands only from tfactl. tfactl is used to monitor the status of the TFA configuration, to make changes to the configuration and, ultimately, to take collections.
TFA is a multi-threaded application and its resource footprint is very small. Usually the user would not even be aware that it is running. The only time TFA will consume any noticeable CPU is when doing an inventory or a collection and then only for brief periods.
Collections can be initiated from any host in the configuration. Here is an example that performs a cluster-wide collection of the last 4 hours for all supported components:
$ tfactl diagcollect
The initiating host’s JVM communicates the collection requirement to peer JVMs on other hosts if files are needed from other hosts in the configuration. All collections across the configuration are run concurrently and copied back to the TFA repository on the initiating node. When all collections are complete, the user can upload them from a single host to an Oracle Service Request.
The TFA repository can be configured either on a shared file system or locally. If a shared file system is used, each host's collection result is stored in a dedicated subdirectory as soon as that collection is complete. If the TFA repository is instead configured locally on each host, each collection result is copied to the initiating host's repository and then deleted from the remote host. In either case, the user can conveniently obtain the files needed for a Service Request from a single location, which is listed at the end of the output of each collection.
Managing TFA is simple. For example, adding nodes to or removing nodes from the TFA configuration after the initial install is straightforward and is only required when the configuration, such as the cluster topology, changes. New databases added to a configuration, on the other hand, are discovered automatically.
Under normal operating conditions, TFA spawns a thread to monitor the end of each alert log in the configuration (Database, ASM and Clusterware). TFA watches for certain events, such as node or instance evictions, and certain errors, such as ORA-00600 and ORA-07445. It stores metadata about these events in the BDB for future reference.
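The fan-out and copy-back described above can be modeled in a few lines of Python. This is a conceptual sketch only, not how TFA's JVMs are implemented: hosts are plain strings, `collect_on_host` is a hypothetical stand-in for a peer's collection, and the initiating host's repository is modeled as a dictionary.

```python
from concurrent.futures import ThreadPoolExecutor


def collect_on_host(host: str, since_hours: int) -> str:
    """Hypothetical stand-in for a peer's collection: returns the name of
    the single per-host archive it would produce."""
    return f"{host}_last{since_hours}h.zip"


def cluster_collection(initiator: str, hosts: list, since_hours: int = 4) -> dict:
    """Run every host's collection concurrently, then gather the results
    into the initiating host's repository (modeled as a dict)."""
    repository = {}
    with ThreadPoolExecutor(max_workers=max(1, len(hosts))) as pool:
        # Submit one collection per host; they all run at the same time.
        futures = {host: pool.submit(collect_on_host, host, since_hours) for host in hosts}
        for host, future in futures.items():
            # "Copy back" each finished result to the initiator's repository.
            repository[host] = future.result()
    print(f"Collections for {len(repository)} host(s) are available on {initiator}")
    return repository
```

Calling `cluster_collection("node1", ["node1", "node2", "node3"])` returns one archive name per host, all reachable from `node1`, which mirrors the single-location convenience described above.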
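Monitoring "the end of each alert log" can be pictured as scanning only the lines appended since the previous check. The sketch below is illustrative: ORA-00600 and ORA-07445 come from the text, while the eviction pattern is a hypothetical placeholder and not TFA's actual rule set.

```python
import re

# ORA-00600 and ORA-07445 are named in the text; the eviction pattern is a
# hypothetical placeholder, not TFA's real event definition.
EVENT_PATTERNS = {
    "ORA-00600": re.compile(r"\bORA-00600\b"),
    "ORA-07445": re.compile(r"\bORA-07445\b"),
    "eviction": re.compile(r"\bevict", re.IGNORECASE),
}


def scan_new_lines(lines, already_seen=0):
    """Scan only the lines appended after `already_seen` (the tail of the
    log) and return (line_number, event_name) pairs for each match."""
    events = []
    for number, line in enumerate(lines[already_seen:], start=already_seen + 1):
        for name, pattern in EVENT_PATTERNS.items():
            if pattern.search(line):
                events.append((number, name))
    return events
```

Remembering how many lines were already scanned is what keeps this cheap: each check only touches new output, consistent with TFA's small footprint.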
TFA runs an auto-discovery and a file inventory periodically to keep up to date on databases and diagnostic files. A file inventory is kept for every directory registered in the TFA configuration and metadata about those files is maintained in the BDB. Examples of the metadata kept are file name, first timestamp, last timestamp, file type, etc. Differing timestamp formats are also “normalized” as the format of timestamps can vary from one file type to the next.
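Timestamp normalization and the per-file metadata can be sketched as follows. The list of formats is an assumption chosen for illustration (real Oracle log formats vary by version and component), and the dictionary keys are hypothetical, not TFA's BDB schema.

```python
from datetime import datetime

# Illustrative formats only (assumption); real log formats vary.
KNOWN_FORMATS = [
    "%Y-%m-%d %H:%M:%S",     # e.g. 2015-09-14 10:22:31
    "%a %b %d %H:%M:%S %Y",  # e.g. Mon Sep 14 10:22:31 2015
    "%Y-%m-%dT%H:%M:%S",     # e.g. 2015-09-14T10:22:31
]


def normalize(timestamp: str) -> datetime:
    """Convert a timestamp in any known format to a single datetime type."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(timestamp, fmt)
        except ValueError:
            continue  # not this format; try the next one
    raise ValueError(f"unrecognised timestamp: {timestamp!r}")


def inventory_entry(name, file_type, timestamps):
    """Build a metadata record (hypothetical keys) with the first and last
    normalized timestamps found in a file."""
    normalized = [normalize(t) for t in timestamps]
    return {"file": name, "type": file_type, "first": min(normalized), "last": max(normalized)}
```

Once every file's first and last timestamps share one representation, files of different types can be compared against the same incident time.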
TFA availability by version and platform:

| Version | Platform | TFA availability |
|---|---|---|
| 10.2.0.4, 10.2.0.5, 184.108.40.206, 220.127.116.11, 18.104.22.168, 22.214.171.124, 126.96.36.199 | RAC | Download latest version from Document 1513912.1 |
| 188.8.131.52 | RAC | Installed by default in $GI_HOME/tfa, $GI_HOME/tfa/bin/tfactl (tfactl -h for options) |
| 184.108.40.206 | RAC | Installed by default in $GI_HOME/tfa, $GI_HOME/bin/tfactl (tfactl -h for options) |
| 220.127.116.11 JUL2014 PSU upwards | RAC | Installed by default in $GI_HOME/tfa, $GI_HOME/bin/tfactl (tfactl -h for options); analytic functionality added (tfactl -analyze -h for options) |
| 18.104.22.168, 22.214.171.124, 126.96.36.199, 188.8.131.52 | Exadata | Download latest version from Document 1513912.1 |
| 184.108.40.206 | Exadata | Database Server collection support only |
| 220.127.116.11.6 Bundle Patch (APR2014 PSU) | Exadata | Storage Server collection support added |
| 18.104.22.168.9 Bundle Patch (JUL2014 PSU) | Exadata | Analytic functionality added (tfactl -analyze -h for options) |
| 22.214.171.124 | Exadata | Installed by default in $GI_HOME/tfa, $GI_HOME/bin/tfactl (tfactl -h for options); Database and Storage Server collections and analytic functionality |
Support Tools Bundle
As of version 126.96.36.199.0, TFA includes tools commonly requested or recommended by Oracle Support. The tfactl command interface is used to manage these tools: users can start and stop them, check their status and deploy them in a standard way with tfactl commands, without having to download and install each one separately as in the past. Using the bundled tools from within TFA provides a standard command interface, standard deployment locations and collection of the tools' output by TFA.
NOTE: Oracle recommends that TFA Collector be kept at the latest version using downloads from My Oracle Support Document 1513912.1 which is updated frequently in order to make new features and other improvements available.
TFA collects only the files that are relevant to the time of the problem. All the user needs to know is the approximate time of the problem, which can be supplied with a time modifier on the command line. TFA then gathers all the first failure diagnostic files that Oracle Support needs to triage the problem. The simplest and most thorough form of collection uses the default time period, the last 4 hours:
Example #1: $ tfactl diagcollect
Alternatively, if the time is known more precisely, for example a problem that occurred within the last hour:
Example #2: $ tfactl diagcollect -since 1h
The command shown in Example #1 collects all the relevant diagnostic data for an issue that occurred within the last four hours. Because no further restriction, such as a specific node or component, is specified, it performs a configuration-wide collection of all OS, Clusterware, ASM, Database, Cluster Health Monitor and OS Watcher files relevant to that time. In addition, TFA collects other relevant files and configuration data, such as patch inventories. Similarly, the command shown in Example #2 collects all files modified within the last hour, most likely resulting in a smaller, more targeted collection.
tfactl provides a series of command line options to limit the collection to a subset of hosts or components. However, if the root cause is completely unknown, it may be best to collect as much data as possible for the relevant time so that nothing is missed. Restricting hosts or components does not shrink a collection as effectively as being precise about the time of the incident: the more precisely the time of the problem can be specified, the smaller the collection will be. Likewise, the more specifically a problem can be narrowed down, the more targeted the collection can be in terms of nodes and components.
TFA's built-in pruning of larger files, which skips trace and log data outside the specified time, makes both data collection and analysis more efficient. Again, the more precisely the time can be specified, the more the files can be pruned. The design goal of TFA is to make collections as complete, as relevant and as small as possible, saving time in collecting, copying, uploading and transferring files for both the customer and Oracle Support. It follows a simple yet very efficient principle: a file that was not modified around the time of the incident is unlikely to be relevant.
When a diagnostic collection is taken, the first step, performed automatically, is a "pre-collection" inventory that picks up any files modified or created since the last periodic inventory. A pre-collection inventory covers only the files for the specified databases and/or components.
TFA can be configured to take collections automatically when registered events occur and to store them in the repository. Examples of registered events are node evictions and ORA-00600 errors, among others documented in the TFA User Guide. A built-in flood control mechanism for the auto-collection feature prevents a rash of duplicate errors or events from triggering multiple duplicate collections. Should the repository reach its configurable maximum size, TFA suspends taking collections until space is cleared in the repository using the purge command.
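The time-based selection can be sketched as an overlap test between each file's first/last timestamps and the requested window. This is an illustration only: it assumes an inventory of dictionaries with hypothetical `file`, `first` and `last` keys and handles only the `h` and `d` duration units that appear in this article's examples.

```python
from datetime import datetime, timedelta


def parse_since(value: str) -> timedelta:
    """Parse durations such as '4h' or '30d' (only h and d are handled)."""
    units = {"h": "hours", "d": "days"}
    amount, unit = int(value[:-1]), value[-1]
    if unit not in units:
        raise ValueError(f"unsupported unit in {value!r}")
    return timedelta(**{units[unit]: amount})


def select_files(inventory, since="4h", now=None):
    """Return the files whose [first, last] timestamp range overlaps the
    window (now - since, now]."""
    now = now or datetime.now()
    start = now - parse_since(since)
    return [f["file"] for f in inventory if f["last"] >= start and f["first"] <= now]
```

With the same inventory, a one-hour window never selects more files than the four-hour default, which is why a precise time keeps collections small.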
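Pruning inside a file can be sketched as keeping only the lines stamped within the window. This is an assumption-laden illustration, not TFA's algorithm: `iso_prefix` is a hypothetical parser for lines that begin with a `YYYY-MM-DD HH:MM:SS` stamp, and unstamped continuation lines inherit the last timestamp seen.

```python
from datetime import datetime


def prune(lines, start, end, parse_ts):
    """Keep only the lines whose (possibly inherited) timestamp lies in
    [start, end]. Lines without their own timestamp, such as call-stack
    continuation lines, stay with the entry they belong to."""
    kept, current = [], None
    for line in lines:
        ts = parse_ts(line)
        if ts is not None:
            current = ts
        if current is not None and start <= current <= end:
            kept.append(line)
    return kept


def iso_prefix(line):
    """Hypothetical parser for lines starting with 'YYYY-MM-DD HH:MM:SS'."""
    try:
        return datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None
```

Inheriting the previous timestamp is a design choice that keeps multi-line entries intact, so a pruned trace still reads coherently.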
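The scope of a pre-collection inventory can be sketched as a filter on component and modification time. The `(path, component, mtime)` tuples are a hypothetical stand-in for TFA's real file metadata.

```python
from datetime import datetime


def pre_collection_inventory(files, last_inventory, components):
    """Return paths of files that belong to the requested components and
    were created or modified after the last periodic inventory.

    `files` is a list of (path, component, mtime) tuples, a hypothetical
    stand-in for TFA's real metadata.
    """
    return [
        path
        for path, component, mtime in files
        if component in components and mtime > last_inventory
    ]
```

Limiting this step to the requested components is what keeps the pre-collection inventory quick.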
Example #3: # tfactl purge -older 30d
In Example #3, all collections older than 30 days are deleted, freeing space for new collections. Note that the purge command must be run as root or via sudo.
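Flood control, the repository size limit and age-based purging can be combined in a toy model. It is a sketch under stated assumptions, not TFA's implementation: `flood_window` and `max_bytes` are hypothetical settings, and collection sizes are simple integers.

```python
from datetime import datetime, timedelta


class AutoCollectionRepository:
    """Toy model of event-driven auto-collection (not TFA's implementation).

    flood_window and max_bytes are hypothetical settings for illustration.
    """

    def __init__(self, flood_window=timedelta(minutes=30), max_bytes=300):
        self.flood_window = flood_window
        self.max_bytes = max_bytes
        self.collections = {}  # collection name -> (time taken, size in bytes)
        self.last_seen = {}    # event name -> time of its last collection

    @property
    def used_bytes(self):
        return sum(size for _, size in self.collections.values())

    def on_event(self, event, when, size):
        """Take a collection unless flood control or a full repository stops it."""
        last = self.last_seen.get(event)
        if last is not None and when - last < self.flood_window:
            return None  # duplicate event inside the flood window: suppressed
        if self.used_bytes >= self.max_bytes:
            return None  # repository full: collections suspended until a purge
        name = f"{event}_{when:%Y%m%d%H%M}"
        self.collections[name] = (when, size)
        self.last_seen[event] = when
        return name

    def purge(self, older_than_days, now):
        """Delete collections older than the given age, similar in spirit
        to 'tfactl purge -older 30d'; return the names removed."""
        cutoff = now - timedelta(days=older_than_days)
        purged = sorted(n for n, (t, _) in self.collections.items() if t < cutoff)
        for name in purged:
            del self.collections[name]
        return purged
```

Once the repository is full, new events are ignored until `purge` frees space, matching the suspend-until-purge behavior described above.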
TFA has built-in analytics enabling the user to summarize the results from Database, ASM and Clusterware alert logs, system message files and OS Watcher performance metrics across the configuration. Errors, warnings and search patterns can be summarized for any given time period for the entire configuration without needing to access the files on each node of the configuration. In a similar way, OS Watcher data can also be summarized for the entire configuration for any given time period.
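A configuration-wide error summary can be sketched by counting ORA- codes per host and in total. The input shape, host names mapped to alert-log lines already limited to the period of interest, is a hypothetical simplification of what TFA's analytics work on.

```python
import re
from collections import Counter

ORA_ERROR = re.compile(r"\bORA-\d{5}\b")


def summarize(logs_by_host):
    """Count ORA- error codes per host and across the whole configuration.

    logs_by_host maps a host name to alert-log lines already restricted to
    the time period of interest (hypothetical input shape).
    """
    per_host = {
        host: Counter(code for line in lines for code in ORA_ERROR.findall(line))
        for host, lines in logs_by_host.items()
    }
    total = sum(per_host.values(), Counter())
    return per_host, total
```

Keeping both the per-host counts and the total mirrors the idea of summarizing the entire configuration without opening files on each node.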
For further information, please refer to:
- Document 1513912.1: TFA Collector - Tool for Enhanced Diagnostic Gathering
For any help and support with TFA, please come and chat to us via the RAC Community.